
[RFT] x86/pat: Fix set_mce_nospec() for pmem

Message ID 162561960776.1149519.9267511644788011712.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)

Commit Message

Dan Williams July 7, 2021, 1:01 a.m. UTC
When poison is discovered and triggers memory_failure() the physical
page is unmapped from all process address space. However, it is not
unmapped from kernel address space. Unlike a typical memory page that
can be retired from use in the page allocator and marked 'not present',
pmem needs to remain accessible given it can not be physically remapped
or retired. set_memory_uc() tries to maintain consistent nominal memtype
mappings for a given pfn, but memory_failure() is an exceptional
condition.

For the same reason that set_memory_np() bypasses memtype checks
because they do not apply in the memory failure case, memtype validation
is not applicable for marking the pmem pfn uncacheable. Use
_set_memory_uc().

Reported-by: Jane Chu <jane.chu@oracle.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
Cc: Luis Chamberlain <mcgrof@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Jane, can you give this a try and see if it cleans up the error you are
seeing?

Thanks for the help.

 arch/x86/include/asm/set_memory.h |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Dan Williams Aug. 26, 2021, 7:08 p.m. UTC | #1
On Tue, Jul 6, 2021 at 6:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> When poison is discovered and triggers memory_failure() the physical
> page is unmapped from all process address space. However, it is not
> unmapped from kernel address space. Unlike a typical memory page that
> can be retired from use in the page allocator and marked 'not present',
> pmem needs to remain accessible given it can not be physically remapped
> or retired. set_memory_uc() tries to maintain consistent nominal memtype
> mappings for a given pfn, but memory_failure() is an exceptional
> condition.
>
> For the same reason that set_memory_np() bypasses memtype checks
> because they do not apply in the memory failure case, memtype validation
> is not applicable for marking the pmem pfn uncacheable. Use
> _set_memory_uc().
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
> Cc: Luis Chamberlain <mcgrof@suse.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Jane, can you give this a try and see if it cleans up the error you are
> seeing?
>
> Thanks for the help.

Jane, does this resolve the failure you reported [1]?

[1]: https://lore.kernel.org/r/327f9156-9b28-d20e-2850-21c125ece8c7@oracle.com

>
>  arch/x86/include/asm/set_memory.h |    9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 43fa081a1adb..0bf2274c5186 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -114,8 +114,13 @@ static inline int set_mce_nospec(unsigned long pfn, bool unmap)
>
>         if (unmap)
>                 rc = set_memory_np(decoy_addr, 1);
> -       else
> -               rc = set_memory_uc(decoy_addr, 1);
> +       else {
> +               /*
> +                * Bypass memtype checks since memory-failure has shot
> +                * down mappings.
> +                */
> +               rc = _set_memory_uc(decoy_addr, 1);
> +       }
>         if (rc)
>                 pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
>         return rc;
>
Jane Chu Aug. 27, 2021, 7:12 a.m. UTC | #2
Hi, Dan,

On 8/26/2021 12:08 PM, Dan Williams wrote:
> On Tue, Jul 6, 2021 at 6:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>> When poison is discovered and triggers memory_failure() the physical
>> page is unmapped from all process address space. However, it is not
>> unmapped from kernel address space. Unlike a typical memory page that
>> can be retired from use in the page allocator and marked 'not present',
>> pmem needs to remain accessible given it can not be physically remapped
>> or retired. set_memory_uc() tries to maintain consistent nominal memtype
>> mappings for a given pfn, but memory_failure() is an exceptional
>> condition.
>>
>> For the same reason that set_memory_np() bypasses memtype checks
>> because they do not apply in the memory failure case, memtype validation
>> is not applicable for marking the pmem pfn uncacheable. Use
>> _set_memory_uc().
>>
>> Reported-by: Jane Chu <jane.chu@oracle.com>
>> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
>> Cc: Luis Chamberlain <mcgrof@suse.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Jane, can you give this a try and see if it cleans up the error you are
>> seeing?
>>
>> Thanks for the help.
> Jane, does this resolve the failure you reported [1]?
> 
> [1]:https://lore.kernel.org/r/327f9156-9b28-d20e-2850-21c125ece8c7@oracle.com
> 

Sorry for taking so long.  With the patch applied, the dmesg is displaying
[ 2111.282759] Memory failure: 0x1850600: recovery action for dax page: Recovered
[ 2112.415412] x86/PAT: fsdax_poison_v1:3214 freeing invalid memtype [mem 0x1850600000-0x1850600fff]

instead of the problematic

[10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000 uncached-minus<->write-back

Please feel free to add Tested-by: Jane Chu <jane.chu@oracle.com>

Thanks for the fix!

-jane
Borislav Petkov Sept. 13, 2021, 10:29 a.m. UTC | #3
On Tue, Jul 06, 2021 at 06:01:05PM -0700, Dan Williams wrote:
> When poison is discovered and triggers memory_failure() the physical
> page is unmapped from all process address space. However, it is not
> unmapped from kernel address space. Unlike a typical memory page that
> can be retired from use in the page allocator and marked 'not present',
> pmem needs to remain accessible given it can not be physically remapped
> or retired.

I'm surely missing something obvious but why does it need to remain
accessible? Spell it out please.

> set_memory_uc() tries to maintain consistent nominal memtype
> mappings for a given pfn, but memory_failure() is an exceptional
> condition.

That's not clear to me either. So looking at the failure:

[10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back

set_memory_uc() marked it UC- but something? wants it to be WB. Why?

I guess I need some more info on the whole memory offlining for pmem and
why that should be done differently than with normal memory.

Thx.
Dan Williams Sept. 14, 2021, 6:08 p.m. UTC | #4
On Mon, Sep 13, 2021 at 3:29 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 06, 2021 at 06:01:05PM -0700, Dan Williams wrote:
> > When poison is discovered and triggers memory_failure() the physical
> > page is unmapped from all process address space. However, it is not
> > unmapped from kernel address space. Unlike a typical memory page that
> > can be retired from use in the page allocator and marked 'not present',
> > pmem needs to remain accessible given it can not be physically remapped
> > or retired.
>
> I'm surely missing something obvious but why does it need to remain
> accessible? Spell it out please.

Sure, I should probably include the following note in all patches
touching the DAX-memory_failure() path, because it is a frequently
asked question. The tl;dr is:

Typical memory_failure() does not assume the physical page can be
recovered and put back into circulation, PMEM memory_failure() allows
for recovery of the page.

The longer description is:
Typical memory_failure() for anonymous or page-cache pages has the
flexibility to invalidate bad pages and trigger any users to request a
new page from the page allocator to replace the quarantined one. DAX
removes that flexibility. The page is a handle for a fixed storage
location, i.e. no mechanism to remap a physical page to a different
logical address. Software expects to be able to repair an error in
PMEM by reading around the poisoned cache lines and writing zeros,
fallocate(...FALLOC_FL_PUNCH_HOLE...), to overwrite poison. The page
needs to remain accessible to enable recovery.

>
> > set_memory_uc() tries to maintain consistent nominal memtype
> > mappings for a given pfn, but memory_failure() is an exceptional
> > condition.
>
> That's not clear to me too. So looking at the failure:
>
> [10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back
>
> set_memory_uc() marked it UC- but something? wants it to be WB. Why?

PMEM is mapped WB at the beginning of time for nominal operation.
track_pfn_remap() records that driver setting and forwards it to any
track_pfn_insert() of the same pfn, i.e. this is how DAX mappings
inherit the WB cache mode. memory_failure() wants to arrange avoidance
of speculative consumption of poison; set_memory_uc() checks against the
track_pfn_remap() setting, but we know this is an exceptional
condition and it is ok to force the pfn UC against the typical memtype
expectation.

> I guess I need some more info on the whole memory offlining for pmem and
> why that should be done differently than with normal memory.

Short answer, PMEM never goes "offline" because it was never "online"
in the first place. Where "online" in this context is specifically
referring to pfns that are under the watchful eye of the core-mm page
allocator.
Borislav Petkov Sept. 15, 2021, 10:41 a.m. UTC | #5
On Tue, Sep 14, 2021 at 11:08:00AM -0700, Dan Williams wrote:
> Sure, I should probably include this following note in all patches
> touching the DAX-memory_failure() path, because it is a frequently
> asked question. The tl;dr is:
> 
> Typical memory_failure() does not assume the physical page can be
> recovered and put back into circulation, PMEM memory_failure() allows
> for recovery of the page.

Hmm, I think by "PMEM memory_failure()" you mean what pmem_do_write()
does with that poison clearing or?

But then I have no clue what the "DAX-memory_failure()" path is.

> The longer description is:
> Typical memory_failure() for anonymous, or page-cache pages, has the
> flexibility to invalidate bad pages and trigger any users to request a
> new page from the page allocator to replace the quarantined one. DAX
> removes that flexibility. The page is a handle for a fixed storage
> location, i.e. no mechanism to remap a physical page to a different
> logical address. Software expects to be able to repair an error in
> PMEM by reading around the poisoned cache lines and writing zeros,
> fallocate(...FALLOC_FL_PUNCH_HOLE...), to overwrite poison. The page
> needs to remain accessible to enable recovery.

Aha, so memory failure there is not offlining the 4K page but simply
zeroing the poison. What happens if the same physical location triggers
more read errors, i.e., the underlying hw is going bad? Don't you want
to exclude that location from ever being written again?

Or maybe such recovery action doesn't make sense for pmem?

> Short answer, PMEM never goes "offline" because it was never "online"
> in the first place. Where "online" in this context is specifically
> referring to pfns that are under the watchful eye of the core-mm page
> allocator.

Ok, so pmem wants to be handled differently during memory failure.
Looking at the patch again, you change the !unmap case to do
_set_memory_uc().

That bool unmap thing gets *not* set in whole_page():

	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;

so I'm guessing that "Recoverable Address LSB (bits 5:0): The lowest
valid recoverable address bit." thing is < 12 for pmem.

But are you really saying that the hardware would never report a lower
than 12 value for normal memory?

If it does, then that is wrong here.

In any case, I'd prefer if the code would do an explicit check somewhere
saying "is this pmem" in order to do that special action only for when
it really is pmem and not rely on it implicitly.

Thx.
Dan Williams Sept. 16, 2021, 8:33 p.m. UTC | #6
On Wed, Sep 15, 2021 at 3:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Sep 14, 2021 at 11:08:00AM -0700, Dan Williams wrote:
> > Sure, I should probably include this following note in all patches
> > touching the DAX-memory_failure() path, because it is a frequently
> > asked question. The tl;dr is:
> >
> > Typical memory_failure() does not assume the physical page can be
> > recovered and put back into circulation, PMEM memory_failure() allows
> > for recovery of the page.
>
> Hmm, I think by "PMEM memory_failure()" you mean what pmem_do_write()
> does with that poison clearing or?

I am specifically talking about the memory_failure_dev_pagemap() path
taken from memory_failure().

> But then I have no clue what the "DAX-memory_failure()" path is.

Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
refer to the exact routine in question: memory_failure_dev_pagemap().
That path is taken when memory_failure() sees that a pfn was mapped by
memremap_pages(). It determines that the pfn does not represent a
dynamic page allocator page and may refer to a static page of storage
from a device like /dev/pmem, or /dev/dax.

>
> > The longer description is:
> > Typical memory_failure() for anonymous, or page-cache pages, has the
> > flexibility to invalidate bad pages and trigger any users to request a
> > new page from the page allocator to replace the quarantined one. DAX
> > removes that flexibility. The page is a handle for a fixed storage
> > location, i.e. no mechanism to remap a physical page to a different
> > logical address. Software expects to be able to repair an error in
> > PMEM by reading around the poisoned cache lines and writing zeros,
> > fallocate(...FALLOC_FL_PUNCH_HOLE...), to overwrite poison. The page
> > needs to remain accessible to enable recovery.
>
> Aha, so memory failure there is not offlining the 4K page but simply
> zeroing the poison.

The zeroing of the poison comes optionally later if userspace
explicitly asks for it to be cleared.

> What happens if the same physical location triggers
> more read errors, i.e., the underlying hw is going bad? Don't you want
> to exclude that location from ever being written again?

Yes. The device driver avoids reconsumption of the same error by
recording the error in a "badblocks" structure (block/badblocks.c).
The driver consults its badblocks instance on every subsequent access
and preemptively returns EIO. The driver registers for machine-check
notifications and translates those events into badblocks entries. So,
repeat consumption is avoided unless/until the badblocks entry can be
cleared along with the poison itself.

I'll note that going forward the filesystem wants to be more involved
in tracking and managing these errors so the driver will grow the
ability to forward those notifications up the stack.

> Or maybe such recovery action doesn't make sense for pmem?

It does. In addition to machine-check synchronous notification of
poison, there are asynchronous scanning mechanisms that prepopulate
the badblocks entries to get ahead of consumption events.

> > Short answer, PMEM never goes "offline" because it was never "online"
> > in the first place. Where "online" in this context is specifically
> > referring to pfns that are under the watchful eye of the core-mm page
> > allocator.
>
> Ok, so pmem wants to be handled differently during memory failure.
> Looking at the patch again, you change the !unmap case to do
> _set_memory_uc().
>
> That bool unmap thing gets *not* set in whole_page():
>
>         return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
>
> so I'm guessing that "Recoverable Address LSB (bits 5:0): The lowest
> valid recoverable address bit." thing is < 12 for pmem.
>
> But are you really saying that the hardware would never report a lower
> than 12 value for normal memory?

Ugh, no, I failed to update my mental model of set_mce_nospec() usage
after commit 17fae1294ad9 "x86/{mce,mm}: Unmap the entire page if the
whole page is affected and poisoned". I was just looking for "where
does set_mce_nospec() do the UC vs NP decision", and did not read the
new calling convention... more below.

> If it does, then that is wrong here.
>
> In any case, I'd prefer if the code would do an explicit check somewhere
> saying "is this pmem" in order to do that special action only for when
> it really is pmem and not rely on it implicitly.

After thinking this through a bit I don't think we want that, but I
did find a bug, and this patch needs a rewritten changelog to say why
pmem does not need to be explicitly called out in this path. The
reasoning is that set_mce_nospec() was introduced as a way to achieve
the goal of preventing speculative consumption of poison *and* allow
for reads around poisoned cachelines for PMEM recovery. UC achieves
that goal for both PMEM and DRAM, unlike what this changelog alluded
to which was NP for DRAM and UC for PMEM. The new consideration of
whole_page() errors means that the code in nfit_handle_mce() is wrong
to assume that the length of the error is just 1 cacheline. When that
bug is fixed the badblocks range will cover the entire page in
whole_page() notification cases. Because the entire page is dead there
is no need for UC because all the cachelines are gone, nothing to
read-around in this page. In short DRAM and PMEM want to share the
exact same policy here and use NP for whole_page() and UC for not.
Just the small matter of ignoring the memtype by using
_set_memory_uc().

Thanks for triggering that deeper look.
Borislav Petkov Sept. 17, 2021, 11:30 a.m. UTC | #7
On Thu, Sep 16, 2021 at 01:33:42PM -0700, Dan Williams wrote:
> I am specifically talking about the memory_failure_dev_pagemap() path
> taken from memory_failure().
> 
> > But then I have no clue what the "DAX-memory_failure()" path is.
> 
> Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
> refer to the exact routine in question: memory_failure_dev_pagemap().
> That path is taken when memory_failure() sees that a pfn was mapped by
> memremap_pages(). It determines that the pfn does not represent a
> dynamic page allocator page and may refer to a static page of storage
> from a device like /dev/pmem, or /dev/dax.

Aaaha, that's that innocent-looking pfn_to_online_page() call there in
memory_failure() which would return NULL for pmem/dax pages.

> The zeroing of the poison comes optionally later if userspace
> explicitly asks for it to be cleared.

I see.

> Yes. The device driver avoids reconsumption of the same error by
> recording the error in a "badblocks" structure (block/badblocks.c).
> The driver consults its badblocks instance on every subsequent access
> and preemptively returns EIO. The driver registers for machine-check
> notifications and translates those events into badblocks entries. So,
> repeat consumption is avoided unless/until the badblocks entry can be
> cleared along with the poison itself.

Sounds good.

> I'll note that going forward the filesystem wants to be more involved
> in tracking and managing these errors so the driver will grow the
> ability to forward those notifications up the stack.

... which would save the ask-the-driver each time thing. Yap.

> It does. In addition to machine-check synchronous notification of
> poison, there are asynchronous scanning mechanisms that prepopulate
> the badblocks entries to get ahead of consumption events.

Probably something similar to patrol scrubbing on normal DRAM.

> ... Because the entire page is dead there is no need for UC because all
> the cachelines are gone, nothing to read-around in this page. In short
> DRAM and PMEM want to share the exact same policy here and use NP for
> whole_page() and UC for not. Just the small matter of ignoring the
> memtype by using _set_memory_uc().

Hmm, so looking at this more:

        else
                rc = set_memory_uc(decoy_addr, 1);

->  memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,

->  memtype_reserve(__pa(addr), __pa(addr) + 1,

Looking at

  17fae1294ad9 ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")

it says

    Fix is to check the scope of the poison by checking the MCi_MISC register.
    If the entire page is affected, then unmap the page. If only part of the
    page is affected, then mark the page as uncacheable.

so I guess for normal DRAM we still want to map the page uncacheable so
that other parts except that cacheline can still be read.

And PMEM should be excluded from that treatise here because it needs to
be WB, as you said earlier.

Right?
Dan Williams Sept. 21, 2021, 2:04 a.m. UTC | #8
On Fri, Sep 17, 2021 at 4:30 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 16, 2021 at 01:33:42PM -0700, Dan Williams wrote:
> > I am specifically talking about the memory_failure_dev_pagemap() path
> > taken from memory_failure().
> >
> > > But then I have no clue what the "DAX-memory_failure()" path is.
> >
> > Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
> > refer to the exact routine in question: memory_failure_dev_pagemap().
> > That path is taken when memory_failure() sees that a pfn was mapped by
> > memremap_pages(). It determines that the pfn does not represent a
> > dynamic page allocator page and may refer to a static page of storage
> > from a device like /dev/pmem, or /dev/dax.
>
> Aaaha, that's that innocent-looking pfn_to_online_page() call there in
> memory_failure() which would return NULL for pmem/dax pages.

Exactly.

>
> > The zeroing of the poison comes optionally later if userspace
> > explicitly asks for it to be cleared.
>
> I see.
>
> > Yes. The device driver avoids reconsumption of the same error by
> > recording the error in a "badblocks" structure (block/badblocks.c).
> > The driver consults its badblocks instance on every subsequent access
> > and preemptively returns EIO. The driver registers for machine-check
> > notifications and translates those events into badblocks entries. So,
> > repeat consumption is avoided unless/until the badblocks entry can be
> > cleared along with the poison itself.
>
> Sounds good.
>
> > I'll note that going forward the filesystem wants to be more involved
> > in tracking and managing these errors so the driver will grow the
> > ability to forward those notifications up the stack.
>
> ... which would save the ask-the-driver each time thing. Yap.
>
> > It does. In addition to machine-check synchronous notification of
> > poison, there are asynchronous scanning mechanisms that prepopulate
> > the badblocks entries to get ahead of consumption events.
>
> Probably something similar to patrol scrubbing on normal DRAM.

Yes, although I believe that DRAM patrol scrubbing is being done from
the host memory controller, these PMEM DIMMs have firmware and DMA
engines *in the DIMM* to do this scrub work.

> > ... Because the entire page is dead there is no need for UC because all
> > the cachelines are gone, nothing to read-around in this page. In short
> > DRAM and PMEM want to share the exact same policy here and use NP for
> > whole_page() and UC for not. Just the small matter of ignoring the
> > memtype by using _set_memory_uc().
>
> Hmm, so looking at this more:
>
>         else
>                 rc = set_memory_uc(decoy_addr, 1);
>
> ->  memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
>
> ->  memtype_reserve(__pa(addr), __pa(addr) + 1,
>
> Looking at
>
>   17fae1294ad9 ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
>
> it says
>
>     Fix is to check the scope of the poison by checking the MCi_MISC register.
>     If the entire page is affected, then unmap the page. If only part of the
>     page is affected, then mark the page as uncacheable.
>
> so I guess for normal DRAM we still want to map the page uncacheable so
> that other parts except that cacheline can still be read.

Perhaps, but I don't know how you do that if memory_failure() has
"offlined" the DRAM page, in the case of PMEM you can issue a
byte-aligned direct-I/O access to the exact storage locations around
the poisoned cachelines.

> And PMEM should be excluded from that treatise here because it needs to
> be WB, as you said earlier.
>
> Right?

PMEM can still go NP if the entire page is failed, so no need to
exclude PMEM from the treatise because the driver's badblocks
implementation will cover the NP page, and the driver can use
clear_mce_nospec() to recover the WB mapping / access after the poison
has been cleared.
Borislav Petkov Sept. 30, 2021, 5:19 p.m. UTC | #9
On Mon, Sep 20, 2021 at 07:04:50PM -0700, Dan Williams wrote:
> Yes, although I believe that DRAM patrol scrubbing is being done from
> the host memory controller, these PMEM DIMMs have firmware and DMA
> engines *in the DIMM* to do this scrub work.

Oh great. Lemme guess, they even have small OSes inside ... <eyeroll>

> Perhaps, but I don't know how you do that if memory_failure() has
> "offlined" the DRAM page, in the case of PMEM you can issue a
> byte-aligned direct-I/O access to the exact storage locations around
> the poisoned cachelines.

Well, looking at the exactly two call sites of set_mce_nospec(), it
does:

	if (!memory_failure(..))
		set_mce_nospec(pfn, whole_page...);

so IINM, if that thing returns 0, then we have hwpoisoned the page. And
if that is the case, then why are we even diddling with reads around the
poisoned cacheline when the whole page has been poisoned and we can't
access it anymore?

Or am I missing something?

Because the comment over set_mce_nospec() says

"or marking it uncacheable (if we want to try to retrieve data from
non-poisoned lines in the page)."

Question is, can we even access a hwpoisoned page to retrieve that data
or are we completely in the wrong weeds here? Tony?

> PMEM can still go NP if the entire page is failed, so no need to
> exclude PMEM from the treatise because the driver's badblocks
> implementation will cover the NP page, and the driver can use
> clear_mce_nospec() to recover the WB mapping / access after the poison
> has been cleared.

Now I'm all confused again. ;-(
Tony Luck Sept. 30, 2021, 5:28 p.m. UTC | #10
> Question is, can we even access a hwpoisoned page to retrieve that data
> or are we completely in the wrong weeds here? Tony?

Hardware scope for poison is a cache line (64 bytes for DDR, may be larger
for the internals of 3D-Xpoint memory).

So you can access the other lines in the page. You just need to be careful
not to access the poison. Setting the cache mode for the page to uncacheable
ensures that h/w prefetchers and speculative access don't accidently touch
the poison while accessing nearby cache lines.

-Tony
Jane Chu Sept. 30, 2021, 6:15 p.m. UTC | #11
Hi, Dan,

On 9/16/2021 1:33 PM, Dan Williams wrote:
> The new consideration of
> whole_page() errors means that the code in nfit_handle_mce() is wrong
> to assume that the length of the error is just 1 cacheline. When that
> bug is fixed the badblocks range will cover the entire page in
> whole_page() notification cases.

Does this explain what I saw a while ago: inject two poisons
to the same block, and pwrite clears one poison at a time?

thanks!
-jane
Dan Williams Sept. 30, 2021, 7:11 p.m. UTC | #12
On Thu, Sep 30, 2021 at 11:15 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> Hi, Dan,
>
> On 9/16/2021 1:33 PM, Dan Williams wrote:
> > The new consideration of
> > whole_page() errors means that the code in nfit_handle_mce() is wrong
> > to assume that the length of the error is just 1 cacheline. When that
> > bug is fixed the badblocks range will cover the entire page in
> > whole_page() notification cases.
>
> Does this explain what I saw a while ago: inject two poisons
> to the same block, and pwrite clears one poison at a time?

Potentially, yes. If you injected poison over 512 bytes then it could
take 2 pwrites() of 256-bytes each to clear that poison. In the DAX
case nothing is enforcing sector aligned I/O for pwrite().
Borislav Petkov Sept. 30, 2021, 7:30 p.m. UTC | #13
On Thu, Sep 30, 2021 at 05:28:12PM +0000, Luck, Tony wrote:
> > Question is, can we even access a hwpoisoned page to retrieve that data
> > or are we completely in the wrong weeds here? Tony?
> 
> Hardware scope for poison is a cache line (64 bytes for DDR, may be larger
> for the internals of 3D-Xpoint memory).

I don't mean from the hw aspect but from the OS one: my simple thinking
is, *if* a page is marked as HW poison, any further mapping or accessing
of the page frame is prevented by the mm code.

So you can't access *any* bits there so why do we even bother with whole
or not whole page? Page is gone...
Dan Williams Sept. 30, 2021, 7:41 p.m. UTC | #14
On Thu, Sep 30, 2021 at 12:30 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 05:28:12PM +0000, Luck, Tony wrote:
> > > Question is, can we even access a hwpoisoned page to retrieve that data
> > > or are we completely in the wrong weeds here? Tony?
> >
> > Hardware scope for poison is a cache line (64 bytes for DDR, may be larger
> > for the internals of 3D-Xpoint memory).
>
> I don't mean from the hw aspect but from the OS one: my simple thinking
> is, *if* a page is marked as HW poison, any further mapping or accessing
> of the page frame is prevented by the mm code.
>
> So you can't access *any* bits there so why do we even bother with whole
> or not whole page? Page is gone...

I think the disconnect is that in the
typical-memory-from-the-page-allocator case it's ok to throw away the
whole page and get another one, they are interchangeable. In the PMEM
case they are not, they are fixed physical offsets known to things
like filesystem-metadata. So PageHWPoison in this latter case is just
a flag to indicate that poison mitigations are in effect, page marked
UC or NP, and the owner of that page, PMEM driver, knows how to
navigate around that poison to maximize data recovery flows. The owner
of the page in the latter / typical case, page allocator, says "bah,
I'll just throw the whole thing away because there's no chance at
repair and consumers can just as well use a different pfn for the same
role."
Tony Luck Sept. 30, 2021, 7:44 p.m. UTC | #15
>I don't mean from the hw aspect but from the OS one: my simple thinking
> is, *if* a page is marked as HW poison, any further mapping or accessing
> of the page frame is prevented by the mm code.
>
> So you can't access *any* bits there so why do we even bother with whole
> or not whole page? Page is gone...

See the comment above set_mce_nospec() ...

/*
 * Prevent speculative access to the page by either unmapping
 * it (if we do not require access to any part of the page) or
 * marking it uncacheable (if we want to try to retrieve data
 * from non-poisoned lines in the page).
 */
static inline int set_mce_nospec(unsigned long pfn, bool unmap)


It's a choice as to whether the whole page is gone or not. The history for
this is using pmem as storage. The filesystem block size may be less than
the page size. An error in a "block" should only result in that block disappearing
from the file, not the surrounding 4k.

Of course that only works for users of read(2)/write(2) to access the file.
They can't mmap(2) and just get the good bits.

-Tony
Borislav Petkov Sept. 30, 2021, 8:01 p.m. UTC | #16
On Thu, Sep 30, 2021 at 07:44:48PM +0000, Luck, Tony wrote:
> See the comment above set_mce_nospec() ...
> 
> /*
>  * Prevent speculative access to the page by either unmapping
>  * it (if we do not require access to any part of the page) or
>  * marking it uncacheable (if we want to try to retrieve data
>  * from non-poisoned lines in the page).
>  */
> static inline int set_mce_nospec(unsigned long pfn, bool unmap)

I've seen that comment - I've quoted it upthread...

> It's a choice as to whether the whole page is gone or not. The history for
> this is using pmem as storage. The filesystem block size may be less than
> the page size. An error in a "block" should only result in that block disappearing
> from the file, not the surrounding 4k.

So let me cut to the chase:

        if (!memory_failure(..))
                set_mce_nospec(pfn, whole_page...);

when memory_failure() returns 0, is a whole page marked as hwpoison or
not?

Because I see there close to the top of the function:

	if (TestSetPageHWPoison(p)) {
		...

after this, that whole page is hwpoison I'd say. Not a cacheline but the
whole thing.
Tony Luck Sept. 30, 2021, 8:15 p.m. UTC | #17
> So let me cut to the chase:
>
>        if (!memory_failure(..))
>                set_mce_nospec(pfn, whole_page...);
>
> when memory_failure() returns 0, is a whole page marked as hwpoison or
> not?

Depends on what the whole_page() helper said.

static bool whole_page(struct mce *m)
{
        if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
                return true;

        return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
}

> Because I see there close to the top of the function:
>
>	if (TestSetPageHWPoison(p)) {
>		...
>
> after this, that whole page is hwpoison I'd say. Not a cacheline but the
> whole thing.

That may now be a confusing name for the page flag bit. Until the
pmem/storage use case we just simply lost the whole page (back
then set_mce_nospec() didn't take an argument, it just marked the
whole page as "not present" in the kernel 1:1 map).

So the meaning of HWPoison has subtly changed from "this whole
page is poisoned" to "there is some poison in some/all[1] of this page"

-Tony

[1] even in the "all" case the poison is likely just in one cache line, but
the MCI_MISC_ADDR_LSB indication in the machine check bank said
the scope was the whole page. This happened on some older systems
for page scrub errors where the memory controller wasn't smart enough
to translate the channel address back to a cache granular system address.
Borislav Petkov Sept. 30, 2021, 8:32 p.m. UTC | #18
On Thu, Sep 30, 2021 at 08:15:51PM +0000, Luck, Tony wrote:
> That may now be a confusing name for the page flag bit. Until the
> pmem/storage use case we just simply lost the whole page (back
> then set_mce_nospec() didn't take an argument, it just marked the
> whole page as "not present" in the kernel 1:1 map).
> 
> So the meaning of HWPoison has subtly changed from "this whole
> page is poisoned" to "there is some poison in some/all[1] of this page"

Is that the PMEM driver case Dan is talking about: "the owner of that
page, PMEM driver, knows how to navigate around that poison to maximize
data recovery flows."?

IOW, even if the page is marked as poison, in the PMEM case the driver
can access those sub-page ranges to salvage data? And in the "normal"
case, we only deal with whole pages anyway because memory_failure()
will mark the whole page as poison and nothing will be able to access
sub-page ranges there to salvage data?

Closer?
Dan Williams Sept. 30, 2021, 8:39 p.m. UTC | #19
On Thu, Sep 30, 2021 at 1:34 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 08:15:51PM +0000, Luck, Tony wrote:
> > That may now be a confusing name for the page flag bit. Until the
> > pmem/storage use case we just simply lost the whole page (back
> > then set_mce_nospec() didn't take an argument, it just marked the
> > whole page as "not present" in the kernel 1:1 map).
> >
> > So the meaning of HWPoison has subtly changed from "this whole
> > page is poisoned" to "there is some poison in some/all[1] of this page"
>
> Is that that PMEM driver case Dan is talking about: "the owner of that
> page, PMEM driver, knows how to navigate around that poison to maximize
> data recovery flows."?
>
> IOW, even if the page is marked as poison, in the PMEM case the driver
> can access those sub-page ranges to salvage data? And in the "normal"
> case, we only deal with whole pages anyway because memory_failure()
> will mark the whole page as poison and nothing will be able to access
> sub-page ranges there to salvage data?
>
> Closer?
>

Yes, that's a good way to think about it. The only way to avoid poison
for page allocator pages is to just ditch the page. In the case of
PMEM the driver can do this fine grained dance because it gets precise
sub-page poison lists to consult and implements a non-mmap path to
access the page contents.
Borislav Petkov Sept. 30, 2021, 8:54 p.m. UTC | #20
On Thu, Sep 30, 2021 at 01:39:03PM -0700, Dan Williams wrote:
> Yes, that's a good way to think about it. The only way to avoid poison
> for page allocator pages is to just ditch the page. In the case of
> PMEM the driver can do this fine grained dance because it gets precise
> sub-page poison lists to consult and implements a non-mmap path to
> access the page contents.

Ok, good.

Now, before we do anything further here, I'd like for this very much
non-obvious situation to be documented in detail so that we know what's
going on there and what that whole_page notion even means. Because this
is at least bending the meaning of page states like poison and what that
really means for the underlying thing - PMEM or general purpose DIMMs.

And then that test could be something like:

	/*
	 * Normal DRAM gets poisoned as a whole page, yadda yadda...
	 /
	if (whole_page) {

	/*
	 * Special handling for PMEM case, driver can handle accessing sub-page ranges
	 * even if the whole "page" is poisoned, blabla
	} else {
		rc = _set_memory_uc(decoy_addr, 1);
	...

so that it is crystal clear what's going on there.

Thx!
Dan Williams Sept. 30, 2021, 9:05 p.m. UTC | #21
On Thu, Sep 30, 2021 at 1:54 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 01:39:03PM -0700, Dan Williams wrote:
> > Yes, that's a good way to think about it. The only way to avoid poison
> > for page allocator pages is to just ditch the page. In the case of
> > PMEM the driver can do this fine grained dance because it gets precise
> > sub-page poison lists to consult and implements a non-mmap path to
> > access the page contents.
>
> Ok, good.
>
> Now, before we do anything further here, I'd like for this very much
> non-obvious situation to be documented in detail so that we know what's
> going on there and what that whole_page notion even means. Because this
> is at least bending the meaning of page states like poison and what that
> really means for the underlying thing - PMEM or general purpose DIMMs.

Hmm, memory type does not matter in this path.

>
> And then that test could be something like:
>
>         /*
>          * Normal DRAM gets poisoned as a whole page, yadda yadda...

No, the whole_page case is equally relevant for PMEM...

>          /
>         if (whole_page) {
>
>         /*
>          * Special handling for PMEM case, driver can handle accessing sub-page ranges
>          * even if the whole "page" is poisoned, blabla
>         } else {

...and UC is acceptable to DRAM.

>                 rc = _set_memory_uc(decoy_addr, 1);
>         ...
>
> so that it is crystal clear what's going on there.
>

The only distinction that matters here is whether the reported
blast-radius of the poison reported by the hardware is a sub-page or
whole-page. Memory type does not matter.

I.e. it's fine if a DRAM page with a single cacheline error only gets
marked UC. Speculation is disabled and the page allocator will still
throw it away and never use it again. Similarly NP is fine for PMEM
when the machine-check-registers indicate that the entire page is full
of poison. The driver will record that and block any attempt to
recover any data in that page.
Borislav Petkov Sept. 30, 2021, 9:20 p.m. UTC | #22
On Thu, Sep 30, 2021 at 02:05:49PM -0700, Dan Williams wrote:
> I.e. it's fine

A lot of things are fine - question is, what makes sense and what makes
this thing as simple as required to be.

> if a DRAM page with a single cacheline error only gets marked UC.
> Speculation is disabled and the page allocator will still throw it
> away and never use it again.

Normal DRAM is poisoned as a whole page, as we established. So whatever
it is marked with - UC or NP - it probably doesn't matter. But since
the page won't be ever used, then that page is practically not present.
So I say, let's mark normal DRAM pages which have been poisoned as not
present and be done with it. Period.

> Similarly NP is fine for PMEM when the machine-check-registers
> indicate that the entire page is full of poison. The driver will
> record that and block any attempt to recover any data in that page.

So this is still kinda weird. We will mark it with either NP or UC but
the driver has some special knowledge how to tiptoe around poison. So
for simplicity, let's mark it with whatever fits best and be done with
it - driver can handle it just fine.

I hope you're catching my drift: it doesn't really matter what's possible
wrt marking - it matters what the practical side of the whole thing
is wrt further page handling and recovery action. And we should do
here whatever does not impede that further page handling even if other
markings are possible.

Ok?
Jane Chu Sept. 30, 2021, 9:23 p.m. UTC | #23
On 9/30/2021 12:11 PM, Dan Williams wrote:
> On Thu, Sep 30, 2021 at 11:15 AM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Hi, Dan,
>>
>> On 9/16/2021 1:33 PM, Dan Williams wrote:
>>> The new consideration of
>>> whole_page() errors means that the code in nfit_handle_mce() is wrong
>>> to assume that the length of the error is just 1 cacheline. When that
>>> bug is fixed the badblocks range will cover the entire page in
>>> whole_page() notification cases.
>>
>> Does this explain what I saw a while ago: inject two poisons
>> to the same block, and pwrite clears one poison at a time?
> 
> Potentially, yes. If you injected poison over 512 bytes then it could
> take 2 pwrites() of 256-bytes each to clear that poison. In the DAX
> case nothing is enforcing sector aligned I/O for pwrite().
> 

Thanks!

-jane
Dan Williams Sept. 30, 2021, 9:41 p.m. UTC | #24
On Thu, Sep 30, 2021 at 2:21 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 02:05:49PM -0700, Dan Williams wrote:
> > I.e. it's fine
>
> A lot of things are fine - question is, what makes sense and what makes
> this thing as simple as required to be.
>
> > if a DRAM page with a single cacheline error only gets marked UC.
> > Speculation is disabled and the page allocator will still throw it
> > away and never use it again.
>
> Normal DRAM is poisoned as a whole page, as we established.

It is not in all cases, as I read:

17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned

...it is MCi_MISC that determines what happens, and MCi_MISC is not
memory-type aware.

> So whatever
> it is marked with - UC or NP - it probably doesn't matter. But since
> the page won't be ever used, then that page is practically not present.
> So I say, let's mark normal DRAM pages which have been poisoned as not
> present and be done with it. Period.

Where would this routine learn the memory type? Pass it in via
memory_failure(), to what end?

> > Similarly NP is fine for PMEM when the machine-check-registers
> > indicate that the entire page is full of poison. The driver will
> > record that and block any attempt to recover any data in that page.
>
> So this is still kinda weird. We will mark it with either NP or UC but
> the driver has some special knowledge how to tiptoe around poison. So
> for simplicity, let's mark it with whatever fits best and be done with
> it - driver can handle it just fine.
>
> I hope you're cathing my drift: it doesn't really matter what's possible
> wrt marking - it matters what the practical side of the whole thing
> is wrt further page handling and recovery action. And we should do
> here whatever does not impede that further page handling even if other
> markings are possible.
>
> Ok?

I'm sorry, but I'm not tracking.

set_mce_nospec() just wants to prevent speculative consumption of
poison going forward. It has 2 options NP and UC. UC is suitable for
DRAM and desired for PMEM as long as MCi_MISC indicates a sub-page
size poison event.

If you want set_mce_nospec() to specify NP for the page even when
MCi_MISC indicates a sub-page blast radius, it would require extra
plumbing to convey the memory type to set_mce_nospec().

I fail to see the point of that extra plumbing when MCi_MISC
indicating "whole_page", or not, is sufficient. What am I missing?
Borislav Petkov Sept. 30, 2021, 10:35 p.m. UTC | #25
On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
> I fail to see the point of that extra plumbing when MCi_MISC
> indicating "whole_page", or not, is sufficient. What am I missing?

I think you're looking at it from the wrong side... (or it is too late
here, but we'll see). Forget how a memory type can be mapped but think
about how the recovery action looks like.

- DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
page by memory_failure(). whole_page is always true here, no matter what
the hardware says because we don't and cannot do any sub-page recovery
actions. So it doesn't matter how we map it, UC, NP... I suggested NP
because the page is practically not present if you want to access it
because mm won't allow it...

- PMEM: reportedly, we can do sub-page recovery here so PMEM should be
mapped in the way it is better for the recovery action to work.

In both cases, the recovery action should control how the memory type is
mapped.

Now, you say we cannot know the memory type when the error gets
reported.

And I say: for simplicity's sake, we simply go and work with whole
pages. Always. That is the case anyway for DRAM.

For PMEM, AFAIU, it doesn't matter whether it is a whole page or not -
the PMEM driver knows how to do those sub-pages accesses.

IOW, set_mce_nospec() should simply do:

	rc = set_memory_np(decoy_addr, 1);

and that's it.
Dan Williams Sept. 30, 2021, 10:44 p.m. UTC | #26
On Thu, Sep 30, 2021 at 3:36 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
> > I fail to see the point of that extra plumbing when MCi_MISC
> > indicating "whole_page", or not, is sufficient. What am I missing?
>
> I think you're looking at it from the wrong side... (or it is too late
> here, but we'll see). Forget how a memory type can be mapped but think
> about how the recovery action looks like.
>
> - DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
> page by memory_failure(). whole_page is always true here, no matter what
> the hardware says because we don't and cannot do any sub-page recovery
> actions. So it doesn't matter how we map it, UC, NP... I suggested NP
> because the page is practically not present if you want to access it
> because mm won't allow it...
>
> - PMEM: reportedly, we can do sub-page recovery here so PMEM should be
> mapped in the way it is better for the recovery action to work.
>
> In both cases, the recovery action should control how the memory type is
> mapped.
>
> Now, you say we cannot know the memory type when the error gets
> reported.
>
> And I say: for simplicity's sake, we simply go and work with whole
> pages. Always. That is the case anyway for DRAM.
>
> For PMEM, AFAIU, it doesn't matter whether it is a whole page or not -
> the PMEM driver knows how to do those sub-pages accesses.
>
> IOW, set_mce_nospec() should simply do:
>
>         rc = set_memory_np(decoy_addr, 1);
>
> and that's it.

The driver uses the direct-map to do the access. It uses the
direct-map because it has also arranged for pfn_to_page() to work for
PMEM pages. So if PMEM is in the direct-map is marked NP then the
sub-page accesses will fault.

Now, the driver could set up and tear down page tables for the pfn
whenever it is asked to do I/O over a potentially poisoned pfn. Is
that what you are suggesting? It seems like a significant amount of
overhead, but it would at least kick this question out of the purview
of the MCE code.
Jane Chu Oct. 1, 2021, 12:43 a.m. UTC | #27
On 9/30/2021 3:35 PM, Borislav Petkov wrote:
> On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
>> I fail to see the point of that extra plumbing when MCi_MISC
>> indicating "whole_page", or not, is sufficient. What am I missing?
> 
> I think you're looking at it from the wrong side... (or it is too late
> here, but we'll see). Forget how a memory type can be mapped but think
> about how the recovery action looks like.
> 
> - DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
> page by memory_failure(). whole_page is always true here, no matter what
> the hardware says because we don't and cannot do any sub-page recovery
> actions. So it doesn't matter how we map it, UC, NP... I suggested NP
> because the page is practically not present if you want to access it
> because mm won't allow it...
> 
> - PMEM: reportedly, we can do sub-page recovery here so PMEM should be
> mapped in the way it is better for the recovery action to work.
> 
> In both cases, the recovery action should control how the memory type is
> mapped.
> 
> Now, you say we cannot know the memory type when the error gets
> reported.
> 
> And I say: for simplicity's sake, we simply go and work with whole
> pages. Always. That is the case anyway for DRAM.

Sorry, please correct me if I misunderstand. The DRAM poison handling
at page frame granularity is a compromise forced by the lack of any
guarantee that the precise error blast radius can be determined across
all types of DRAM and architectures, right?  But that's not true for
the PMEM case. So why should PMEM poison handling follow the lead
of DRAM?

Regards,
-jane

> 
> For PMEM, AFAIU, it doesn't matter whether it is a whole page or not -
> the PMEM driver knows how to do those sub-pages accesses.
> 
> IOW, set_mce_nospec() should simply do:
> 
> 	rc = set_memory_np(decoy_addr, 1);
> 
> and that's it.
>
Dan Williams Oct. 1, 2021, 2:02 a.m. UTC | #28
On Thu, Sep 30, 2021 at 5:43 PM Jane Chu <jane.chu@oracle.com> wrote:
>
>
> On 9/30/2021 3:35 PM, Borislav Petkov wrote:
> > On Thu, Sep 30, 2021 at 02:41:52PM -0700, Dan Williams wrote:
> >> I fail to see the point of that extra plumbing when MCi_MISC
> >> indicating "whole_page", or not, is sufficient. What am I missing?
> >
> > I think you're looking at it from the wrong side... (or it is too late
> > here, but we'll see). Forget how a memory type can be mapped but think
> > about how the recovery action looks like.
> >
> > - DRAM: when a DRAM page is poisoned, it is only poisoned as a whole
> > page by memory_failure(). whole_page is always true here, no matter what
> > the hardware says because we don't and cannot do any sub-page recovery
> > actions. So it doesn't matter how we map it, UC, NP... I suggested NP
> > because the page is practically not present if you want to access it
> > because mm won't allow it...
> >
> > - PMEM: reportedly, we can do sub-page recovery here so PMEM should be
> > mapped in the way it is better for the recovery action to work.
> >
> > In both cases, the recovery action should control how the memory type is
> > mapped.
> >
> > Now, you say we cannot know the memory type when the error gets
> > reported.
> >
> > And I say: for simplicity's sake, we simply go and work with whole
> > pages. Always. That is the case anyway for DRAM.
>
> Sorry, please correct me if I misunderstand. The DRAM poison handling
> at page frame granularity is a helpless compromise due to lack of
> guarantee to decipher the precise error blast radius given all
> types of DRAM and architectures, right?  But that's not true for
> the PMEM case. So why should PMEM poison handling follow the lead
> of DRAM?

If I understand the proposal correctly Boris is basically saying
"figure out how to do your special PMEM stuff in the driver directly
and make it so MCE code has no knowledge of the PMEM case". The flow
is:

memory_failure(pfn, flags)
nfit_handle_mce(...) <--- right now this is on the mce notifier chain
set_mce_nospec(pfn) <--- drop the "whole page" concept

This poses a problem because not all memory_failure() paths trigger
set_mce_nospec() or the mce notifier chain. If that disconnect
happens, attempts to read PMEM pages that have been signalled to
memory_failure() will now crash in the driver without workarounds for
NP pages.

So memory_failure() needs to ensure that it communicates with the
driver before any possible NP page attribute changes. I.e. the driver
needs to know that regardless of how many cachelines are poisoned the
entire page is always unmapped in the direct map.

Then, when the driver is called with the new RWF_RECOVER_DATA flag, it
can set up a new UC alias mapping for the pfn and access the good data
in the page while being careful to read around the poisoned cache
lines.

In my mind this moves the RWF_RECOVER_DATA flag proposal from "nice to
have" to "critical for properly coordinating with memory_failure() and
mce expectations"
Borislav Petkov Oct. 1, 2021, 10:41 a.m. UTC | #29
On Thu, Sep 30, 2021 at 03:44:40PM -0700, Dan Williams wrote:
> The driver uses the direct-map to do the access. It uses the
> direct-map because it has also arranged for pfn_to_page() to work for
> PMEM pages. So if PMEM is in the direct-map is marked NP then the
> sub-page accesses will fault.

Well, the driver has special knowledge so *actually* it could go and use
the NP marking as "this page has been poisoned" and mark it NC only for
itself, so that it gets the job done. Dunno if that would end up being
too ugly to live and turn into a layering violation or so.

Or we can mark the page NC - that should stop the speculative access for
DRAM and the PMEM driver can do its job. The NC should take care so that
no cachelines are in the cache hierarchy.

> Now, the driver could set up and tear down page tables for the pfn
> whenever it is asked to do I/O over a potentially poisoned pfn. Is
> that what you are suggesting? It seems like a significant amount of
> overhead, but it would at least kick this question out of the purview
> of the MCE code.

Yeah, I'm looking for the simplest solution first which satisfies
everyone. Let's keep that one on the bag for now...

:-)
Borislav Petkov Oct. 1, 2021, 10:50 a.m. UTC | #30
On Thu, Sep 30, 2021 at 07:02:13PM -0700, Dan Williams wrote:
> If I understand the proposal correctly Boris is basically saying
> "figure out how to do your special PMEM stuff in the driver directly
> and make it so MCE code has no knowledge of the PMEM case".

I don't mind if it has to because of a good reason X - I'm just trying
to simplify things and not do stuff which is practically unnecessary.

> The flow
> is:
> 
> memory_failure(pfn, flags)
> nfit_handle_mce(...) <--- right now this is on the mce notifier chain
> set_mce_nospec(pfn) <--- drop the "whole page" concept
> 
> This poses a problem because not all memory_failure() paths trigger
> set_mce_nospec() or the mce notifier chain.

Hmm, so this sounds like more is needed than this small patch.

> If that disconnect happens, attempts to read PMEM pages that have been
> signalled to memory_failure() will now crash in the driver without
> workarounds for NP pages.
>
> So memory_failure() needs to ensure that it communicates with the
> driver before any possible NP page attribute changes. I.e. the driver
> needs to know that regardless of how many cachelines are poisoned the
> entire page is always unmapped in the direct map.

Ok, so those errors do get reported through MCE so MCE code does need to
know about them.

So the question is, how should MCE mark them so that the driver can deal
with them properly.

> Then, when the driver is called with the new RWF_RECOVER_DATA flag, it
> can set up a new UC alias mapping for the pfn and access the good data
> in the page while being careful to read around the poisoned cache
> lines.

See my other mail - that sounds good.

> In my mind this moves the RWF_RECOVER_DATA flag proposal from "nice to
> have" to "critical for properly coordinating with memory_failure() and
> mce expectations"

Right, so to reiterate: I don't mind if the MCE code knows about this -
in the end of the day, it is that code which does the error reporting.
My goal is to make that reporting and marking optimal so that the driver
can work with that marking and simplify and document *why* we're doing
that so that future MCE changes can keep that in mind.

So to sum up: it sounds to me like it should simply mark *whole* pages
as NC:

1. this should prevent the spec. access for DRAM.
2. PMEM can do UC alias mapping and do sub-page access to salvage data.

How's that?

Thx.
Dan Williams Oct. 1, 2021, 4:52 p.m. UTC | #31
On Fri, Oct 1, 2021 at 3:50 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Sep 30, 2021 at 07:02:13PM -0700, Dan Williams wrote:
> > If I understand the proposal correctly Boris is basically saying
> > "figure out how to do your special PMEM stuff in the driver directly
> > and make it so MCE code has no knowledge of the PMEM case".
>
> I don't mind if it has to because of a good reason X - I'm just trying
> to simplify things and not do stuff which is practically unnecessary.
>
> > The flow
> > is:
> >
> > memory_failure(pfn, flags)
> > nfit_handle_mce(...) <--- right now this is on the mce notifier chain
> > set_mce_nospec(pfn) <--- drop the "whole page" concept
> >
> > This poses a problem because not all memory_failure() paths trigger
> > set_mce_nospec() or the mce notifier chain.
>
> Hmm, so this sounds like more is needed than this small patch.
>
> > If that disconnect happens, attempts to read PMEM pages that have been
> > signalled to memory_failure() will now crash in the driver without
> > workarounds for NP pages.
> >
> > So memory_failure() needs to ensure that it communicates with the
> > driver before any possible NP page attribute changes. I.e. the driver
> > needs to know that regardless of how many cachelines are poisoned the
> > entire page is always unmapped in the direct map.
>
> Ok, so those errors do get reported through MCE so MCE code does need to
> know about them.
>
> So the question is, how should MCE mark them so that the driver can deal
> with them properly.
>
> > Then, when the driver is called with the new RWF_RECOVER_DATA flag, it
> > can set up a new UC alias mapping for the pfn and access the good data
> > in the page while being careful to read around the poisoned cache
> > lines.
>
> See my other mail - that sounds good.
>
> > In my mind this moves the RWF_RECOVER_DATA flag proposal from "nice to
> > have" to "critical for properly coordinating with memory_failure() and
> > mce expectations"
>
> Right, so to reiterate: I don't mind if the MCE code knows about this -
> in the end of the day, it is that code which does the error reporting.
> My goal is to make that reporting and marking optimal so that the driver
> can work with that marking and simplify and document *why* we're doing
> that so that future MCE changes can keep that in mind.
>
> So to sum up: it sounds to me like it should simply mark *whole* pages
> as NC:
>
> 1. this should prevent the spec. access for DRAM.
> 2. PMEM can do UC alias mapping and do sub-page access to salvage data.
>
> How's that?

I think that puts us back in the situation that Tony fixed in:

17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
affected and poisoned

...where the clflush in _set_memory_uc() causes more unwanted virtual
#MC injection.

I think that means that we have no choice but to mark the page NP
unconditionally and do the work to ensure that the driver has updated
its poisoned cacheline tracking for data recovery requests.

Right?
Borislav Petkov Oct. 1, 2021, 6:11 p.m. UTC | #32
On Fri, Oct 01, 2021 at 09:52:21AM -0700, Dan Williams wrote:
> I think that puts us back in the situation that Tony fixed in:
> 
> 17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
> affected and poisoned
> 
> ...where the clflush in _set_memory_uc() causes more unwanted virtual
> #MC injection.

Hmm, lemme read that commit message again: so the guest kernel sees a
*second* MCE while doing set_memory_uc().

So what prevents the guest kernel from seeing a second MCE when it does
set_memory_np() instead?

"Further investigation showed that the VMM had passed in another machine
check because it appeared that the guest was accessing the bad page."

but then the beginning of the commit message says:

"The VMM unmapped the bad page from guest physical space and passed the
machine check to the guest."

so I'm really confused here what actually happens. Did the VMM manage to
unmap it or not really? Because if the VMM had unmapped it, then how was
the guest still accessing the bad page? ... Maybe I'm reading that wrong.

> I think that means that we have no choice but to mark the page NP
> unconditionally and do the work to ensure that the driver has updated
> its poisoned cacheline tracking for data recovery requests.

So a couple of emails earlier I had this:

|Well, the driver has special knowledge so *actually* it could go and use
|the NP marking as "this page has been poisoned" and mark it NC only for
|itself, so that it gets the job done. Dunno if that would end up being
|too ugly to live and turn into a layering violation or so.

So if we have marked the page NP, then nothing would be able to access
it anymore and it will be marked as hwpoison additionally, which will
prevent that access too.

Then, the PMEM driver would go and map the page however it wants to, it
could even remove it from the direct map so that nothing else accesses
it, even in the kernel, and then do all kinds of recovery.

Hmm?
Dan Williams Oct. 1, 2021, 6:29 p.m. UTC | #33
On Fri, Oct 1, 2021 at 11:12 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Oct 01, 2021 at 09:52:21AM -0700, Dan Williams wrote:
> > I think that puts us back in the situation that Tony fixed in:
> >
> > 17fae1294ad9 x86/{mce,mm}: Unmap the entire page if the whole page is
> > affected and poisoned
> >
> > ...where the clflush in _set_memory_uc() causes more unwanted virtual
> > #MC injection.
>
> Hmm, lemme read that commit message again: so the guest kernel sees a
> *second* MCE while doing set_memory_uc().
>
> So what prevents the guest kernel from seeing a second MCE when it does
> set_memory_np() instead?
>
> "Further investigation showed that the VMM had passed in another machine
> check because it appeared that the guest was accessing the bad page."
>
> but then the beginning of the commit message says:
>
> "The VMM unmapped the bad page from guest physical space and passed the
> machine check to the guest."
>
> so I'm really confused here what actually happens. Did the VMM manage to
> unmap it or not really? Because if the VMM had unmapped it, then how was
> the guest still accessing the bad page? ... Maybe I'm reading that wrong.

My read is that the guest gets virtual #MC on an access to that page.
When the guest tries to do set_memory_uc() and instructs cpa_flush()
to clean the caches, that results in taking another fault / exception,
perhaps because the VMM unmapped the page from the guest? If the guest
had flipped the page to NP then cpa_flush() says "oh, no caching
change, skip the clflush() loop".

>
> > I think that means that we have no choice but to mark the page NP
> > unconditionally and do the work to ensure that the driver has updated
> > its poisoned cacheline tracking for data recovery requests.
>
> So a couple of emails earlier I had this:
>
> |Well, the driver has special knowledge so *actually* it could go and use
> |the NP marking as "this page has been poisoned" and mark it NC only for
> |itself, so that it gets the job done. Dunno if that would end up being
> |too ugly to live and turn into a layering violation or so.
>
> So if we have marked the page NP, then nothing would be able to access
> it anymore and it will be marked as hwpoison additionally, which will
> prevent that access too.
>
> Then, the PMEM driver would go and map the page however it wants to, it
> could even remove it from the direct map so that nothing else accesses
> it, even in the kernel, and then do all kinds of recovery.
>
> Hmm?

Yeah, I thought UC would make the PMEM driver's life easier, but if it
has to contend with an NP case at all, might as well make it handle
that case all the time.

Safe to say this patch of mine is woefully insufficient and I need to
go look at how to make the guarantees needed by the PMEM driver so it
can handle NP and set up alias maps.

This was a useful discussion. It proves that my commit:

284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()

...was broken from the outset.
Borislav Petkov Oct. 2, 2021, 10:17 a.m. UTC | #34
On Fri, Oct 01, 2021 at 11:29:43AM -0700, Dan Williams wrote:
> My read is that the guest gets virtual #MC on an access to that page.
> When the guest tries to do set_memory_uc() and instructs cpa_flush()
> to do clean caches that results in taking another fault / exception
> perhaps because the VMM unmapped the page from the guest? If the guest
> had flipped the page to NP then cpa_flush() says "oh, no caching
> change, skip the clflush() loop".

... and the CLFLUSH is the insn which caused the second MCE because it
"appeared that the guest was accessing the bad page."

Uuf, that could be. Nasty.

> Yeah, I thought UC would make the PMEM driver's life easier, but if it
> has to contend with an NP case at all, might as well make it handle
> that case all the time.
> 
> Safe to say this patch of mine is woefully insufficient and I need to
> go look at how to make the guarantees needed by the PMEM driver so it
> can handle NP and set up alias maps.
> 
> This was a useful discussion.

Oh yeah, thanks for taking the time!

> It proves that my commit:
> 
> 284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()
> 
> ...was broken from the outset.

Well, the problem with hw errors is that it is always very hard to test
the code. But I hear injection works now soo... :-)

Thx.
Jane Chu Nov. 11, 2021, 12:06 a.m. UTC | #35
Hi, All,

Since I volunteered to add a patch in my series to capture the
idea discussed here about what set_mce_nospec() should do, I went
back and reread the discussion.

The conclusion of the discussion is that set_mce_nospec() should
always do set_memory_np(), regardless of 'whole_page' or
DRAM vs PMEM, which the code cannot tell anyway.

I'd like to voice concern about the risk, which Dan raised, that this
change of behavior poses to pmem users.  Yes, in theory, it's all the
driver's problem, let the driver deal with it. But in reality,
that translates to many unhappy customers when their
mission-critical applications crash and things don't get fixed
the next day.

Is the risk unavoidable? If I'm not mistaken, I thought the
central point of the discussion was about clarity/simplicity
rather than correctness.  How about we address that without
raising risk to existing customers?  Here is my proposed
wording with the fix Dan sent earlier.

   /*
    * set_mce_nospec - mark the page's memory type so as to prevent
    * speculative access to a poisoned page.
    *
    * @pfn - pfn of a page that is either wholly or partially poisoned.
    * @whole_page - indicates whether the entire page is poisoned or only
    *       part of the page is poisoned, according to the MCi_MISC register.
    *
    * The page might be a DRAM or a PMEM page, which the code cannot tell.
    * The page might have already been unmapped (when 'whole_page' is true)
    * and may not be accessed in any case (e.g. guest page).
    *
    * The page might be partially poisoned, in which case the non-poisoned
    * cachelines could be safely accessed _in theory_. In practice, when a
    * DRAM page is marked PageHWPoison, the mm-subsystem prevents it from
    * being accessed, but a PMEM page marked PageHWPoison can still be
    * accessed, as it is not entirely under mm-subsystem management.
    *
    * Setting the mem_type of the page to either 'NP' or 'UC' prevents the
    * prefetcher from accessing the page, hence the rest of the decision
    * is about minimizing the risk and maximizing the flexibility: in the
    * 'whole_page' case, set mem_type to 'NP'; for partial page poisoning,
    * set mem_type to 'UC', regardless of whether the page is DRAM or PMEM.
    */
static inline int set_mce_nospec(unsigned long pfn, bool whole_page)
<snip>
         if (whole_page)
                 rc = set_memory_np(decoy_addr, 1);
         else
                 rc = _set_memory_uc(decoy_addr, 1);
<snip>

Comments? Suggestions?

thanks,
-jane




On 10/2/2021 3:17 AM, Borislav Petkov wrote:
> On Fri, Oct 01, 2021 at 11:29:43AM -0700, Dan Williams wrote:
>> My read is that the guest gets virtual #MC on an access to that page.
>> When the guest tries to do set_memory_uc() and instructs cpa_flush()
>> to do clean caches that results in taking another fault / exception
>> perhaps because the VMM unmapped the page from the guest? If the guest
>> had flipped the page to NP then cpa_flush() says "oh, no caching
>> change, skip the clflush() loop".
> 
> ... and the CLFLUSH is the insn which caused the second MCE because it
> "appeared that the guest was accessing the bad page."
> 
> Uuf, that could be. Nasty.
> 
>> Yeah, I thought UC would make the PMEM driver's life easier, but if it
>> has to contend with an NP case at all, might as well make it handle
>> that case all the time.
>>
>> Safe to say this patch of mine is woefully insufficient and I need to
>> go look at how to make the guarantees needed by the PMEM driver so it
>> can handle NP and set up alias maps.
>>
>> This was a useful discussion.
> 
> Oh yeah, thanks for taking the time!
> 
>> It proves that my commit:
>>
>> 284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()
>>
>> ...was broken from the outset.
> 
> Well, the problem with hw errors is that it is always very hard to test
> the code. But I hear injection works now soo... :-)
> 
> Thx.
>
Jane Chu Nov. 12, 2021, 12:30 a.m. UTC | #36
Just a quick update -

I managed to test the 'NP' and 'UC' effect on a pmem dax file.
The result is, as expected, that both 'NP' and 'UC' work
well in preventing the prefetcher from accessing the poisoned
pmem page.

I injected back-to-back poisons into the 3rd block (512B) of
the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
stops after reading the 1st and 2nd pages; with 'UC',
the 'mc_safe read' was able to read [2 pages + 2 blocks] on
my test machine.

thanks,
-jane


On 11/10/2021 4:06 PM, Jane Chu wrote:
> Hi, All,
> 
> Since I volunteered to add a patch in my series to captured the
> idea discussed here about what set_mce_nospec() should do, I went
> back reread the discussion.
> 
> The conclusion of the discussion is that set_mce_nospec() should
> always do set_memory_np(), regardless 'whole_page', or
> DRAM vs PMEM which the code cannot tell any way.
> 
> I'd like to voice concern about the risk that Dan raised about
> this change of behavior to pmem users.  Yes, in theory, it's all
> driver's problem, let the driver deal with it. But in reality,
> that translates to many unhappy customers when their
> mission-critical applications crash and things don't get fixed
> the next day.
> 
> Is the risk unavoidable? If I'm not mistaken, I thought the
> central point of the discussion was about clarity/simplicity
> rather than correctness.  How about we address that without
> raising risk to existing customers?  Here is my proposed
> wording with the fix Dan sent earlier.
> 
>     /*
>      * set_memory_nospec - make memory type marking in order to prevent
>      * speculative access to poisoned page.
>      *
>      * @pfn - pfn of a page that is either poisoned in the whole, or
> partially
>      *       poisoned,
>      * @whole_page - indicates whether the entire page is poisoned or only
>      *       part of the page is poisoned accoding to the MSi_MISC register.
>      *
>      * The page might be a DRAM or a PMEM page which the code cannot tell.
>      * The page might have already been unmapped (when 'whole_page') is true
>      * and may not be accessed in any case (e.g. guest page).
>      *
>      * The page might be partially poisoned and hence the non-poisoned
>      * cachelines could be safely accessed _in theory_, although practically,
>      * when a DRAM page is marked PageHWPoison, the mm-subsystem prevents
>      * it from being accessed, but when a PMEM page is marked PageHWPoison,
>      * it could practically be accessed as it is not entirely under the
>      * mm-subsystem management.
>      *
>      * Setting mem_type of the page to either 'NP' or 'UC' prevents the
>      * prefetcher from accessing the page, henec the rest of the decision
>      * is based on minimizing the risk and maximizing the flexibility,
> that is,
>      * in case of 'whole_page', set mem_type to 'NP', but for partial page
>      * poisoning, set mem_type to 'UC', regardless the page is DRAM or PMEM.
>      */
> static inline int set_mce_nospec(unsigned long pfn, bool whole_page)
> <snip>
>           if (whole_page)
>                   rc = set_memory_np(decoy_addr, 1);
>           else {
>                  rc = _set_memory_uc(decoy_addr, 1);
>           }
> <snip>
> 
> Comments? Suggestions?
> 
> thanks,
> -jane
> 
> 
> 
> 
> On 10/2/2021 3:17 AM, Borislav Petkov wrote:
>> On Fri, Oct 01, 2021 at 11:29:43AM -0700, Dan Williams wrote:
>>> My read is that the guest gets virtual #MC on an access to that page.
>>> When the guest tries to do set_memory_uc() and instructs cpa_flush()
>>> to do clean caches that results in taking another fault / exception
>>> perhaps because the VMM unmapped the page from the guest? If the guest
>>> had flipped the page to NP then cpa_flush() says "oh, no caching
>>> change, skip the clflush() loop".
>>
>> ... and the CLFLUSH is the insn which caused the second MCE because it
>> "appeared that the guest was accessing the bad page."
>>
>> Uuf, that could be. Nasty.
>>
>>> Yeah, I thought UC would make the PMEM driver's life easier, but if it
>>> has to contend with an NP case at all, might as well make it handle
>>> that case all the time.
>>>
>>> Safe to say this patch of mine is woefully insufficient and I need to
>>> go look at how to make the guarantees needed by the PMEM driver so it
>>> can handle NP and set up alias maps.
>>>
>>> This was a useful discussion.
>>
>> Oh yeah, thanks for taking the time!
>>
>>> It proves that my commit:
>>>
>>> 284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()
>>>
>>> ...was broken from the outset.
>>
>> Well, the problem with hw errors is that it is always very hard to test
>> the code. But I hear injection works now soo... :-)
>>
>> Thx.
>>
>
Dan Williams Nov. 12, 2021, 12:51 a.m. UTC | #37
On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> Just a quick update -
>
> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
> The result is, as expected, both setting 'NP' and 'UC' works
> well in preventing the prefetcher from accessing the poisoned
> pmem page.
>
> I injected back-to-back poisons to the 3rd block(512B) of
> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
> stops  after reading the 1st and 2nd pages, with 'UC',
> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
> my test machine.

My expectation is that dax_direct_access() / dax_recovery_read() has
installed a temporary UC alias for the pfn, or has temporarily flipped
NP to UC. Outside of dax_recovery_read() the page will always be NP.
Jane Chu Nov. 12, 2021, 5:57 p.m. UTC | #38
On 11/11/2021 4:51 PM, Dan Williams wrote:
> On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> Just a quick update -
>>
>> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
>> The result is, as expected, both setting 'NP' and 'UC' works
>> well in preventing the prefetcher from accessing the poisoned
>> pmem page.
>>
>> I injected back-to-back poisons to the 3rd block(512B) of
>> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
>> stops  after reading the 1st and 2nd pages, with 'UC',
>> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
>> my test machine.
> 
> My expectation is that dax_direct_access() / dax_recovery_read() has
> installed a temporary UC alias for the pfn, or has temporarily flipped
> NP to UC. Outside of dax_recovery_read() the page will always be NP.
> 

Okay.  Could we only flip the memtype within dax_recovery_read, and
not within dax_direct_access?  dax_direct_access does not need to
access the page.

thanks,
-jane
Dan Williams Nov. 12, 2021, 7:24 p.m. UTC | #39
On Fri, Nov 12, 2021 at 9:58 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/11/2021 4:51 PM, Dan Williams wrote:
> > On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
> >>
> >> Just a quick update -
> >>
> >> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
> >> The result is, as expected, both setting 'NP' and 'UC' works
> >> well in preventing the prefetcher from accessing the poisoned
> >> pmem page.
> >>
> >> I injected back-to-back poisons to the 3rd block(512B) of
> >> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
> >> stops  after reading the 1st and 2nd pages, with 'UC',
> >> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
> >> my test machine.
> >
> > My expectation is that dax_direct_access() / dax_recovery_read() has
> > installed a temporary UC alias for the pfn, or has temporarily flipped
> > NP to UC. Outside of dax_recovery_read() the page will always be NP.
> >
>
> Okay.  Could we only flip the memtype within dax_recovery_read, and
> not within dax_direct_access?  dax_direct_access does not need to
> access the page.

True, dax_direct_access() does not need to do the page permission
change, it just needs to indicate if dax_recovery_{read,write}() may
be attempted. I was thinking that the DAX pages only float between NP
and WB depending on whether poison is present in the page. If
dax_recovery_read() wants to do UC reads around the poison it can use
ioremap() or vmap() to create a temporary UC alias. The temporary UC
alias is only possible if there might be non-clobbered data remaining
in the page. I.e. the current "whole_page()" determination in
uc_decode_notifier() needs to be plumbed into the PMEM driver so that
it can cooperate with a virtualized environment that injects virtual
#MC at page granularity. I.e. nfit_handle_mce() is broken in that it
only assumes a single cacheline around the failure address is
poisoned, it needs that same whole_page() logic.
Jane Chu Nov. 12, 2021, 10:35 p.m. UTC | #40
On 11/12/2021 11:24 AM, Dan Williams wrote:
> On Fri, Nov 12, 2021 at 9:58 AM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> On 11/11/2021 4:51 PM, Dan Williams wrote:
>>> On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
>>>>
>>>> Just a quick update -
>>>>
>>>> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
>>>> The result is, as expected, both setting 'NP' and 'UC' works
>>>> well in preventing the prefetcher from accessing the poisoned
>>>> pmem page.
>>>>
>>>> I injected back-to-back poisons to the 3rd block(512B) of
>>>> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
>>>> stops  after reading the 1st and 2nd pages, with 'UC',
>>>> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
>>>> my test machine.
>>>
>>> My expectation is that dax_direct_access() / dax_recovery_read() has
>>> installed a temporary UC alias for the pfn, or has temporarily flipped
>>> NP to UC. Outside of dax_recovery_read() the page will always be NP.
>>>
>>
>> Okay.  Could we only flip the memtype within dax_recovery_read, and
>> not within dax_direct_access?  dax_direct_access does not need to
>> access the page.
> 
> True, dax_direct_access() does not need to do the page permission
> change, it just needs to indicate if dax_recovery_{read,write}() may
> be attempted. I was thinking that the DAX pages only float between NP
> and WB depending on whether poison is present in the page. If
> dax_recovery_read() wants to do UC reads around the poison it can use
> ioremap() or vmap() to create a temporary UC alias. The temporary UC
> alias is only possible if there might be non-clobbered data remaining
> in the page. I.e. the current "whole_page()" determination in
> uc_decode_notifier() needs to be plumbed into the PMEM driver so that
> it can cooperate with a virtualized environment that injects virtual
> #MC at page granularity. I.e. nfit_handle_mce() is broken in that it
> only assumes a single cacheline around the failure address is
> poisoned, it needs that same whole_page() logic.
> 

I'll have to take some time to digest what you proposed, but alas, why
couldn't we let the correct decision (about NP vs UC) be made when
the 'whole_page' test has access to the MCi_MISC register, instead of
risking a mistaken NP->UC change in dax_recovery_read() and
risking a repeat of the bug that Tony fixed?  I mean, a PMEM page might
be legitimately inaccessible because it has been unmapped by
the host, but dax_recovery_read() has no way to know, right?

The whole UC <> NP argument to me seems to be
  "UC being a harmless/workable solution for DRAM and PMEM"  versus
  "NP being simpler regardless of what risk it brings to PMEM".
To us, PMEM is not just another driver, it is treated as memory by our
customer, so why?

thanks!
-jane
Jane Chu Nov. 12, 2021, 10:50 p.m. UTC | #41
On 11/12/2021 2:35 PM, Jane Chu wrote:
> On 11/12/2021 11:24 AM, Dan Williams wrote:
>> On Fri, Nov 12, 2021 at 9:58 AM Jane Chu <jane.chu@oracle.com> wrote:
>>>
>>> On 11/11/2021 4:51 PM, Dan Williams wrote:
>>>> On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
>>>>>
>>>>> Just a quick update -
>>>>>
>>>>> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
>>>>> The result is, as expected, both setting 'NP' and 'UC' works
>>>>> well in preventing the prefetcher from accessing the poisoned
>>>>> pmem page.
>>>>>
>>>>> I injected back-to-back poisons to the 3rd block(512B) of
>>>>> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
>>>>> stops  after reading the 1st and 2nd pages, with 'UC',
>>>>> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
>>>>> my test machine.
>>>>
>>>> My expectation is that dax_direct_access() / dax_recovery_read() has
>>>> installed a temporary UC alias for the pfn, or has temporarily flipped
>>>> NP to UC. Outside of dax_recovery_read() the page will always be NP.
>>>>
>>>
>>> Okay.  Could we only flip the memtype within dax_recovery_read, and
>>> not within dax_direct_access?  dax_direct_access does not need to
>>> access the page.
>>
>> True, dax_direct_access() does not need to do the page permission
>> change, it just needs to indicate if dax_recovery_{read,write}() may
>> be attempted. I was thinking that the DAX pages only float between NP
>> and WB depending on whether poison is present in the page. If
>> dax_recovery_read() wants to do UC reads around the poison it can use
>> ioremap() or vmap() to create a temporary UC alias. The temporary UC
>> alias is only possible if there might be non-clobbered data remaining
>> in the page. I.e. the current "whole_page()" determination in
>> uc_decode_notifier() needs to be plumbed into the PMEM driver so that
>> it can cooperate with a virtualized environment that injects virtual
>> #MC at page granularity. I.e. nfit_handle_mce() is broken in that it
>> only assumes a single cacheline around the failure address is
>> poisoned, it needs that same whole_page() logic.
>>
> 
> I'll have to take some time to digest what you proposed, but alas, why
> couldn't we let the correct decision (about NP vs UC) being made when
> the 'whole_page' test has access to the MSi_MISC register, instead of
> having to risk mistakenly change NP->UC in dax_recovery_read() and
> risk to repeat the bug that Tony has fixed?  I mean a PMEM page might
> be legitimately not-accessible due to it might have been unmapped by
> the host, but dax_recovery_read() has no way to know, right?
> 
> The whole UC <> NP arguments to me seems to be a
>    "UC being harmless/workable solution to DRAM and PMEM"  versus
>    "NP being simpler regardless what risk it brings to PMEM".

That wasn't clear, let me try again -

"reject the if-whole_page-then-set-NP-otherwise-set-UC solution
  on the grounds that DRAM can't benefit from UC, since to DRAM the UC
  effect is practically the same as the NP effect, but harmless anyway"

versus

"accept set-NP-always regardless of whatever risk it brings to PMEM"

> To us, PMEM is not just another driver, it is treated as memory by our
> customer, so why?
> 
> thanks!
> -jane
> 
> 
> 
thanks,
-jane
Dan Williams Nov. 12, 2021, 11:08 p.m. UTC | #42
On Fri, Nov 12, 2021 at 2:35 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/12/2021 11:24 AM, Dan Williams wrote:
> > On Fri, Nov 12, 2021 at 9:58 AM Jane Chu <jane.chu@oracle.com> wrote:
> >>
> >> On 11/11/2021 4:51 PM, Dan Williams wrote:
> >>> On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
> >>>>
> >>>> Just a quick update -
> >>>>
> >>>> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
> >>>> The result is, as expected, both setting 'NP' and 'UC' works
> >>>> well in preventing the prefetcher from accessing the poisoned
> >>>> pmem page.
> >>>>
> >>>> I injected back-to-back poisons to the 3rd block(512B) of
> >>>> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
> >>>> stops  after reading the 1st and 2nd pages, with 'UC',
> >>>> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
> >>>> my test machine.
> >>>
> >>> My expectation is that dax_direct_access() / dax_recovery_read() has
> >>> installed a temporary UC alias for the pfn, or has temporarily flipped
> >>> NP to UC. Outside of dax_recovery_read() the page will always be NP.
> >>>
> >>
> >> Okay.  Could we only flip the memtype within dax_recovery_read, and
> >> not within dax_direct_access?  dax_direct_access does not need to
> >> access the page.
> >
> > True, dax_direct_access() does not need to do the page permission
> > change, it just needs to indicate if dax_recovery_{read,write}() may
> > be attempted. I was thinking that the DAX pages only float between NP
> > and WB depending on whether poison is present in the page. If
> > dax_recovery_read() wants to do UC reads around the poison it can use
> > ioremap() or vmap() to create a temporary UC alias. The temporary UC
> > alias is only possible if there might be non-clobbered data remaining
> > in the page. I.e. the current "whole_page()" determination in
> > uc_decode_notifier() needs to be plumbed into the PMEM driver so that
> > it can cooperate with a virtualized environment that injects virtual
> > #MC at page granularity. I.e. nfit_handle_mce() is broken in that it
> > only assumes a single cacheline around the failure address is
> > poisoned, it needs that same whole_page() logic.
> >
>
> I'll have to take some time to digest what you proposed, but alas, why
> couldn't we let the correct decision (about NP vs UC) being made when
> the 'whole_page' test has access to the MSi_MISC register, instead of
> having to risk mistakenly change NP->UC in dax_recovery_read() and
> risk to repeat the bug that Tony has fixed?  I mean a PMEM page might
> be legitimately not-accessible due to it might have been unmapped by
> the host, but dax_recovery_read() has no way to know, right?

It should know because the MCE that unmapped the page will have
communicated a "whole_page()" MCE. When dax_recovery_read() goes to
consult the badblocks list to try to read the remaining good data it
will see that every single cacheline is covered by badblocks, so
nothing to read, and no need to establish the UC mapping. So the
"Tony fix" was incomplete in retrospect. It neglected to update the
NVDIMM badblocks tracking for the whole page case.

> The whole UC <> NP arguments to me seems to be a
>   "UC being harmless/workable solution to DRAM and PMEM"  versus
>   "NP being simpler regardless what risk it brings to PMEM".
> To us, PMEM is not just another driver, it is treated as memory by our
> customer, so why?

It's really not a question of UC vs NP, it's a question of accurately
tracking how many cachelines are clobbered in an MCE event so that
hypervisors can punch out entire pages from any future guest access.
This also raises another problem in my mind, how does the hypervisor
learn that the poison was repaired? Presumably the "Tony fix" was for
a case where the guest thought the page was still accessible, but the
hypervisor wanted the whole thing treated as poison. It may be the
case that we're missing a mechanism to ask the hypervisor to consider
that the guest has cleared the poison. At least for PMEM described by
ACPI the existing ClearError DSM in the guest could be trapped by the
hypervisor to handle this case, but I'm not sure what to do about
guests that later want to use MOVDIR64B to clear errors.
Jane Chu Nov. 13, 2021, 5:50 a.m. UTC | #43
On 11/12/2021 3:08 PM, Dan Williams wrote:
> On Fri, Nov 12, 2021 at 2:35 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> On 11/12/2021 11:24 AM, Dan Williams wrote:
>>> On Fri, Nov 12, 2021 at 9:58 AM Jane Chu <jane.chu@oracle.com> wrote:
>>>>
>>>> On 11/11/2021 4:51 PM, Dan Williams wrote:
>>>>> On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
>>>>>>
>>>>>> Just a quick update -
>>>>>>
>>>>>> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
>>>>>> The result is, as expected, both setting 'NP' and 'UC' works
>>>>>> well in preventing the prefetcher from accessing the poisoned
>>>>>> pmem page.
>>>>>>
>>>>>> I injected back-to-back poisons to the 3rd block(512B) of
>>>>>> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
>>>>>> stops  after reading the 1st and 2nd pages, with 'UC',
>>>>>> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
>>>>>> my test machine.
>>>>>
>>>>> My expectation is that dax_direct_access() / dax_recovery_read() has
>>>>> installed a temporary UC alias for the pfn, or has temporarily flipped
>>>>> NP to UC. Outside of dax_recovery_read() the page will always be NP.
>>>>>
>>>>
>>>> Okay.  Could we only flip the memtype within dax_recovery_read, and
>>>> not within dax_direct_access?  dax_direct_access does not need to
>>>> access the page.
>>>
>>> True, dax_direct_access() does not need to do the page permission
>>> change, it just needs to indicate if dax_recovery_{read,write}() may
>>> be attempted. I was thinking that the DAX pages only float between NP
>>> and WB depending on whether poison is present in the page. If
>>> dax_recovery_read() wants to do UC reads around the poison it can use
>>> ioremap() or vmap() to create a temporary UC alias. The temporary UC
>>> alias is only possible if there might be non-clobbered data remaining
>>> in the page. I.e. the current "whole_page()" determination in
>>> uc_decode_notifier() needs to be plumbed into the PMEM driver so that
>>> it can cooperate with a virtualized environment that injects virtual
>>> #MC at page granularity. I.e. nfit_handle_mce() is broken in that it
>>> only assumes a single cacheline around the failure address is
>>> poisoned, it needs that same whole_page() logic.
>>>
>>
>> I'll have to take some time to digest what you proposed, but alas, why
>> couldn't we let the correct decision (about NP vs UC) being made when
>> the 'whole_page' test has access to the MSi_MISC register, instead of
>> having to risk mistakenly change NP->UC in dax_recovery_read() and
>> risk to repeat the bug that Tony has fixed?  I mean a PMEM page might
>> be legitimately not-accessible due to it might have been unmapped by
>> the host, but dax_recovery_read() has no way to know, right?
> 
> It should know because the MCE that unmapped the page will have
> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
> consult the badblocks list to try to read the remaining good data it
> will see that every single cacheline is covered by badblocks, so
> nothing to read, and no need to establish the UC mapping. So the the
> "Tony fix" was incomplete in retrospect. It neglected to update the
> NVDIMM badblocks tracking for the whole page case.

So the call in nfit_handle_mce():
   nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
                 ALIGN(mce->addr, L1_CACHE_BYTES),
                 L1_CACHE_BYTES);
should be replaced by
   nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
                 ALIGN(mce->addr, L1_CACHE_BYTES),
                 (1 << MCI_MISC_ADDR_LSB(mce->misc)));
right?

And when dax_recovery_read() calls
   badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
it should always, in the 'NP' case, discover that 'first_bad'
is the first sector in the poisoned page, hence no need
to switch to 'UC', right?

In case 'first_bad' is in the middle of the poisoned page,
that is, dax_recovery_read() could potentially read some clean
sectors, is there a problem with
   calling _set_memory_uc(pfn, 1),
   doing the mc_safe read,
   and then calling set_memory_np(pfn, 1)
?
Why do we need to call ioremap() or vmap()?

> 
>> The whole UC <> NP arguments to me seems to be a
>>    "UC being harmless/workable solution to DRAM and PMEM"  versus
>>    "NP being simpler regardless what risk it brings to PMEM".
>> To us, PMEM is not just another driver, it is treated as memory by our
>> customer, so why?
> 
> It's really not a question of UC vs NP, it's a question of accurately
> tracking how many cachelines are clobbered in an MCE event so that
> hypervisors can punch out entire pages from any future guest access.
> This also raises another problem in my mind, how does the hypervisor
> learn that the poison was repaired? 

Good question!

> Presumably the "Tony fix" was for
> a case where the guest thought the page was still accessible, but the
> hypervisor wanted the whole thing treated as poison. It may be the
> case that we're missing a mechanism to ask the hypervisor to consider
> that the guest has cleared the poison. At least for PMEM described by
> ACPI the existing ClearError DSM in the guest could be trapped by the
> hypervisor to handle this case, 

I didn't even know that the guest could clear poison via the ClearError
DSM trapped by the hypervisor; I thought the guest wasn't privileged to
do that. Would you mind elaborating on the mechanism, maybe pointing
out the code, and sharing a test case if you have one?

> but I'm not sure what to do about
> guests that later want to use MOVDIR64B to clear errors.
> 

Yeah, perhaps there is no way to prevent the guest from accidentally
clearing errors via MOVDIR64B, given that some applications rely on
MOVDIR64B for fast data movement (straight to the media). I guess in
that case, the consequence is a false alarm, nothing disastrous, right?

How about allowing the potential bad-block bookkeeping gap, and
managing to close the gap at certain checkpoints? I guess one of
the checkpoints might be when a page fault discovers a poisoned
page?

thanks!
-jane
Dan Williams Nov. 13, 2021, 8:47 p.m. UTC | #44
On Fri, Nov 12, 2021 at 9:50 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/12/2021 3:08 PM, Dan Williams wrote:
> > On Fri, Nov 12, 2021 at 2:35 PM Jane Chu <jane.chu@oracle.com> wrote:
> >>
> >> On 11/12/2021 11:24 AM, Dan Williams wrote:
> >>> On Fri, Nov 12, 2021 at 9:58 AM Jane Chu <jane.chu@oracle.com> wrote:
> >>>>
> >>>> On 11/11/2021 4:51 PM, Dan Williams wrote:
> >>>>> On Thu, Nov 11, 2021 at 4:30 PM Jane Chu <jane.chu@oracle.com> wrote:
> >>>>>>
> >>>>>> Just a quick update -
> >>>>>>
> >>>>>> I managed to test the 'NP' and 'UC' effect on a pmem dax file.
> >>>>>> The result is, as expected, both setting 'NP' and 'UC' works
> >>>>>> well in preventing the prefetcher from accessing the poisoned
> >>>>>> pmem page.
> >>>>>>
> >>>>>> I injected back-to-back poisons to the 3rd block(512B) of
> >>>>>> the 3rd page in my dax file.  With 'NP', the 'mc_safe read'
> >>>>>> stops  after reading the 1st and 2nd pages, with 'UC',
> >>>>>> the 'mc_safe read' was able to read [2 pages + 2 blocks] on
> >>>>>> my test machine.
> >>>>>
> >>>>> My expectation is that dax_direct_access() / dax_recovery_read() has
> >>>>> installed a temporary UC alias for the pfn, or has temporarily flipped
> >>>>> NP to UC. Outside of dax_recovery_read() the page will always be NP.
> >>>>>
> >>>>
> >>>> Okay.  Could we only flip the memtype within dax_recovery_read, and
> >>>> not within dax_direct_access?  dax_direct_access does not need to
> >>>> access the page.
> >>>
> >>> True, dax_direct_access() does not need to do the page permission
> >>> change, it just needs to indicate if dax_recovery_{read,write}() may
> >>> be attempted. I was thinking that the DAX pages only float between NP
> >>> and WB depending on whether poison is present in the page. If
> >>> dax_recovery_read() wants to do UC reads around the poison it can use
> >>> ioremap() or vmap() to create a temporary UC alias. The temporary UC
> >>> alias is only possible if there might be non-clobbered data remaining
> >>> in the page. I.e. the current "whole_page()" determination in
> >>> uc_decode_notifier() needs to be plumbed into the PMEM driver so that
> >>> it can cooperate with a virtualized environment that injects virtual
> >>> #MC at page granularity. I.e. nfit_handle_mce() is broken in that it
> >>> only assumes a single cacheline around the failure address is
> >>> poisoned, it needs that same whole_page() logic.
> >>>
> >>
> >> I'll have to take some time to digest what you proposed, but alas, why
> >> couldn't we let the correct decision (about NP vs UC) be made when
> >> the 'whole_page' test has access to the MCi_MISC register, instead of
> >> risking mistakenly changing NP->UC in dax_recovery_read() and
> >> risking a repeat of the bug that Tony has fixed?  I mean a PMEM page might
> >> be legitimately not-accessible because it might have been unmapped by
> >> the host, but dax_recovery_read() has no way to know, right?
> >
> > It should know because the MCE that unmapped the page will have
> > communicated a "whole_page()" MCE. When dax_recovery_read() goes to
> > consult the badblocks list to try to read the remaining good data it
> > will see that every single cacheline is covered by badblocks, so
> > nothing to read, and no need to establish the UC mapping. So the
> > "Tony fix" was incomplete in retrospect. It neglected to update the
> > NVDIMM badblocks tracking for the whole page case.
>
> So the call in nfit_handle_mce():
>    nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>                  ALIGN(mce->addr, L1_CACHE_BYTES),
>                  L1_CACHE_BYTES);
> should be replaced by
>    nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>                  ALIGN(mce->addr, L1_CACHE_BYTES),
>                  (1 << MCI_MISC_ADDR_LSB(m->misc)));
> right?

Yes.

>
> And when dax_recovery_read() calls
>    badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
> it should always, in case of 'NP', discover that 'first_bad'
> is the first sector in the poisoned page,  hence no need
> to switch to 'UC', right?

Yes.

>
> In case the 'first_bad' is in the middle of the poisoned page,
> that is, dax_recover_read() could potentially read some clean
> sectors, is there problem to
>    call _set_memory_UC(pfn, 1),
>    do the mc_safe read,
>    and then call set_memory_NP(pfn, 1)
> ?
> Why do we need to call ioremap() or vmap()?

I'm worried about concurrent operations and enabling access to threads
outside of the one currently in dax_recovery_read(). If a local vmap()
/ ioremap() is used it effectively makes the access thread local.
There might still need to be an rwsem to allow dax_recovery_write() to
fixup the pfn access and synchronize with dax_recovery_read()
operations.

>
> >
> >> The whole UC <> NP arguments to me seems to be a
> >>    "UC being harmless/workable solution to DRAM and PMEM"  versus
> >>    "NP being simpler regardless what risk it brings to PMEM".
> >> To us, PMEM is not just another driver, it is treated as memory by our
> >> customer, so why?
> >
> > It's really not a question of UC vs NP, it's a question of accurately
> > tracking how many cachelines are clobbered in an MCE event so that
> > hypervisors can punch out entire pages from any future guest access.
> > This also raises another problem in my mind, how does the hypervisor
> > learn that the poison was repaired?
>
> Good question!
>
> Presumably the "Tony fix" was for
> > a case where the guest thought the page was still accessible, but the
> > hypervisor wanted the whole thing treated as poison. It may be the
> > case that we're missing a mechanism to ask the hypervisor to consider
> > that the guest has cleared the poison. At least for PMEM described by
> > ACPI the existing ClearError DSM in the guest could be trapped by the
> > hypervisor to handle this case,
>
> I didn't even know that guest could clear poison by trapping hypervisor
> with the ClearError DSM method,

The guest can call the Clear Error DSM if the virtual BIOS provides
it. Whether that actually clears errors or not is up to the
hypervisor.

> I thought guest isn't privileged with that.

The guest does not have access to the bare metal DSM path, but the
hypervisor can certainly offer translation service for that operation.

> Would you mind to elaborate about the mechanism and maybe point
> out the code, and perhaps if you have test case to share?

I don't have a test case because until Tony's fix I did not realize
that a virtual #MC would allow the guest to learn of poisoned
locations without necessarily allowing the guest to trigger actual
poison consumption.

In other words I was operating under the assumption that telling
guests where poison is located is potentially handing the guest a way
to DoS the VMM. However, Tony's fix shows that when the hypervisor
unmaps the guest physical page it can prevent the guest from accessing
it again. So it follows that it should be ok to inject virtual #MC to
the guest, and unmap the guest physical range, but later allow that
guest physical range to be repaired if the guest asks the hypervisor
to repair the page.

Tony, does this match your understanding?

>
> but I'm not sure what to do about
> > guests that later want to use MOVDIR64B to clear errors.
> >
>
> Yeah, perhaps there is no way to prevent guest from accidentally
> clear error via MOVDIR64B, given some application rely on MOVDIR64B
> for fast data movement (straight to the media). I guess in that case,
> the consequence is false alarm, nothing disastrous, right?

You'll just continue to get false positive failures because the error
tracking will be out-of-sync with reality.

> How about allowing the potential bad-block bookkeeping gap, and
> manage to close the gap at certain checkpoints? I guess one of
> the checkpoints might be when page fault discovers a poisoned
> page?

Not sure how that would work... it's already the case that new error
entries are appended to the list at #MC time, the problem is knowing
when to clear those stale entries. I still think that needs to be at
dax_recovery_write() time.
Jane Chu Nov. 18, 2021, 7:03 p.m. UTC | #45
On 11/13/2021 12:47 PM, Dan Williams wrote:
<snip>
>>> It should know because the MCE that unmapped the page will have
>>> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
>>> consult the badblocks list to try to read the remaining good data it
>>> will see that every single cacheline is covered by badblocks, so
>>> nothing to read, and no need to establish the UC mapping. So the
>>> "Tony fix" was incomplete in retrospect. It neglected to update the
>>> NVDIMM badblocks tracking for the whole page case.
>>
>> So the call in nfit_handle_mce():
>>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>                   ALIGN(mce->addr, L1_CACHE_BYTES),
>>                   L1_CACHE_BYTES);
>> should be replaced by
>>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
>>                   ALIGN(mce->addr, L1_CACHE_BYTES),
>>                   (1 << MCI_MISC_ADDR_LSB(m->misc)));
>> right?
> 
> Yes.
> 
>>
>> And when dax_recovery_read() calls
>>     badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
>> it should always, in case of 'NP', discover that 'first_bad'
>> is the first sector in the poisoned page,  hence no need
>> to switch to 'UC', right?
> 
> Yes.
> 
>>
>> In case the 'first_bad' is in the middle of the poisoned page,
>> that is, dax_recover_read() could potentially read some clean
>> sectors, is there problem to
>>     call _set_memory_UC(pfn, 1),
>>     do the mc_safe read,
>>     and then call set_memory_NP(pfn, 1)
>> ?
>> Why do we need to call ioremap() or vmap()?
> 
> I'm worried about concurrent operations and enabling access to threads
> outside of the one currently in dax_recovery_read(). If a local vmap()
> / ioremap() is used it effectively makes the access thread local.
> There might still need to be an rwsem to allow dax_recovery_write() to
> fixup the pfn access and synchronize with dax_recovery_read()
> operations.
> 

<snip>
>> I didn't even know that guest could clear poison by trapping hypervisor
>> with the ClearError DSM method,
> 
> The guest can call the Clear Error DSM if the virtual BIOS provides
> it. Whether that actually clears errors or not is up to the
> hypervisor.
> 
>> I thought guest isn't privileged with that.
> 
> The guest does not have access to the bare metal DSM path, but the
> hypervisor can certainly offer translation service for that operation.
> 
>> Would you mind to elaborate about the mechanism and maybe point
>> out the code, and perhaps if you have test case to share?
> 
> I don't have a test case because until Tony's fix I did not realize
> that a virtual #MC would allow the guest to learn of poisoned
> locations without necessarily allowing the guest to trigger actual
> poison consumption.
> 
> In other words I was operating under the assumption that telling
> guests where poison is located is potentially handing the guest a way
> to DoS the VMM. However, Tony's fix shows that when the hypervisor
> unmaps the guest physical page it can prevent the guest from accessing
> it again. So it follows that it should be ok to inject virtual #MC to
> the guest, and unmap the guest physical range, but later allow that
> guest physical range to be repaired if the guest asks the hypervisor
> to repair the page.
> 
> Tony, does this match your understanding?
> 
>>
>> but I'm not sure what to do about
>>> guests that later want to use MOVDIR64B to clear errors.
>>>
>>
>> Yeah, perhaps there is no way to prevent guest from accidentally
>> clear error via MOVDIR64B, given some application rely on MOVDIR64B
>> for fast data movement (straight to the media). I guess in that case,
>> the consequence is false alarm, nothing disastrous, right?
> 
> You'll just continue to get false positive failures because the error
> tracking will be out-of-sync with reality.
> 
>> How about allowing the potential bad-block bookkeeping gap, and
>> manage to close the gap at certain checkpoints? I guess one of
>> the checkpoints might be when page fault discovers a poisoned
>> page?
> 
> Not sure how that would work... it's already the case that new error
> entries are appended to the list at #MC time, the problem is knowing
> when to clear those stale entries. I still think that needs to be at
> dax_recovery_write() time.
> 

Thanks Dan for taking the time to elaborate on so many details!

After some amount of digging, I have a feeling that we need to tackle
dax error handling in phases.

Phase-1: the simplest dax_recovery_write on page granularity, along
          with a fix to set the poisoned page to 'NP', and
          serialization of dax_recovery_write threads.
Phase-2: provide dax_recovery_read support and hence shrink the error
          recovery granularity.  A complication: ioremap returns an
          __iomem pointer that may only be referenced via helpers like
          readl(), which have no mc_safe variant, and I'm not sure
          whether there should be one.  There is also the
          synchronization between dax_recovery_read and
          dax_recovery_write threads.
Phase-3: the hypervisor error-record keeping issue; if there turns out
          to be an issue, I'll need to figure out how to set up a test
          case.
Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.

Right now, it seems to me that providing a Phase-1 solution is urgent,
to give customers something they can rely on.

How does this sound to you?

thanks,
-jane
Dan Williams Nov. 25, 2021, 12:16 a.m. UTC | #46
On Thu, Nov 18, 2021 at 11:04 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/13/2021 12:47 PM, Dan Williams wrote:
> <snip>
> >>> It should know because the MCE that unmapped the page will have
> >>> communicated a "whole_page()" MCE. When dax_recovery_read() goes to
> >>> consult the badblocks list to try to read the remaining good data it
> >>> will see that every single cacheline is covered by badblocks, so
> >>> nothing to read, and no need to establish the UC mapping. So the
> >>> "Tony fix" was incomplete in retrospect. It neglected to update the
> >>> NVDIMM badblocks tracking for the whole page case.
> >>
> >> So the call in nfit_handle_mce():
> >>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> >>                   ALIGN(mce->addr, L1_CACHE_BYTES),
> >>                   L1_CACHE_BYTES);
> >> should be replaced by
> >>     nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> >>                   ALIGN(mce->addr, L1_CACHE_BYTES),
> >>                   (1 << MCI_MISC_ADDR_LSB(m->misc)));
> >> right?
> >
> > Yes.
> >
> >>
> >> And when dax_recovery_read() calls
> >>     badblocks_check(bb, sector, len / 512, &first_bad, &num_bad)
> >> it should always, in case of 'NP', discover that 'first_bad'
> >> is the first sector in the poisoned page,  hence no need
> >> to switch to 'UC', right?
> >
> > Yes.
> >
> >>
> >> In case the 'first_bad' is in the middle of the poisoned page,
> >> that is, dax_recover_read() could potentially read some clean
> >> sectors, is there problem to
> >>     call _set_memory_UC(pfn, 1),
> >>     do the mc_safe read,
> >>     and then call set_memory_NP(pfn, 1)
> >> ?
> >> Why do we need to call ioremap() or vmap()?
> >
> > I'm worried about concurrent operations and enabling access to threads
> > outside of the one currently in dax_recovery_read(). If a local vmap()
> > / ioremap() is used it effectively makes the access thread local.
> > There might still need to be an rwsem to allow dax_recovery_write() to
> > fixup the pfn access and synchronize with dax_recovery_read()
> > operations.
> >
>
> <snip>
> >> I didn't even know that guest could clear poison by trapping hypervisor
> >> with the ClearError DSM method,
> >
> > The guest can call the Clear Error DSM if the virtual BIOS provides
> > it. Whether that actually clears errors or not is up to the
> > hypervisor.
> >
> >> I thought guest isn't privileged with that.
> >
> > The guest does not have access to the bare metal DSM path, but the
> > hypervisor can certainly offer translation service for that operation.
> >
> >> Would you mind to elaborate about the mechanism and maybe point
> >> out the code, and perhaps if you have test case to share?
> >
> > I don't have a test case because until Tony's fix I did not realize
> > that a virtual #MC would allow the guest to learn of poisoned
> > locations without necessarily allowing the guest to trigger actual
> > poison consumption.
> >
> > In other words I was operating under the assumption that telling
> > guests where poison is located is potentially handing the guest a way
> > to DoS the VMM. However, Tony's fix shows that when the hypervisor
> > unmaps the guest physical page it can prevent the guest from accessing
> > it again. So it follows that it should be ok to inject virtual #MC to
> > the guest, and unmap the guest physical range, but later allow that
> > guest physical range to be repaired if the guest asks the hypervisor
> > to repair the page.
> >
> > Tony, does this match your understanding?
> >
> >>
> >> but I'm not sure what to do about
> >>> guests that later want to use MOVDIR64B to clear errors.
> >>>
> >>
> >> Yeah, perhaps there is no way to prevent guest from accidentally
> >> clear error via MOVDIR64B, given some application rely on MOVDIR64B
> >> for fast data movement (straight to the media). I guess in that case,
> >> the consequence is false alarm, nothing disastrous, right?
> >
> > You'll just continue to get false positive failures because the error
> > tracking will be out-of-sync with reality.
> >
> >> How about allowing the potential bad-block bookkeeping gap, and
> >> manage to close the gap at certain checkpoints? I guess one of
> >> the checkpoints might be when page fault discovers a poisoned
> >> page?
> >
> > Not sure how that would work... it's already the case that new error
> > entries are appended to the list at #MC time, the problem is knowing
> > when to clear those stale entries. I still think that needs to be at
> > dax_recovery_write() time.
> >
>
> Thanks Dan for taking the time elaborating so much details!
>
> After some amount of digging, I have a feel that we need to take
> dax error handling in phases.
>
> Phase-1: the simplest dax_recovery_write on page granularity, along
>           with fix to set poisoned page to 'NP', serialize
>           dax_recovery_write threads.

You mean special case PAGE_SIZE overwrites when dax_direct_access()
fails, but leave out the sub-page error handling and
read-around-poison support?

That makes sense to me. Incremental is good.

> Phase-2: provide dax_recovery_read support and hence shrink the error
>           recovery granularity.  As ioremap returns __iomem pointer
>           that is only allowed to be referenced with helpers like
>           readl() which do not have a mc_safe variant, and I'm
>           not sure whether there should be.  Also the synchronization
>           between dax_recovery_read and dax_recovery_write threads.

You can just use memremap() like the driver does to drop the iomem annotation.

> Phase-3: the hypervisor error-record keeping issue, suppose there is
>           an issue, I'll need to figure out how to setup a test case.
> Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.

My expectation is that CXL supports MOVDIR64B error clearing without
needing to send the Clear Poison command. I think this can be phase3,
phase4 is the more difficult question about how / if to coordinate
with VMM poison tracking. Right now I don't see a choice but to make
it paravirtualized.

>
> Right now, it seems to me providing Phase-1 solution is urgent, to give
> something that customers can rely on.
>
> How does this sound to you?

Sounds good.
Jane Chu Nov. 30, 2021, 11 p.m. UTC | #47
On 11/24/2021 4:16 PM, Dan Williams wrote:
> On Thu, Nov 18, 2021 at 11:04 AM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> On 11/13/2021 12:47 PM, Dan Williams wrote:
>> <snip>
<snip>
>>
>> Thanks Dan for taking the time elaborating so much details!
>>
>> After some amount of digging, I have a feel that we need to take
>> dax error handling in phases.
>>
>> Phase-1: the simplest dax_recovery_write on page granularity, along
>>            with fix to set poisoned page to 'NP', serialize
>>            dax_recovery_write threads.
> 
> You mean special case PAGE_SIZE overwrites when dax_direct_access()
> fails, but leave out the sub-page error handling and
> read-around-poison support?

Yes.
> 
> That makes sense to me. Incremental is good.

Thanks!
> 
>> Phase-2: provide dax_recovery_read support and hence shrink the error
>>            recovery granularity.  As ioremap returns __iomem pointer
>>            that is only allowed to be referenced with helpers like
>>            readl() which do not have a mc_safe variant, and I'm
>>            not sure whether there should be.  Also the synchronization
>>            between dax_recovery_read and dax_recovery_write threads.
> 
> You can just use memremap() like the driver does to drop the iomem annotation.

Okay, will investigate in phase 2.

> 
>> Phase-3: the hypervisor error-record keeping issue, suppose there is
>>            an issue, I'll need to figure out how to setup a test case.
>> Phase-4: the how-to-mitigate-MOVDIR64B-false-alarm issue.
> 
> My expectation is that CXL supports MOVDIR64B error clearing without
> needing to send the Clear Poison command. I think this can be phase3,
> phase4 is the more difficult question about how / if to coordinate
> with VMM poison tracking. Right now I don't see a choice but to make
> it paravirtualized.
> 
>>
>> Right now, it seems to me providing Phase-1 solution is urgent, to give
>> something that customers can rely on.
>>
>> How does this sound to you?
> 
> Sounds good.
> 

Thanks!
-jane

Patch

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 43fa081a1adb..0bf2274c5186 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -114,8 +114,13 @@  static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 
 	if (unmap)
 		rc = set_memory_np(decoy_addr, 1);
-	else
-		rc = set_memory_uc(decoy_addr, 1);
+	else {
+		/*
+		 * Bypass memtype checks since memory-failure has shot
+		 * down mappings.
+		 */
+		rc = _set_memory_uc(decoy_addr, 1);
+	}
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;