diff mbox series

[-next] memregion: Add arch_flush_memregion() interface

Message ID 20220829212918.4039240-1-dave@stgolabs.net
State New, archived
Headers show
Series [-next] memregion: Add arch_flush_memregion() interface | expand

Commit Message

Davidlohr Bueso Aug. 29, 2022, 9:29 p.m. UTC
With CXL security features, global CPU cache flushing nvdimm requirements
are no longer specific to that subsystem, even beyond the scope of
security_ops. CXL will need such semantics for features not necessarily
limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the secure is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant or decommissioning a device.

Users must first call arch_has_flush_memregion() to know whether this
functionality is available on the architecture. Only enable it on x86-64
via the wbinvd() hammer.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---

Changes from v2 (https://lore.kernel.org/all/20220819171024.1766857-1-dave@stgolabs.net/):
- Redid to use memregion based interfaces + VMM check on x86 (Dan)
- Restricted the flushing to x86-64.

Note: Since we still are dealing with a physical "range" at this level,
added the spa range for nfit even though this is unused.

 arch/x86/Kconfig             |  1 +
 arch/x86/mm/pat/set_memory.c | 14 +++++++++++
 drivers/acpi/nfit/intel.c    | 45 ++++++++++++++++++------------------
 include/linux/memregion.h    | 25 ++++++++++++++++++++
 lib/Kconfig                  |  3 +++
 5 files changed, 65 insertions(+), 23 deletions(-)

Comments

Dan Williams Aug. 29, 2022, 11:02 p.m. UTC | #1
[ add Christoph ]

Davidlohr Bueso wrote:
> With CXL security features, global CPU cache flushing nvdimm requirements
> are no longer specific to that subsystem, even beyond the scope of
> security_ops. CXL will need such semantics for features not necessarily
> limited to persistent memory.
> 
> The functionality this is enabling is to be able to instantaneously
> secure erase potentially terabytes of memory at once and the kernel
> needs to be sure that none of the data from before the secure is still
> present in the cache. It is also used when unlocking a memory device
> where speculative reads and firmware accesses could have cached poison
> from before the device was unlocked.
> 
> This capability is typically only used once per-boot (for unlock), or
> once per bare metal provisioning event (secure erase), like when handing
> off the system to another tenant or decommissioning a device.
> 
> Users must first call arch_has_flush_memregion() to know whether this
> functionality is available on the architecture. Only enable it on x86-64
> via the wbinvd() hammer.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
> 
> Changes from v2 (https://lore.kernel.org/all/20220819171024.1766857-1-dave@stgolabs.net/):
> - Redid to use memregion based interfaces + VMM check on x86 (Dan)
> - Restricted the flushing to x86-64.
> 
> Note: Since we still are dealing with a physical "range" at this level,
> added the spa range for nfit even though this is unused.

Looks reasonable to me.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Davidlohr Bueso Sept. 7, 2022, 2:36 p.m. UTC | #2
Not sure the proper way to route this (akpm?). But unless any remaining
objections, could this be picked up?

Thanks,
Davidlohr
Borislav Petkov Sept. 7, 2022, 4:05 p.m. UTC | #3
On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 1abd5438f126..18463cb704fb 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>  #endif
>  
> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
> +bool arch_has_flush_memregion(void)
> +{
> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);

This looks really weird. Why does this need to care about HV at all?

Does that nfit stuff even run in guests?

> +EXPORT_SYMBOL(arch_has_flush_memregion);

...

> +EXPORT_SYMBOL(arch_flush_memregion);

Why aren't those exports _GPL?

Thx.
Davidlohr Bueso Sept. 7, 2022, 4:22 p.m. UTC | #4
On Wed, 07 Sep 2022, Borislav Petkov wrote:

>On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 1abd5438f126..18463cb704fb 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
>>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>>  #endif
>>
>> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
>> +bool arch_has_flush_memregion(void)
>> +{
>> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
>
>This looks really weird. Why does this need to care about HV at all?

So the context here is:

e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")

>
>Does that nfit stuff even run in guests?

No, nor does cxl. This was mostly in general a precautionary check such
that the api is unavailable in VMs.

>
>> +EXPORT_SYMBOL(arch_has_flush_memregion);
>
>...
>
>> +EXPORT_SYMBOL(arch_flush_memregion);
>
>Why aren't those exports _GPL?

Fine by me.

Thanks,
Davidlohr
Dan Williams Sept. 7, 2022, 4:46 p.m. UTC | #5
Davidlohr Bueso wrote:
> Not sure the proper way to route this (akpm?). But unless any remaining
> objections, could this be picked up?

My plan was, barring objections, to take it through the CXL tree with
its first user, the CXL security commands.
Dan Williams Sept. 7, 2022, 4:52 p.m. UTC | #6
Davidlohr Bueso wrote:
> On Wed, 07 Sep 2022, Borislav Petkov wrote:
> 
> >On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
> >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> >> index 1abd5438f126..18463cb704fb 100644
> >> --- a/arch/x86/mm/pat/set_memory.c
> >> +++ b/arch/x86/mm/pat/set_memory.c
> >> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
> >>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
> >>  #endif
> >>
> >> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
> >> +bool arch_has_flush_memregion(void)
> >> +{
> >> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> >
> >This looks really weird. Why does this need to care about HV at all?
> 
> So the context here is:
> 
> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
> 
> >
> >Does that nfit stuff even run in guests?
> 
> No, nor does cxl. This was mostly in general a precautionary check such
> that the api is unavailable in VMs.

To be clear nfit stuff and CXL does run in guests, but they do not
support secure-erase in a guest.

However, the QEMU CXL enabling is building the ability to do *guest
physical* address space management, but in that case the driver can be
paravirtualized to realize that it is not managing host-physical address
space and does not need to flush caches. That will need some indicator
to differentiate virtual CXL memory expanders from assigned devices. Is
there such a thing as a PCIe-virtio extended capability to differentiate
physical vs emulated devices?
Davidlohr Bueso Sept. 7, 2022, 5:24 p.m. UTC | #7
On Wed, 07 Sep 2022, Dan Williams wrote:

>Davidlohr Bueso wrote:
>> On Wed, 07 Sep 2022, Borislav Petkov wrote:
>>
>> >On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
>> >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> >> index 1abd5438f126..18463cb704fb 100644
>> >> --- a/arch/x86/mm/pat/set_memory.c
>> >> +++ b/arch/x86/mm/pat/set_memory.c
>> >> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
>> >>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>> >>  #endif
>> >>
>> >> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
>> >> +bool arch_has_flush_memregion(void)
>> >> +{
>> >> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
>> >
>> >This looks really weird. Why does this need to care about HV at all?
>>
>> So the context here is:
>>
>> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
>>
>> >
>> >Does that nfit stuff even run in guests?
>>
>> No, nor does cxl. This was mostly in general a precautionary check such
>> that the api is unavailable in VMs.
>
>To be clear nfit stuff and CXL does run in guests, but they do not
>support secure-erase in a guest.

Yes, I meant the feats this api enables.

>However, the QEMU CXL enabling is building the ability to do *guest
>physical* address space management, but in that case the driver can be
>paravirtualized to realize that it is not managing host-physical address
>space and does not need to flush caches. That will need some indicator
>to differentiate virtual CXL memory expanders from assigned devices. Is
>there such a thing as a PCIe-virtio extended capability to differentiate
>physical vs emulated devices?

In any case such check would be specific to each user (cxl in this case),
and outside the scope of _this_ particular api. Here we just really want to
avoid the broken TDX guest bits.

Thanks,
Davidlohr
Andrew Morton Sept. 7, 2022, 10:30 p.m. UTC | #8
I really dislike the term "flush".  Sometimes it means writeback,
sometimes it means invalidate.  Perhaps at other times it means
both.

Can we please be very clear in comments and changelogs about exactly
what this "flush" does.   With bonus points for being more specific in the 
function naming?
Dan Williams Sept. 8, 2022, 1:07 a.m. UTC | #9
Andrew Morton wrote:
> I really dislike the term "flush".  Sometimes it means writeback,
> sometimes it means invalidate.  Perhaps at other times it means
> both.
> 
> Can we please be very clear in comments and changelogs about exactly
> what this "flush" does.   With bonus points for being more specific in the 
> function naming?
> 

That's a good point, "flush" has been cargo-culted along in Linux's
cache management APIs to mean write-back-and-invalidate. In this case I
think this API is purely about invalidate. It just so happens that x86
has not historically had a global invalidate instruction readily
available which leads to the overuse of wbinvd.

It would be nice to make clear that this API is purely about
invalidating any data cached for a physical address impacted by address
space management event (secure erase / new region provision). Write-back
is an unnecessary side-effect.

So how about:

s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
Borislav Petkov Sept. 8, 2022, 4:31 a.m. UTC | #10
On Wed, Sep 07, 2022 at 09:22:45AM -0700, Davidlohr Bueso wrote:
> So the context here is:
> 
> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")

Please add this to your commit message so that at least we know how this
HV dependency came about.

Thx.
Borislav Petkov Sept. 8, 2022, 4:35 a.m. UTC | #11
On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> To be clear nfit stuff and CXL does run in guests, but they do not
> support secure-erase in a guest.
> 
> However, the QEMU CXL enabling is building the ability to do *guest
> physical* address space management, but in that case the driver can be
> paravirtualized to realize that it is not managing host-physical address
> space and does not need to flush caches. That will need some indicator
> to differentiate virtual CXL memory expanders from assigned devices.

Sounds to me like that check should be improved later to ask
whether the kernel is managing host-physical address space, maybe
arch_flush_memregion() should check whether the address it is supposed
to flush is host-physical and exit early if not...
Dan Williams Sept. 8, 2022, 6:53 a.m. UTC | #12
Borislav Petkov wrote:
> On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> > To be clear nfit stuff and CXL does run in guests, but they do not
> > support secure-erase in a guest.
> > 
> > However, the QEMU CXL enabling is building the ability to do *guest
> > physical* address space management, but in that case the driver can be
> > paravirtualized to realize that it is not managing host-physical address
> > space and does not need to flush caches. That will need some indicator
> > to differentiate virtual CXL memory expanders from assigned devices.
> 
> Sounds to me like that check should be improved later to ask
> whether the kernel is managing host-physical address space, maybe
> arch_flush_memregion() should check whether the address it is supposed
> to flush is host-physical and exit early if not...

Even though I raised the possibility of guest passthrough of a CXL
memory expander, I do not think it could work in practice without it
being a gigantic security nightmare. So it is probably safe to just do
the hypervisor check and assume that there's no such thing as guest
management of host physical address space.
Jonathan Cameron Sept. 8, 2022, 1 p.m. UTC | #13
On Wed, 7 Sep 2022 23:53:31 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Borislav Petkov wrote:
> > On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:  
> > > To be clear nfit stuff and CXL does run in guests, but they do not
> > > support secure-erase in a guest.
> > > 
> > > However, the QEMU CXL enabling is building the ability to do *guest
> > > physical* address space management, but in that case the driver can be
> > > paravirtualized to realize that it is not managing host-physical address
> > > space and does not need to flush caches. That will need some indicator
> > > to differentiate virtual CXL memory expanders from assigned devices.  
> > 
> > Sounds to me like that check should be improved later to ask
> > whether the kernel is managing host-physical address space, maybe
> > arch_flush_memregion() should check whether the address it is supposed
> > to flush is host-physical and exit early if not...  
> 
> Even though I raised the possibility of guest passthrough of a CXL
> memory expander, I do not think it could work in practice without it
> being a gigantic security nightmare. So it is probably safe to just do
> the hypervisor check and assume that there's no such thing as guest
> management of host physical address space.

Agreed.  Other than occasional convenience of doing driver development
in a VM (they reboot quickly ;) can't see why a product system would need
to pass a CXL type 3 device through and as you say security would be 'interesting'
if it were done.  Might be GPU usecases down the line but I'm doubtful on that
as well.

Jonathan
Jonathan Cameron Sept. 8, 2022, 1:13 p.m. UTC | #14
On Wed, 7 Sep 2022 18:07:31 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Andrew Morton wrote:
> > I really dislike the term "flush".  Sometimes it means writeback,
> > sometimes it means invalidate.  Perhaps at other times it means
> > both.
> > 
> > Can we please be very clear in comments and changelogs about exactly
> > what this "flush" does.   With bonus points for being more specific in the 
> > function naming?
> >   
> 
> That's a good point, "flush" has been cargo-culted along in Linux's
> cache management APIs to mean write-back-and-invalidate. In this case I
> think this API is purely about invalidate. It just so happens that x86
> has not historically had a global invalidate instruction readily
> available which leads to the overuse of wbinvd.
> 
> It would be nice to make clear that this API is purely about
> invalidating any data cached for a physical address impacted by address
> space management event (secure erase / new region provision). Write-back
> is an unnecessary side-effect.
> 
> So how about:
> 
> s/arch_flush_memregion/cpu_cache_invalidate_memregion/?

Want to indicate it 'might' write back perhaps?
So could be invalidate or clean and invalidate (using arm ARM terms just to add
to the confusion ;)

Feels like there will be potential race conditions where that matters as we might
force stale data to be written back.

Perhaps a comment is enough for that. Anyone have the "famous last words" feeling?

Jonathan
Dan Williams Sept. 8, 2022, 10:51 p.m. UTC | #15
Jonathan Cameron wrote:
> On Wed, 7 Sep 2022 18:07:31 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Andrew Morton wrote:
> > > I really dislike the term "flush".  Sometimes it means writeback,
> > > sometimes it means invalidate.  Perhaps at other times it means
> > > both.
> > > 
> > > Can we please be very clear in comments and changelogs about exactly
> > > what this "flush" does.   With bonus points for being more specific in the 
> > > function naming?
> > >   
> > 
> > That's a good point, "flush" has been cargo-culted along in Linux's
> > cache management APIs to mean write-back-and-invalidate. In this case I
> > think this API is purely about invalidate. It just so happens that x86
> > has not historically had a global invalidate instruction readily
> > available which leads to the overuse of wbinvd.
> > 
> > It would be nice to make clear that this API is purely about
> > invalidating any data cached for a physical address impacted by address
> > space management event (secure erase / new region provision). Write-back
> > is an unnecessary side-effect.
> > 
> > So how about:
> > 
> > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> 
> Want to indicate it 'might' write back perhaps?
> So could be invalidate or clean and invalidate (using arm ARM terms just to add
> to the confusion ;)
> 
> Feels like there will be potential race conditions where that matters as we might
> force stale data to be written back.
> 
> Perhaps a comment is enough for that. Anyone have the "famous last words" feeling?

Is "invalidate" not clear that write-back is optional? Maybe not.

Also, I realized that we tried to include the address range to allow for
the possibility of flushing by virtual address range, but that
overcomplicates the use. I.e. if someone issue secure erase and the
region association is not established does that mean that mean that the
cache invalidation is not needed? It could be the case that someone
disables a device, does the secure erase, and then reattaches to the
same region. The cache invalidation is needed, but at the time of the
secure erase the HPA was unknown.

All this to say that I feel the bikeshedding will need to continue until
morale improves.

I notice that the DMA API uses 'sync' to indicate, "make this memory
consistent/coherent for the CPU or the device", so how about an API like

    memregion_sync_for_cpu(int res_desc)

...where the @res_desc would be IORES_DESC_CXL for all CXL and
IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
Andrew Morton Sept. 8, 2022, 11 p.m. UTC | #16
On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Wed, 7 Sep 2022 18:07:31 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > Andrew Morton wrote:
> > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > both.
> > > > 
> > > > Can we please be very clear in comments and changelogs about exactly
> > > > what this "flush" does.   With bonus points for being more specific in the 
> > > > function naming?
> > > >   
> > > 
> > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > think this API is purely about invalidate. It just so happens that x86
> > > has not historically had a global invalidate instruction readily
> > > available which leads to the overuse of wbinvd.
> > > 
> > > It would be nice to make clear that this API is purely about
> > > invalidating any data cached for a physical address impacted by address
> > > space management event (secure erase / new region provision). Write-back
> > > is an unnecessary side-effect.
> > > 
> > > So how about:
> > > 
> > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > 
> > Want to indicate it 'might' write back perhaps?
> > So could be invalidate or clean and invalidate (using arm ARM terms just to add
> > to the confusion ;)
> > 
> > Feels like there will be potential race conditions where that matters as we might
> > force stale data to be written back.
> > 
> > Perhaps a comment is enough for that. Anyone have the "famous last words" feeling?
> 
> Is "invalidate" not clear that write-back is optional? Maybe not.

Yes, I'd say that "invalidate" means "dirty stuff may of may not have
been written back".  Ditto for invalidate_inode_pages2().

> Also, I realized that we tried to include the address range to allow for
> the possibility of flushing by virtual address range, but that
> overcomplicates the use. I.e. if someone issue secure erase and the
> region association is not established does that mean that mean that the
> cache invalidation is not needed? It could be the case that someone
> disables a device, does the secure erase, and then reattaches to the
> same region. The cache invalidation is needed, but at the time of the
> secure erase the HPA was unknown.
> 
> All this to say that I feel the bikeshedding will need to continue until
> morale improves.
> 
> I notice that the DMA API uses 'sync' to indicate, "make this memory
> consistent/coherent for the CPU or the device", so how about an API like
> 
>     memregion_sync_for_cpu(int res_desc)
> 
> ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.

"sync" is another of my pet peeves ;) In filesystem land, at least. 
Does it mean "start writeback and return" or does it mean "start
writeback, wait for its completion then return".
Dan Williams Sept. 8, 2022, 11:22 p.m. UTC | #17
Andrew Morton wrote:
> On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > > On Wed, 7 Sep 2022 18:07:31 -0700
> > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > 
> > > > Andrew Morton wrote:
> > > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > > both.
> > > > > 
> > > > > Can we please be very clear in comments and changelogs about exactly
> > > > > what this "flush" does.   With bonus points for being more specific in the 
> > > > > function naming?
> > > > >   
> > > > 
> > > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > > think this API is purely about invalidate. It just so happens that x86
> > > > has not historically had a global invalidate instruction readily
> > > > available which leads to the overuse of wbinvd.
> > > > 
> > > > It would be nice to make clear that this API is purely about
> > > > invalidating any data cached for a physical address impacted by address
> > > > space management event (secure erase / new region provision). Write-back
> > > > is an unnecessary side-effect.
> > > > 
> > > > So how about:
> > > > 
> > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > > 
> > > Want to indicate it 'might' write back perhaps?
> > > So could be invalidate or clean and invalidate (using arm ARM terms just to add
> > > to the confusion ;)
> > > 
> > > Feels like there will be potential race conditions where that matters as we might
> > > force stale data to be written back.
> > > 
> > > Perhaps a comment is enough for that. Anyone have the "famous last words" feeling?
> > 
> > Is "invalidate" not clear that write-back is optional? Maybe not.
> 
> Yes, I'd say that "invalidate" means "dirty stuff may of may not have
> been written back".  Ditto for invalidate_inode_pages2().
> 
> > Also, I realized that we tried to include the address range to allow for
> > the possibility of flushing by virtual address range, but that
> > overcomplicates the use. I.e. if someone issue secure erase and the
> > region association is not established does that mean that mean that the
> > cache invalidation is not needed? It could be the case that someone
> > disables a device, does the secure erase, and then reattaches to the
> > same region. The cache invalidation is needed, but at the time of the
> > secure erase the HPA was unknown.
> > 
> > All this to say that I feel the bikeshedding will need to continue until
> > morale improves.
> > 
> > I notice that the DMA API uses 'sync' to indicate, "make this memory
> > consistent/coherent for the CPU or the device", so how about an API like
> > 
> >     memregion_sync_for_cpu(int res_desc)
> > 
> > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
> 
> "sync" is another of my pet peeves ;) In filesystem land, at least. 
> Does it mean "start writeback and return" or does it mean "start
> writeback, wait for its completion then return".  

Ok, no "sync" :).

/**
 * cpu_cache_invalidate_memregion - drop any CPU cached data for
 *     memregions described by @res_des
 * @res_desc: one of the IORES_DESC_* types
 *
 * Perform cache maintenance after a memory event / operation that
 * changes the contents of physical memory in a cache-incoherent manner.
 * For example, memory-device secure erase, or provisioning new CXL
 * regions. This routine may or may not write back any dirty contents
 * while performing the invalidation.
 *
 * Returns 0 on success or negative error code on a failure to perform
 * the cache maintenance.
 */
int cpu_cache_invalidate_memregion(int res_desc)

??
Jonathan Cameron Sept. 9, 2022, 11:43 a.m. UTC | #18
On Thu, 8 Sep 2022 16:22:26 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Andrew Morton wrote:
> > On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> > > Jonathan Cameron wrote:  
> > > > On Wed, 7 Sep 2022 18:07:31 -0700
> > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > >   
> > > > > Andrew Morton wrote:  
> > > > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > > > both.
> > > > > > 
> > > > > > Can we please be very clear in comments and changelogs about exactly
> > > > > > what this "flush" does.   With bonus points for being more specific in the 
> > > > > > function naming?
> > > > > >     
> > > > > 
> > > > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > > > think this API is purely about invalidate. It just so happens that x86
> > > > > has not historically had a global invalidate instruction readily
> > > > > available which leads to the overuse of wbinvd.
> > > > > 
> > > > > It would be nice to make clear that this API is purely about
> > > > > invalidating any data cached for a physical address impacted by address
> > > > > space management event (secure erase / new region provision). Write-back
> > > > > is an unnecessary side-effect.
> > > > > 
> > > > > So how about:
> > > > > 
> > > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?  
> > > > 
> > > > Want to indicate it 'might' write back perhaps?
> > > > So could be invalidate or clean and invalidate (using arm ARM terms just to add
> > > > to the confusion ;)
> > > > 
> > > > Feels like there will be potential race conditions where that matters as we might
> > > > force stale data to be written back.
> > > > 
> > > > Perhaps a comment is enough for that. Anyone have the "famous last words" feeling?  
> > > 
> > > Is "invalidate" not clear that write-back is optional? Maybe not.  
> > 
> > Yes, I'd say that "invalidate" means "dirty stuff may of may not have
> > been written back".  Ditto for invalidate_inode_pages2().
> >   
> > > Also, I realized that we tried to include the address range to allow for
> > > the possibility of flushing by virtual address range, but that
> > > overcomplicates the use. I.e. if someone issue secure erase and the
> > > region association is not established does that mean that mean that the
> > > cache invalidation is not needed? It could be the case that someone
> > > disables a device, does the secure erase, and then reattaches to the
> > > same region. The cache invalidation is needed, but at the time of the
> > > secure erase the HPA was unknown.
> > > 
> > > All this to say that I feel the bikeshedding will need to continue until
> > > morale improves.
> > > 
> > > I notice that the DMA API uses 'sync' to indicate, "make this memory
> > > consistent/coherent for the CPU or the device", so how about an API like
> > > 
> > >     memregion_sync_for_cpu(int res_desc)
> > > 
> > > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> > > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.  
> > 
> > "sync" is another of my pet peeves ;) In filesystem land, at least. 
> > Does it mean "start writeback and return" or does it mean "start
> > writeback, wait for its completion then return".    
> 
> Ok, no "sync" :).
> 
> /**
>  * cpu_cache_invalidate_memregion - drop any CPU cached data for
>  *     memregions described by @res_des
>  * @res_desc: one of the IORES_DESC_* types
>  *
>  * Perform cache maintenance after a memory event / operation that
>  * changes the contents of physical memory in a cache-incoherent manner.
>  * For example, memory-device secure erase, or provisioning new CXL
>  * regions. This routine may or may not write back any dirty contents
>  * while performing the invalidation.
>  *
>  * Returns 0 on success or negative error code on a failure to perform
>  * the cache maintenance.
>  */
> int cpu_cache_invalidate_memregion(int res_desc)

lgtm

> 
> ??
Davidlohr Bueso Sept. 13, 2022, 3:56 p.m. UTC | #19
On Fri, 09 Sep 2022, Jonathan Cameron wrote:

>On Thu, 8 Sep 2022 16:22:26 -0700
>Dan Williams <dan.j.williams@intel.com> wrote:
>
>> Andrew Morton wrote:
>> > On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>> >
>> > > Jonathan Cameron wrote:
>> > > > On Wed, 7 Sep 2022 18:07:31 -0700
>> > > > Dan Williams <dan.j.williams@intel.com> wrote:
>> > > >
>> > > > > Andrew Morton wrote:
>> > > > > > I really dislike the term "flush".  Sometimes it means writeback,
>> > > > > > sometimes it means invalidate.  Perhaps at other times it means
>> > > > > > both.
>> > > > > >
>> > > > > > Can we please be very clear in comments and changelogs about exactly
>> > > > > > what this "flush" does.   With bonus points for being more specific in the
>> > > > > > function naming?
>> > > > > >
>> > > > >
>> > > > > That's a good point, "flush" has been cargo-culted along in Linux's
>> > > > > cache management APIs to mean write-back-and-invalidate. In this case I
>> > > > > think this API is purely about invalidate. It just so happens that x86
>> > > > > has not historically had a global invalidate instruction readily
>> > > > > available which leads to the overuse of wbinvd.
>> > > > >
>> > > > > It would be nice to make clear that this API is purely about
>> > > > > invalidating any data cached for a physical address impacted by address
>> > > > > space management event (secure erase / new region provision). Write-back
>> > > > > is an unnecessary side-effect.
>> > > > >
>> > > > > So how about:
>> > > > >
>> > > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
>> > > >
>> > > > Want to indicate it 'might' write back perhaps?
>> > > > So could be invalidate or clean and invalidate (using arm ARM terms just to add
>> > > > to the confusion ;)
>> > > >
>> > > > Feels like there will be potential race conditions where that matters as we might
>> > > > force stale data to be written back.
>> > > >
>> > > > Perhaps a comment is enough for that. Anyone have the "famous last words" feeling?
>> > >
>> > > Is "invalidate" not clear that write-back is optional? Maybe not.
>> >
>> > Yes, I'd say that "invalidate" means "dirty stuff may of may not have
>> > been written back".  Ditto for invalidate_inode_pages2().
>> >
>> > > Also, I realized that we tried to include the address range to allow for
>> > > the possibility of flushing by virtual address range, but that
>> > > overcomplicates the use. I.e. if someone issue secure erase and the
>> > > region association is not established does that mean that mean that the
>> > > cache invalidation is not needed? It could be the case that someone
>> > > disables a device, does the secure erase, and then reattaches to the
>> > > same region. The cache invalidation is needed, but at the time of the
>> > > secure erase the HPA was unknown.
>> > >
>> > > All this to say that I feel the bikeshedding will need to continue until
>> > > morale improves.
>> > >
>> > > I notice that the DMA API uses 'sync' to indicate, "make this memory
>> > > consistent/coherent for the CPU or the device", so how about an API like
>> > >
>> > >     memregion_sync_for_cpu(int res_desc)
>> > >
>> > > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
>> > > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
>> >
>> > "sync" is another of my pet peeves ;) In filesystem land, at least.
>> > Does it mean "start writeback and return" or does it mean "start
>> > writeback, wait for its completion then return".
>>
>> Ok, no "sync" :).
>>
>> /**
>>  * cpu_cache_invalidate_memregion - drop any CPU cached data for
>>  *     memregions described by @res_des
>>  * @res_desc: one of the IORES_DESC_* types
>>  *
>>  * Perform cache maintenance after a memory event / operation that
>>  * changes the contents of physical memory in a cache-incoherent manner.
>>  * For example, memory-device secure erase, or provisioning new CXL
>>  * regions. This routine may or may not write back any dirty contents
>>  * while performing the invalidation.
>>  *
>>  * Returns 0 on success or negative error code on a failure to perform
>>  * the cache maintenance.
>>  */
>> int cpu_cache_invalidate_memregion(int res_desc)
>
>lgtm

Likewise, and I don't see anyone else objecting so I'll go ahead and send
a new iteration.

Thanks,
Davidlohr
diff mbox series

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..594e6b6a4925 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@  config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MEMREGION_INVALIDATE    if X86_64
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..18463cb704fb 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -330,6 +330,20 @@  void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
+bool arch_has_flush_memregion(void)
+{
+	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
+}
+EXPORT_SYMBOL(arch_has_flush_memregion);
+
+void arch_flush_memregion(phys_addr_t phys, resource_size_t size)
+{
+	wbinvd_on_all_cpus();
+}
+EXPORT_SYMBOL(arch_flush_memregion);
+#endif
+
 static void __cpa_flush_all(void *arg)
 {
 	unsigned long cache = (unsigned long)arg;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 8dd792a55730..32e622f51cde 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -3,6 +3,7 @@ 
 #include <linux/libnvdimm.h>
 #include <linux/ndctl.h>
 #include <linux/acpi.h>
+#include <linux/memregion.h>
 #include <asm/smp.h>
 #include "intel.h"
 #include "nfit.h"
@@ -190,12 +191,11 @@  static int intel_security_change_key(struct nvdimm *nvdimm,
 	}
 }
 
-static void nvdimm_invalidate_cache(void);
-
 static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
 		const struct nvdimm_key_data *key_data)
 {
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	struct {
 		struct nd_cmd_pkg pkg;
 		struct nd_intel_unlock_unit cmd;
@@ -213,6 +213,9 @@  static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
 	if (!test_bit(NVDIMM_INTEL_UNLOCK_UNIT, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	memcpy(nd_cmd.cmd.passphrase, key_data->data,
 			sizeof(nd_cmd.cmd.passphrase));
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -228,7 +231,7 @@  static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
 	}
 
 	/* DIMM unlocked, invalidate all CPU caches before we read it */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 
 	return 0;
 }
@@ -279,6 +282,7 @@  static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
 {
 	int rc;
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	unsigned int cmd = ptype == NVDIMM_MASTER ?
 		NVDIMM_INTEL_MASTER_SECURE_ERASE : NVDIMM_INTEL_SECURE_ERASE;
 	struct {
@@ -297,8 +301,11 @@  static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
 	if (!test_bit(cmd, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	/* flush all cache before we erase DIMM */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 	memcpy(nd_cmd.cmd.passphrase, key->data,
 			sizeof(nd_cmd.cmd.passphrase));
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -318,7 +325,7 @@  static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
 	}
 
 	/* DIMM erased, invalidate all CPU caches before we read it */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 	return 0;
 }
 
@@ -326,6 +333,7 @@  static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
 {
 	int rc;
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	struct {
 		struct nd_cmd_pkg pkg;
 		struct nd_intel_query_overwrite cmd;
@@ -341,6 +349,9 @@  static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
 	if (!test_bit(NVDIMM_INTEL_QUERY_OVERWRITE, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
 	if (rc < 0)
 		return rc;
@@ -355,7 +366,7 @@  static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
 	}
 
 	/* flush all cache before we make the nvdimms available */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 	return 0;
 }
 
@@ -364,6 +375,7 @@  static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
 {
 	int rc;
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	struct {
 		struct nd_cmd_pkg pkg;
 		struct nd_intel_overwrite cmd;
@@ -380,8 +392,11 @@  static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
 	if (!test_bit(NVDIMM_INTEL_OVERWRITE, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	/* flush all cache before we erase DIMM */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 	memcpy(nd_cmd.cmd.passphrase, nkey->data,
 			sizeof(nd_cmd.cmd.passphrase));
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -401,22 +416,6 @@  static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
 	}
 }
 
-/*
- * TODO: define a cross arch wbinvd equivalent when/if
- * NVDIMM_FAMILY_INTEL command support arrives on another arch.
- */
-#ifdef CONFIG_X86
-static void nvdimm_invalidate_cache(void)
-{
-	wbinvd_on_all_cpus();
-}
-#else
-static void nvdimm_invalidate_cache(void)
-{
-	WARN_ON_ONCE("cache invalidation required after unlock\n");
-}
-#endif
-
 static const struct nvdimm_security_ops __intel_security_ops = {
 	.get_flags = intel_security_flags,
 	.freeze = intel_security_freeze,
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..c35201c0696f 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -20,4 +20,29 @@  static inline void memregion_free(int id)
 {
 }
 #endif
+
+/*
+ * Device memory technologies like NVDIMM and CXL have events like
+ * secure erase and dynamic region provision that can invalidate an
+ * entire physical memory address range at once. Limit that
+ * functionality to architectures that have an efficient way to
+ * writeback and invalidate potentially terabytes of memory at once.
+ *
+ * To ensure this, users must first call arch_has_flush_memregion()
+ * before anything, to verify the operation is feasible.
+ */
+#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
+void arch_flush_memregion(phys_addr_t phys, resource_size_t size);
+bool arch_has_flush_memregion(void);
+#else
+static inline bool arch_has_flush_memregion(void)
+{
+       return false;
+}
+static inline void arch_flush_memregion(phys_addr_t phys, resource_size_t size)
+{
+	WARN_ON_ONCE("cache invalidation required");
+}
+#endif
+
 #endif /* _MEMREGION_H_ */
diff --git a/lib/Kconfig b/lib/Kconfig
index dc1ab2ed1dc6..8319e7731e7b 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -662,6 +662,9 @@  config ARCH_HAS_PMEM_API
 config MEMREGION
 	bool
 
+config ARCH_HAS_MEMREGION_INVALIDATE
+       bool
+
 config ARCH_HAS_MEMREMAP_COMPAT_ALIGN
 	bool