Message ID: 20220829212918.4039240-1-dave@stgolabs.net
State:      New, archived
Series:     [-next] memregion: Add arch_flush_memregion() interface
[ add Christoph ]

Davidlohr Bueso wrote:
> With CXL security features, global CPU cache flushing nvdimm requirements
> are no longer specific to that subsystem, even beyond the scope of
> security_ops. CXL will need such semantics for features not necessarily
> limited to persistent memory.
>
> The functionality this is enabling is to be able to instantaneously
> secure erase potentially terabytes of memory at once and the kernel
> needs to be sure that none of the data from before the secure erase is
> still present in the cache. It is also used when unlocking a memory
> device where speculative reads and firmware accesses could have cached
> poison from before the device was unlocked.
>
> This capability is typically only used once per boot (for unlock), or
> once per bare metal provisioning event (secure erase), like when handing
> off the system to another tenant or decommissioning a device.
>
> Users must first call arch_has_flush_memregion() to know whether this
> functionality is available on the architecture. Only enable it on x86-64
> via the wbinvd() hammer.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>
> Changes from v2 (https://lore.kernel.org/all/20220819171024.1766857-1-dave@stgolabs.net/):
> - Redid to use memregion based interfaces + VMM check on x86 (Dan)
> - Restricted the flushing to x86-64.
>
> Note: Since we still are dealing with a physical "range" at this level,
> added the spa range for nfit even though this is unused.

Looks reasonable to me.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Not sure of the proper way to route this (akpm?). But unless there are
any remaining objections, could this be picked up?

Thanks,
Davidlohr
On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 1abd5438f126..18463cb704fb 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>  #endif
>
> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
> +bool arch_has_flush_memregion(void)
> +{
> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);

This looks really weird. Why does this need to care about HV at all?

Does that nfit stuff even run in guests?

> +EXPORT_SYMBOL(arch_has_flush_memregion);

...

> +EXPORT_SYMBOL(arch_flush_memregion);

Why aren't those exports _GPL?

Thx.
On Wed, 07 Sep 2022, Borislav Petkov wrote:
>On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 1abd5438f126..18463cb704fb 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
>>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>>  #endif
>>
>> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
>> +bool arch_has_flush_memregion(void)
>> +{
>> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
>
>This looks really weird. Why does this need to care about HV at all?

So the context here is:

e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")

>
>Does that nfit stuff even run in guests?

No, nor does cxl. This was mostly in general a precautionary check such
that the api is unavailable in VMs.

>
>> +EXPORT_SYMBOL(arch_has_flush_memregion);
>
>...
>
>> +EXPORT_SYMBOL(arch_flush_memregion);
>
>Why aren't those exports _GPL?

Fine by me.

Thanks,
Davidlohr
Davidlohr Bueso wrote:
> Not sure the proper way to route this (akpm?). But unless any remaining
> objections, could this be picked up?

My plan was, barring objections, to take it through the CXL tree with
its first user, the CXL security commands.
Davidlohr Bueso wrote:
> On Wed, 07 Sep 2022, Borislav Petkov wrote:
>
> >On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
> >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> >> index 1abd5438f126..18463cb704fb 100644
> >> --- a/arch/x86/mm/pat/set_memory.c
> >> +++ b/arch/x86/mm/pat/set_memory.c
> >> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
> >>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
> >>  #endif
> >>
> >> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
> >> +bool arch_has_flush_memregion(void)
> >> +{
> >> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> >
> >This looks really weird. Why does this need to care about HV at all?
>
> So the context here is:
>
> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
>
> >
> >Does that nfit stuff even run in guests?
>
> No, nor does cxl. This was mostly in general a precautionary check such
> that the api is unavailable in VMs.

To be clear nfit stuff and CXL does run in guests, but they do not
support secure-erase in a guest.

However, the QEMU CXL enabling is building the ability to do *guest
physical* address space management, but in that case the driver can be
paravirtualized to realize that it is not managing host-physical address
space and does not need to flush caches. That will need some indicator
to differentiate virtual CXL memory expanders from assigned devices. Is
there such a thing as a PCIe-virtio extended capability to differentiate
physical vs emulated devices?
On Wed, 07 Sep 2022, Dan Williams wrote:
>Davidlohr Bueso wrote:
>> On Wed, 07 Sep 2022, Borislav Petkov wrote:
>>
>> >On Mon, Aug 29, 2022 at 02:29:18PM -0700, Davidlohr Bueso wrote:
>> >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> >> index 1abd5438f126..18463cb704fb 100644
>> >> --- a/arch/x86/mm/pat/set_memory.c
>> >> +++ b/arch/x86/mm/pat/set_memory.c
>> >> @@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
>> >>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>> >>  #endif
>> >>
>> >> +#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
>> >> +bool arch_has_flush_memregion(void)
>> >> +{
>> >> +	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
>> >
>> >This looks really weird. Why does this need to care about HV at all?
>>
>> So the context here is:
>>
>> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")
>>
>> >
>> >Does that nfit stuff even run in guests?
>>
>> No, nor does cxl. This was mostly in general a precautionary check such
>> that the api is unavailable in VMs.
>
>To be clear nfit stuff and CXL does run in guests, but they do not
>support secure-erase in a guest.

Yes, I meant the feats this api enables.

>However, the QEMU CXL enabling is building the ability to do *guest
>physical* address space management, but in that case the driver can be
>paravirtualized to realize that it is not managing host-physical address
>space and does not need to flush caches. That will need some indicator
>to differentiate virtual CXL memory expanders from assigned devices. Is
>there such a thing as a PCIe-virtio extended capability to differentiate
>physical vs emulated devices?

In any case such a check would be specific to each user (cxl in this
case), and outside the scope of _this_ particular api. Here we just
really want to avoid the broken TDX guest bits.

Thanks,
Davidlohr
I really dislike the term "flush". Sometimes it means writeback, sometimes it means invalidate. Perhaps at other times it means both. Can we please be very clear in comments and changelogs about exactly what this "flush" does. With bonus points for being more specific in the function naming?
Andrew Morton wrote:
> I really dislike the term "flush". Sometimes it means writeback,
> sometimes it means invalidate. Perhaps at other times it means
> both.
>
> Can we please be very clear in comments and changelogs about exactly
> what this "flush" does. With bonus points for being more specific in the
> function naming?

That's a good point, "flush" has been cargo-culted along in Linux's
cache management APIs to mean write-back-and-invalidate. In this case I
think this API is purely about invalidate. It just so happens that x86
has not historically had a global invalidate instruction readily
available, which leads to the overuse of wbinvd.

It would be nice to make clear that this API is purely about
invalidating any data cached for a physical address impacted by an
address space management event (secure erase / new region provision).
Write-back is an unnecessary side-effect.

So how about:

s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
On Wed, Sep 07, 2022 at 09:22:45AM -0700, Davidlohr Bueso wrote:
> So the context here is:
>
> e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")

Please add this to your commit message so that at least we know how
this HV dependency came about.

Thx.
On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> To be clear nfit stuff and CXL does run in guests, but they do not
> support secure-erase in a guest.
>
> However, the QEMU CXL enabling is building the ability to do *guest
> physical* address space management, but in that case the driver can be
> paravirtualized to realize that it is not managing host-physical address
> space and does not need to flush caches. That will need some indicator
> to differentiate virtual CXL memory expanders from assigned devices.

Sounds to me like that check should be improved later to ask
whether the kernel is managing host-physical address space, maybe
arch_flush_memregion() should check whether the address it is supposed
to flush is host-physical and exit early if not...
Borislav Petkov wrote:
> On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> > To be clear nfit stuff and CXL does run in guests, but they do not
> > support secure-erase in a guest.
> >
> > However, the QEMU CXL enabling is building the ability to do *guest
> > physical* address space management, but in that case the driver can be
> > paravirtualized to realize that it is not managing host-physical address
> > space and does not need to flush caches. That will need some indicator
> > to differentiate virtual CXL memory expanders from assigned devices.
>
> Sounds to me like that check should be improved later to ask
> whether the kernel is managing host-physical address space, maybe
> arch_flush_memregion() should check whether the address it is supposed
> to flush is host-physical and exit early if not...

Even though I raised the possibility of guest passthrough of a CXL
memory expander, I do not think it could work in practice without it
being a gigantic security nightmare. So it is probably safe to just do
the hypervisor check and assume that there's no such thing as guest
management of host physical address space.
On Wed, 7 Sep 2022 23:53:31 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Borislav Petkov wrote:
> > On Wed, Sep 07, 2022 at 09:52:17AM -0700, Dan Williams wrote:
> > > To be clear nfit stuff and CXL does run in guests, but they do not
> > > support secure-erase in a guest.
> > >
> > > However, the QEMU CXL enabling is building the ability to do *guest
> > > physical* address space management, but in that case the driver can be
> > > paravirtualized to realize that it is not managing host-physical address
> > > space and does not need to flush caches. That will need some indicator
> > > to differentiate virtual CXL memory expanders from assigned devices.
> >
> > Sounds to me like that check should be improved later to ask
> > whether the kernel is managing host-physical address space, maybe
> > arch_flush_memregion() should check whether the address it is supposed
> > to flush is host-physical and exit early if not...
>
> Even though I raised the possibility of guest passthrough of a CXL
> memory expander, I do not think it could work in practice without it
> being a gigantic security nightmare. So it is probably safe to just do
> the hypervisor check and assume that there's no such thing as guest
> management of host physical address space.

Agreed. Other than the occasional convenience of doing driver
development in a VM (they reboot quickly ;) I can't see why a production
system would need to pass a CXL type 3 device through, and as you say
security would be 'interesting' if it were done. Might be GPU usecases
down the line but I'm doubtful on that as well.

Jonathan
On Wed, 7 Sep 2022 18:07:31 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Andrew Morton wrote:
> > I really dislike the term "flush". Sometimes it means writeback,
> > sometimes it means invalidate. Perhaps at other times it means
> > both.
> >
> > Can we please be very clear in comments and changelogs about exactly
> > what this "flush" does. With bonus points for being more specific in the
> > function naming?
>
> That's a good point, "flush" has been cargo-culted along in Linux's
> cache management APIs to mean write-back-and-invalidate. In this case I
> think this API is purely about invalidate. It just so happens that x86
> has not historically had a global invalidate instruction readily
> available which leads to the overuse of wbinvd.
>
> It would be nice to make clear that this API is purely about
> invalidating any data cached for a physical address impacted by an
> address space management event (secure erase / new region provision).
> Write-back is an unnecessary side-effect.
>
> So how about:
>
> s/arch_flush_memregion/cpu_cache_invalidate_memregion/?

Want to indicate it 'might' write back perhaps? So it could be
invalidate, or clean and invalidate (using Arm ARM terms just to add
to the confusion ;)

Feels like there will be potential race conditions where that matters,
as we might force stale data to be written back.

Perhaps a comment is enough for that. Anyone have the "famous last
words" feeling?

Jonathan
Jonathan Cameron wrote:
> On Wed, 7 Sep 2022 18:07:31 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>
> > Andrew Morton wrote:
> > > I really dislike the term "flush". Sometimes it means writeback,
> > > sometimes it means invalidate. Perhaps at other times it means
> > > both.
> > >
> > > Can we please be very clear in comments and changelogs about exactly
> > > what this "flush" does. With bonus points for being more specific in the
> > > function naming?
> >
> > That's a good point, "flush" has been cargo-culted along in Linux's
> > cache management APIs to mean write-back-and-invalidate. In this case I
> > think this API is purely about invalidate. It just so happens that x86
> > has not historically had a global invalidate instruction readily
> > available which leads to the overuse of wbinvd.
> >
> > It would be nice to make clear that this API is purely about
> > invalidating any data cached for a physical address impacted by an
> > address space management event (secure erase / new region provision).
> > Write-back is an unnecessary side-effect.
> >
> > So how about:
> >
> > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
>
> Want to indicate it 'might' write back perhaps?
> So could be invalidate or clean and invalidate (using Arm ARM terms just
> to add to the confusion ;)
>
> Feels like there will be potential race conditions where that matters
> as we might force stale data to be written back.
>
> Perhaps a comment is enough for that. Anyone have the "famous last
> words" feeling?

Is "invalidate" not clear that write-back is optional? Maybe not.

Also, I realized that we tried to include the address range to allow for
the possibility of flushing by virtual address range, but that
overcomplicates the use. I.e. if someone issues a secure erase and the
region association is not established, does that mean that the cache
invalidation is not needed? It could be the case that someone disables a
device, does the secure erase, and then reattaches to the same region.
The cache invalidation is needed, but at the time of the secure erase
the HPA was unknown.

All this to say that I feel the bikeshedding will need to continue until
morale improves.

I notice that the DMA API uses 'sync' to indicate, "make this memory
consistent/coherent for the CPU or the device", so how about an API like

    memregion_sync_for_cpu(int res_desc)

...where the @res_desc would be IORES_DESC_CXL for all CXL and
IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Wed, 7 Sep 2022 18:07:31 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > Andrew Morton wrote:
> > > > I really dislike the term "flush". Sometimes it means writeback,
> > > > sometimes it means invalidate. Perhaps at other times it means
> > > > both.
> > > >
> > > > Can we please be very clear in comments and changelogs about exactly
> > > > what this "flush" does. With bonus points for being more specific in the
> > > > function naming?
> > >
> > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > think this API is purely about invalidate. It just so happens that x86
> > > has not historically had a global invalidate instruction readily
> > > available which leads to the overuse of wbinvd.
> > >
> > > It would be nice to make clear that this API is purely about
> > > invalidating any data cached for a physical address impacted by an
> > > address space management event (secure erase / new region provision).
> > > Write-back is an unnecessary side-effect.
> > >
> > > So how about:
> > >
> > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> >
> > Want to indicate it 'might' write back perhaps?
> > So could be invalidate or clean and invalidate (using Arm ARM terms just
> > to add to the confusion ;)
> >
> > Feels like there will be potential race conditions where that matters
> > as we might force stale data to be written back.
> >
> > Perhaps a comment is enough for that. Anyone have the "famous last
> > words" feeling?
>
> Is "invalidate" not clear that write-back is optional? Maybe not.

Yes, I'd say that "invalidate" means "dirty stuff may or may not have
been written back". Ditto for invalidate_inode_pages2().

> Also, I realized that we tried to include the address range to allow for
> the possibility of flushing by virtual address range, but that
> overcomplicates the use. I.e. if someone issues a secure erase and the
> region association is not established, does that mean that the cache
> invalidation is not needed? It could be the case that someone disables a
> device, does the secure erase, and then reattaches to the same region.
> The cache invalidation is needed, but at the time of the secure erase
> the HPA was unknown.
>
> All this to say that I feel the bikeshedding will need to continue until
> morale improves.
>
> I notice that the DMA API uses 'sync' to indicate, "make this memory
> consistent/coherent for the CPU or the device", so how about an API like
>
>     memregion_sync_for_cpu(int res_desc)
>
> ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.

"sync" is another of my pet peeves ;) In filesystem land, at least.
Does it mean "start writeback and return" or does it mean "start
writeback, wait for its completion then return"?
Andrew Morton wrote:
> On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > Jonathan Cameron wrote:
> > > On Wed, 7 Sep 2022 18:07:31 -0700
> > > Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > Andrew Morton wrote:
> > > > > I really dislike the term "flush". Sometimes it means writeback,
> > > > > sometimes it means invalidate. Perhaps at other times it means
> > > > > both.
> > > > >
> > > > > Can we please be very clear in comments and changelogs about exactly
> > > > > what this "flush" does. With bonus points for being more specific in the
> > > > > function naming?
> > > >
> > > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > > think this API is purely about invalidate. It just so happens that x86
> > > > has not historically had a global invalidate instruction readily
> > > > available which leads to the overuse of wbinvd.
> > > >
> > > > It would be nice to make clear that this API is purely about
> > > > invalidating any data cached for a physical address impacted by an
> > > > address space management event (secure erase / new region provision).
> > > > Write-back is an unnecessary side-effect.
> > > >
> > > > So how about:
> > > >
> > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > >
> > > Want to indicate it 'might' write back perhaps?
> > > So could be invalidate or clean and invalidate (using Arm ARM terms just
> > > to add to the confusion ;)
> > >
> > > Feels like there will be potential race conditions where that matters
> > > as we might force stale data to be written back.
> > >
> > > Perhaps a comment is enough for that. Anyone have the "famous last
> > > words" feeling?
> >
> > Is "invalidate" not clear that write-back is optional? Maybe not.
>
> Yes, I'd say that "invalidate" means "dirty stuff may or may not have
> been written back". Ditto for invalidate_inode_pages2().
>
> > Also, I realized that we tried to include the address range to allow for
> > the possibility of flushing by virtual address range, but that
> > overcomplicates the use. I.e. if someone issues a secure erase and the
> > region association is not established, does that mean that the cache
> > invalidation is not needed? It could be the case that someone disables a
> > device, does the secure erase, and then reattaches to the same region.
> > The cache invalidation is needed, but at the time of the secure erase
> > the HPA was unknown.
> >
> > All this to say that I feel the bikeshedding will need to continue until
> > morale improves.
> >
> > I notice that the DMA API uses 'sync' to indicate, "make this memory
> > consistent/coherent for the CPU or the device", so how about an API like
> >
> >     memregion_sync_for_cpu(int res_desc)
> >
> > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
>
> "sync" is another of my pet peeves ;) In filesystem land, at least.
> Does it mean "start writeback and return" or does it mean "start
> writeback, wait for its completion then return".

Ok, no "sync" :).

/**
 * cpu_cache_invalidate_memregion - drop any CPU cached data for
 *     memregions described by @res_desc
 * @res_desc: one of the IORES_DESC_* types
 *
 * Perform cache maintenance after a memory event / operation that
 * changes the contents of physical memory in a cache-incoherent manner.
 * For example, memory-device secure erase, or provisioning new CXL
 * regions. This routine may or may not write back any dirty contents
 * while performing the invalidation.
 *
 * Returns 0 on success or negative error code on a failure to perform
 * the cache maintenance.
 */
int cpu_cache_invalidate_memregion(int res_desc)

??
On Thu, 8 Sep 2022 16:22:26 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Andrew Morton wrote:
> > On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > Jonathan Cameron wrote:
> > > > On Wed, 7 Sep 2022 18:07:31 -0700
> > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > > Andrew Morton wrote:
> > > > > > I really dislike the term "flush". Sometimes it means writeback,
> > > > > > sometimes it means invalidate. Perhaps at other times it means
> > > > > > both.
> > > > > >
> > > > > > Can we please be very clear in comments and changelogs about exactly
> > > > > > what this "flush" does. With bonus points for being more specific in
> > > > > > the function naming?
> > > > >
> > > > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > > > think this API is purely about invalidate. It just so happens that x86
> > > > > has not historically had a global invalidate instruction readily
> > > > > available which leads to the overuse of wbinvd.
> > > > >
> > > > > It would be nice to make clear that this API is purely about
> > > > > invalidating any data cached for a physical address impacted by an
> > > > > address space management event (secure erase / new region provision).
> > > > > Write-back is an unnecessary side-effect.
> > > > >
> > > > > So how about:
> > > > >
> > > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > > >
> > > > Want to indicate it 'might' write back perhaps?
> > > > So could be invalidate or clean and invalidate (using Arm ARM terms just
> > > > to add to the confusion ;)
> > > >
> > > > Feels like there will be potential race conditions where that matters
> > > > as we might force stale data to be written back.
> > > >
> > > > Perhaps a comment is enough for that. Anyone have the "famous last
> > > > words" feeling?
> > >
> > > Is "invalidate" not clear that write-back is optional? Maybe not.
> >
> > Yes, I'd say that "invalidate" means "dirty stuff may or may not have
> > been written back". Ditto for invalidate_inode_pages2().
> >
> > > Also, I realized that we tried to include the address range to allow for
> > > the possibility of flushing by virtual address range, but that
> > > overcomplicates the use. I.e. if someone issues a secure erase and the
> > > region association is not established, does that mean that the cache
> > > invalidation is not needed? It could be the case that someone disables a
> > > device, does the secure erase, and then reattaches to the same region.
> > > The cache invalidation is needed, but at the time of the secure erase
> > > the HPA was unknown.
> > >
> > > All this to say that I feel the bikeshedding will need to continue until
> > > morale improves.
> > >
> > > I notice that the DMA API uses 'sync' to indicate, "make this memory
> > > consistent/coherent for the CPU or the device", so how about an API like
> > >
> > >     memregion_sync_for_cpu(int res_desc)
> > >
> > > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> > > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
> >
> > "sync" is another of my pet peeves ;) In filesystem land, at least.
> > Does it mean "start writeback and return" or does it mean "start
> > writeback, wait for its completion then return".
>
> Ok, no "sync" :).
>
> /**
>  * cpu_cache_invalidate_memregion - drop any CPU cached data for
>  *     memregions described by @res_desc
>  * @res_desc: one of the IORES_DESC_* types
>  *
>  * Perform cache maintenance after a memory event / operation that
>  * changes the contents of physical memory in a cache-incoherent manner.
>  * For example, memory-device secure erase, or provisioning new CXL
>  * regions. This routine may or may not write back any dirty contents
>  * while performing the invalidation.
>  *
>  * Returns 0 on success or negative error code on a failure to perform
>  * the cache maintenance.
>  */
> int cpu_cache_invalidate_memregion(int res_desc)

lgtm

>
> ??
On Fri, 09 Sep 2022, Jonathan Cameron wrote:
>On Thu, 8 Sep 2022 16:22:26 -0700
>Dan Williams <dan.j.williams@intel.com> wrote:
>
>> Andrew Morton wrote:
>> > On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>> >
>> > > Jonathan Cameron wrote:
>> > > > On Wed, 7 Sep 2022 18:07:31 -0700
>> > > > Dan Williams <dan.j.williams@intel.com> wrote:
>> > > >
>> > > > > Andrew Morton wrote:
>> > > > > > I really dislike the term "flush". Sometimes it means writeback,
>> > > > > > sometimes it means invalidate. Perhaps at other times it means
>> > > > > > both.
>> > > > > >
>> > > > > > Can we please be very clear in comments and changelogs about exactly
>> > > > > > what this "flush" does. With bonus points for being more specific in
>> > > > > > the function naming?
>> > > > >
>> > > > > That's a good point, "flush" has been cargo-culted along in Linux's
>> > > > > cache management APIs to mean write-back-and-invalidate. In this case I
>> > > > > think this API is purely about invalidate. It just so happens that x86
>> > > > > has not historically had a global invalidate instruction readily
>> > > > > available which leads to the overuse of wbinvd.
>> > > > >
>> > > > > It would be nice to make clear that this API is purely about
>> > > > > invalidating any data cached for a physical address impacted by an
>> > > > > address space management event (secure erase / new region provision).
>> > > > > Write-back is an unnecessary side-effect.
>> > > > >
>> > > > > So how about:
>> > > > >
>> > > > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
>> > > >
>> > > > Want to indicate it 'might' write back perhaps?
>> > > > So could be invalidate or clean and invalidate (using Arm ARM terms just
>> > > > to add to the confusion ;)
>> > > >
>> > > > Feels like there will be potential race conditions where that matters
>> > > > as we might force stale data to be written back.
>> > > >
>> > > > Perhaps a comment is enough for that. Anyone have the "famous last
>> > > > words" feeling?
>> > >
>> > > Is "invalidate" not clear that write-back is optional? Maybe not.
>> >
>> > Yes, I'd say that "invalidate" means "dirty stuff may or may not have
>> > been written back". Ditto for invalidate_inode_pages2().
>> >
>> > > Also, I realized that we tried to include the address range to allow for
>> > > the possibility of flushing by virtual address range, but that
>> > > overcomplicates the use. I.e. if someone issues a secure erase and the
>> > > region association is not established, does that mean that the cache
>> > > invalidation is not needed? It could be the case that someone disables a
>> > > device, does the secure erase, and then reattaches to the same region.
>> > > The cache invalidation is needed, but at the time of the secure erase
>> > > the HPA was unknown.
>> > >
>> > > All this to say that I feel the bikeshedding will need to continue until
>> > > morale improves.
>> > >
>> > > I notice that the DMA API uses 'sync' to indicate, "make this memory
>> > > consistent/coherent for the CPU or the device", so how about an API like
>> > >
>> > >     memregion_sync_for_cpu(int res_desc)
>> > >
>> > > ...where the @res_desc would be IORES_DESC_CXL for all CXL and
>> > > IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.
>> >
>> > "sync" is another of my pet peeves ;) In filesystem land, at least.
>> > Does it mean "start writeback and return" or does it mean "start
>> > writeback, wait for its completion then return".
>>
>> Ok, no "sync" :).
>>
>> /**
>>  * cpu_cache_invalidate_memregion - drop any CPU cached data for
>>  *     memregions described by @res_desc
>>  * @res_desc: one of the IORES_DESC_* types
>>  *
>>  * Perform cache maintenance after a memory event / operation that
>>  * changes the contents of physical memory in a cache-incoherent manner.
>>  * For example, memory-device secure erase, or provisioning new CXL
>>  * regions. This routine may or may not write back any dirty contents
>>  * while performing the invalidation.
>>  *
>>  * Returns 0 on success or negative error code on a failure to perform
>>  * the cache maintenance.
>>  */
>> int cpu_cache_invalidate_memregion(int res_desc)
>
>lgtm

Likewise, and I don't see anyone else objecting, so I'll go ahead and
send a new iteration.

Thanks,
Davidlohr
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..594e6b6a4925 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@ config X86
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
+	select ARCH_HAS_MEMREGION_INVALIDATE	if X86_64
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..18463cb704fb 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -330,6 +330,20 @@ void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
+bool arch_has_flush_memregion(void)
+{
+	return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
+}
+EXPORT_SYMBOL(arch_has_flush_memregion);
+
+void arch_flush_memregion(phys_addr_t phys, resource_size_t size)
+{
+	wbinvd_on_all_cpus();
+}
+EXPORT_SYMBOL(arch_flush_memregion);
+#endif
+
 static void __cpa_flush_all(void *arg)
 {
 	unsigned long cache = (unsigned long)arg;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 8dd792a55730..32e622f51cde 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -3,6 +3,7 @@
 #include <linux/libnvdimm.h>
 #include <linux/ndctl.h>
 #include <linux/acpi.h>
+#include <linux/memregion.h>
 #include <asm/smp.h>
 #include "intel.h"
 #include "nfit.h"
@@ -190,12 +191,11 @@ static int intel_security_change_key(struct nvdimm *nvdimm,
 	}
 }
 
-static void nvdimm_invalidate_cache(void);
-
 static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
 		const struct nvdimm_key_data *key_data)
 {
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	struct {
 		struct nd_cmd_pkg pkg;
 		struct nd_intel_unlock_unit cmd;
@@ -213,6 +213,9 @@ static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
 	if (!test_bit(NVDIMM_INTEL_UNLOCK_UNIT, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	memcpy(nd_cmd.cmd.passphrase, key_data->data,
 			sizeof(nd_cmd.cmd.passphrase));
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -228,7 +231,7 @@ static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
 	}
 
 	/* DIMM unlocked, invalidate all CPU caches before we read it */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 
 	return 0;
 }
@@ -279,6 +282,7 @@ static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
 {
 	int rc;
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	unsigned int cmd = ptype == NVDIMM_MASTER ?
 		NVDIMM_INTEL_MASTER_SECURE_ERASE : NVDIMM_INTEL_SECURE_ERASE;
 	struct {
@@ -297,8 +301,11 @@ static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
 	if (!test_bit(cmd, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	/* flush all cache before we erase DIMM */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 	memcpy(nd_cmd.cmd.passphrase, key->data,
 			sizeof(nd_cmd.cmd.passphrase));
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -318,7 +325,7 @@ static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
 	}
 
 	/* DIMM erased, invalidate all CPU caches before we read it */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 
 	return 0;
 }
@@ -326,6 +333,7 @@ static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
 {
 	int rc;
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	struct {
 		struct nd_cmd_pkg pkg;
 		struct nd_intel_query_overwrite cmd;
@@ -341,6 +349,9 @@ static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
 	if (!test_bit(NVDIMM_INTEL_QUERY_OVERWRITE, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
 	if (rc < 0)
 		return rc;
@@ -355,7 +366,7 @@ static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
 	}
 
 	/* flush all cache before we make the nvdimms available */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 
 	return 0;
 }
@@ -364,6 +375,7 @@ static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
 {
 	int rc;
 	struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+	struct acpi_nfit_system_address *spa = nfit_mem->spa_dcr;
 	struct {
 		struct nd_cmd_pkg pkg;
 		struct nd_intel_overwrite cmd;
@@ -380,8 +392,11 @@ static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
 	if (!test_bit(NVDIMM_INTEL_OVERWRITE, &nfit_mem->dsm_mask))
 		return -ENOTTY;
 
+	if (!arch_has_flush_memregion())
+		return -EINVAL;
+
 	/* flush all cache before we erase DIMM */
-	nvdimm_invalidate_cache();
+	arch_flush_memregion(spa->address, spa->length);
 	memcpy(nd_cmd.cmd.passphrase, nkey->data,
 			sizeof(nd_cmd.cmd.passphrase));
 	rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -401,22 +416,6 @@ static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
 	}
 }
 
-/*
- * TODO: define a cross arch wbinvd equivalent when/if
- * NVDIMM_FAMILY_INTEL command support arrives on another arch.
- */
-#ifdef CONFIG_X86
-static void nvdimm_invalidate_cache(void)
-{
-	wbinvd_on_all_cpus();
-}
-#else
-static void nvdimm_invalidate_cache(void)
-{
-	WARN_ON_ONCE("cache invalidation required after unlock\n");
-}
-#endif
-
 static const struct nvdimm_security_ops __intel_security_ops = {
 	.get_flags = intel_security_flags,
 	.freeze = intel_security_freeze,
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..c35201c0696f 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -20,4 +20,29 @@ static inline void memregion_free(int id)
 {
 }
 #endif
+
+/*
+ * Device memory technologies like NVDIMM and CXL have events like
+ * secure erase and dynamic region provision that can invalidate an
+ * entire physical memory address range at once. Limit that
+ * functionality to architectures that have an efficient way to
+ * writeback and invalidate potentially terabytes of memory at once.
+ *
+ * To ensure this, users must first call arch_has_flush_memregion()
+ * before anything, to verify the operation is feasible.
+ */
+#ifdef CONFIG_ARCH_HAS_MEMREGION_INVALIDATE
+void arch_flush_memregion(phys_addr_t phys, resource_size_t size);
+bool arch_has_flush_memregion(void);
+#else
+static inline bool arch_has_flush_memregion(void)
+{
+	return false;
+}
+static inline void arch_flush_memregion(phys_addr_t phys, resource_size_t size)
+{
+	WARN_ON_ONCE("cache invalidation required");
+}
+#endif
+
 #endif /* _MEMREGION_H_ */
diff --git a/lib/Kconfig b/lib/Kconfig
index dc1ab2ed1dc6..8319e7731e7b 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -662,6 +662,9 @@ config ARCH_HAS_PMEM_API
 config MEMREGION
 	bool
 
+config ARCH_HAS_MEMREGION_INVALIDATE
+	bool
+
 config ARCH_HAS_MEMREMAP_COMPAT_ALIGN
 	bool
With CXL security features, global CPU cache flushing nvdimm requirements
are no longer specific to that subsystem, even beyond the scope of
security_ops. CXL will need such semantics for features not necessarily
limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once, and the kernel
needs to be sure that none of the data from before the secure erase is
still present in the cache. It is also used when unlocking a memory
device, where speculative reads and firmware accesses could have cached
poison from before the device was unlocked.

This capability is typically only used once per boot (for unlock), or
once per bare metal provisioning event (secure erase), such as when
handing off the system to another tenant or decommissioning a device.

Users must first call arch_has_flush_memregion() to know whether this
functionality is available on the architecture. Only enable it on x86-64
via the wbinvd() hammer.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---

Changes from v2 (https://lore.kernel.org/all/20220819171024.1766857-1-dave@stgolabs.net/):
- Redid to use memregion based interfaces + VMM check on x86 (Dan)
- Restricted the flushing to x86-64.

Note: Since we are still dealing with a physical "range" at this level,
added the spa range for nfit even though this is unused.

 arch/x86/Kconfig             |  1 +
 arch/x86/mm/pat/set_memory.c | 14 +++++++++++
 drivers/acpi/nfit/intel.c    | 45 ++++++++++++++++++------------------
 include/linux/memregion.h    | 25 ++++++++++++++++++++
 lib/Kconfig                  |  3 +++
 5 files changed, 65 insertions(+), 23 deletions(-)