Message ID | 5d8e1f752051173d2d1b5c3e14b54eb3506ed3ef.1684892404.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] mmu_notifiers: Restore documentation for .invalidate_range() |
On Wed, May 24, 2023 at 11:47:29AM +1000, Alistair Popple wrote: > ARM64 requires TLB invalidates when upgrading pte permission from > read-only to read-write. However mmu_notifiers assume upgrades do not > need notifications and none are sent. This causes problems when a > secondary TLB such as implemented by an ARM SMMU doesn't support > broadcast TLB maintenance (BTM) and caches a read-only PTE. I don't really like this design, but I see how you get here.. mmu notifiers behavior should not be tied to the architecture, they are supposed to be generic reflections of what the MM is doing so that they can be hooked into by general purpose drivers. If you want to hardwire invalidate_range to be only for SVA cases that actually share the page table itself and rely on some arch-defined invalidation, then we should give the op a much better name and discourage anyone else from abusing the new ops variable behavior. > As no notification is sent and the SMMU does not snoop TLB invalidates > it will continue to return read-only entries to a device even though > the CPU page table contains a writable entry. This leads to a > continually faulting device and no way of handling the fault. Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should just have the iommu driver push an iotlb invalidation command before it acks it? PRI is already really slow so I'm not sure a pipelined invalidation is going to be a problem? Does the SMMU architecture permit negative caching which would suggest we need it anyhow? Jason
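For context, the suggestion in the last paragraph amounts to flushing the SMMU's cached (possibly negative) translation from the I/O page fault path before the page response is sent, so the device's retry cannot hit the stale read-only entry. A minimal sketch of that shape; handle_mm_fault_for_device(), smmu_flush_iotlb_range() and iopf_complete() are hypothetical stand-ins, not existing SMMU driver functions:

#include <linux/device.h>
#include <linux/mm.h>

/* Illustrative response codes; the real IOMMU fault reporting API differs. */
enum iopf_resp { IOPF_RESP_RETRY, IOPF_RESP_FAILURE };

/* Hypothetical helpers, assumed to be provided by the driver: */
int handle_mm_fault_for_device(struct mm_struct *mm, unsigned long addr, bool write);
void smmu_flush_iotlb_range(struct device *dev, struct mm_struct *mm,
			    unsigned long start, size_t size);
void iopf_complete(struct device *dev, enum iopf_resp resp);

/*
 * Sketch of "invalidate before acking the PRI": once the mm has resolved
 * the fault (possibly upgrading a read-only PTE to writable), flush the
 * IOTLB/ATC for that address before telling the device to retry, so a
 * negatively cached read-only entry cannot keep faulting.
 */
static void example_handle_iopf(struct device *dev, struct mm_struct *mm,
				unsigned long addr, bool write)
{
	int ret = handle_mm_fault_for_device(mm, addr, write);

	if (!ret)
		smmu_flush_iotlb_range(dev, mm, addr & PAGE_MASK, PAGE_SIZE);

	iopf_complete(dev, ret ? IOPF_RESP_FAILURE : IOPF_RESP_RETRY);
}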
Jason Gunthorpe <jgg@nvidia.com> writes: > On Wed, May 24, 2023 at 11:47:29AM +1000, Alistair Popple wrote: >> ARM64 requires TLB invalidates when upgrading pte permission from >> read-only to read-write. However mmu_notifiers assume upgrades do not >> need notifications and none are sent. This causes problems when a >> secondary TLB such as implemented by an ARM SMMU doesn't support >> broadcast TLB maintenance (BTM) and caches a read-only PTE. > > I don't really like this design, but I see how you get here.. Not going to argue with that, I don't love it either but it seemed like the most straightforward approach. > mmu notifiers behavior should not be tied to the architecture, they > are supposed to be generic reflections of what the MM is doing so that > they can be hooked into by general purpose drivers. Interesting. I've always assumed mmu notifiers were primarily about keeping cache invalidations in sync with what the MM is doing. This is based on the fact that you can't use mmu notifiers to establish mappings and we instead have this rather complex dance with hmm_range_fault() to establish a mapping. My initial version [1] effectively did add a generic event. Admittedly it was somewhat incomplete, because I didn't hook up the new mmu notifier event type to every user that could possibly ignore it (eg. KVM). But that was part of the problem - it was hard to figure out which mmu notifier users can safely ignore it versus ones that can't, and that may depend on what architecture it's running on. Hence I hooked it up to ptep_set_access_flags, because you get arch-specific filtering as required. Perhaps the better option is to introduce a new mmu notifier op and let drivers opt-in? > If you want to hardwire invalidate_range to be only for SVA cases that > actually share the page table itself and rely on some arch-defined > invalidation, then we should give the op a much better name and > discourage anyone else from abusing the new ops variable behavior. Well that's the only use case I currently care about because we have hit this issue, so for now at least I'd much rather have a straightforward fix we can backport. The problem is an invalidation isn't well defined. If we are to make this architecture independent then we need to be sending an invalidation for any PTE state change (ie. clean/dirty/writeable/read-only/present/not-present/etc) which we don't do currently. >> As no notification is sent and the SMMU does not snoop TLB invalidates >> it will continue to return read-only entries to a device even though >> the CPU page table contains a writable entry. This leads to a >> continually faulting device and no way of handling the fault. > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should > just have the iommu driver push an iotlb invalidation command before > it acks it? PRI is already really slow so I'm not sure a pipelined > invalidation is going to be a problem? Does the SMMU architecture > permit negative caching which would suggest we need it anyhow? Yes, SMMU architecture (which matches the ARM architecture in regards to TLB maintenance requirements) permits negative caching of some mapping attributes including the read-only attribute. Hence without the flushing we fault continuously. > Jason [1] - https://lore.kernel.org/linux-mm/ZGxg+I8FWz3YqBMk@infradead.org/T/
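The opt-in idea mentioned above could take the shape of an additional callback that only drivers with permission-caching (or negatively caching) secondary TLBs implement. This is purely illustrative — no such op exists upstream, and the struct below is a mock-up rather than the real mmu_notifier_ops:

#include <linux/mmu_notifier.h>

/*
 * Mock-up of an opt-in notifier op.  Drivers that cache PTE permissions
 * (or negative entries) and cannot snoop CPU TLB maintenance would set
 * it; everyone else keeps ignoring permission upgrades as today.
 */
struct mmu_notifier_ops_example {
	/* ...the existing mmu_notifier_ops callbacks would remain... */

	/*
	 * Hypothetical: called for PTE changes that today generate no
	 * notification at all, e.g. a read-only to read-write upgrade
	 * performed by ptep_set_access_flags().
	 */
	void (*update_range)(struct mmu_notifier *subscription,
			     struct mm_struct *mm,
			     unsigned long start, unsigned long end);
};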
On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote: > > >> As no notification is sent and the SMMU does not snoop TLB invalidates > >> it will continue to return read-only entries to a device even though > >> the CPU page table contains a writable entry. This leads to a > >> continually faulting device and no way of handling the fault. > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should > > just have the iommu driver push an iotlb invalidation command before > > it acks it? PRI is already really slow so I'm not sure a pipelined > > invalidation is going to be a problem? Does the SMMU architecture > > permit negative caching which would suggest we need it anyhow? > > Yes, SMMU architecture (which matches the ARM architecture in regards to > TLB maintenance requirements) permits negative caching of some mapping > attributes including the read-only attribute. Hence without the flushing > we fault continuously. Sounds like a straight up SMMU bug, invalidate the cache after resolving the PRI event. Jason
On 2023-05-30 12:54, Jason Gunthorpe wrote: > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote: >> >>>> As no notification is sent and the SMMU does not snoop TLB invalidates >>>> it will continue to return read-only entries to a device even though >>>> the CPU page table contains a writable entry. This leads to a >>>> continually faulting device and no way of handling the fault. >>> >>> Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should >>> just have the iommu driver push an iotlb invalidation command before >>> it acks it? PRI is already really slow so I'm not sure a pipelined >>> invalidation is going to be a problem? Does the SMMU architecture >>> permit negative caching which would suggest we need it anyhow? >> >> Yes, SMMU architecture (which matches the ARM architecture in regards to >> TLB maintenance requirements) permits negative caching of some mapping >> attributes including the read-only attribute. Hence without the flushing >> we fault continuously. > > Sounds like a straight up SMMU bug, invalidate the cache after > resolving the PRI event. No, if the IOPF handler calls back into the mm layer to resolve the fault, and the mm layer issues an invalidation in the process of that which isn't propagated back to the SMMU (as it would be if BTM were in use), logically that's the mm layer's failing. The SMMU driver shouldn't have to issue extra mostly-redundant invalidations just because different CPU architectures have different idiosyncracies around caching of permissions. Thanks, Robin.
On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote: > On 2023-05-30 12:54, Jason Gunthorpe wrote: > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote: > > > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates > > > > > it will continue to return read-only entries to a device even though > > > > > the CPU page table contains a writable entry. This leads to a > > > > > continually faulting device and no way of handling the fault. > > > > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should > > > > just have the iommu driver push an iotlb invalidation command before > > > > it acks it? PRI is already really slow so I'm not sure a pipelined > > > > invalidation is going to be a problem? Does the SMMU architecture > > > > permit negative caching which would suggest we need it anyhow? > > > > > > Yes, SMMU architecture (which matches the ARM architecture in regards to > > > TLB maintenance requirements) permits negative caching of some mapping > > > attributes including the read-only attribute. Hence without the flushing > > > we fault continuously. > > > > Sounds like a straight up SMMU bug, invalidate the cache after > > resolving the PRI event. > > No, if the IOPF handler calls back into the mm layer to resolve the fault, > and the mm layer issues an invalidation in the process of that which isn't > propagated back to the SMMU (as it would be if BTM were in use), logically > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra > mostly-redundant invalidations just because different CPU architectures have > different idiosyncracies around caching of permissions. The mm has a definition for invalidate_range that does not include all the invalidation points SMMU needs. This is difficult to sort out because this is general purpose cross arch stuff. You are right that this is worth optimizing, but right now we have a -rc bug that needs fixing and adding an extra SMMU invalidation is a straightforward -rc friendly way to address it. Jason
On 30/05/2023 1:52 pm, Jason Gunthorpe wrote: > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote: >> On 2023-05-30 12:54, Jason Gunthorpe wrote: >>> On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote: >>>> >>>>>> As no notification is sent and the SMMU does not snoop TLB invalidates >>>>>> it will continue to return read-only entries to a device even though >>>>>> the CPU page table contains a writable entry. This leads to a >>>>>> continually faulting device and no way of handling the fault. >>>>> >>>>> Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should >>>>> just have the iommu driver push an iotlb invalidation command before >>>>> it acks it? PRI is already really slow so I'm not sure a pipelined >>>>> invalidation is going to be a problem? Does the SMMU architecture >>>>> permit negative caching which would suggest we need it anyhow? >>>> >>>> Yes, SMMU architecture (which matches the ARM architecture in regards to >>>> TLB maintenance requirements) permits negative caching of some mapping >>>> attributes including the read-only attribute. Hence without the flushing >>>> we fault continuously. >>> >>> Sounds like a straight up SMMU bug, invalidate the cache after >>> resolving the PRI event. >> >> No, if the IOPF handler calls back into the mm layer to resolve the fault, >> and the mm layer issues an invalidation in the process of that which isn't >> propagated back to the SMMU (as it would be if BTM were in use), logically >> that's the mm layer's failing. The SMMU driver shouldn't have to issue extra >> mostly-redundant invalidations just because different CPU architectures have >> different idiosyncracies around caching of permissions. > > The mm has a definition for invalidate_range that does not include all > the invalidation points SMMU needs. This is difficult to sort out > because this is general purpose cross arch stuff. > > You are right that this is worth optimizing, but right now we have a > -rc bug that needs fixing and adding and extra SMMU invalidation is a > straightforward -rc friendly way to address it. Sure; to clarify, I'm not against the overall idea of putting a hack in the SMMU driver with a big comment that it is a hack to work around missing notifications under SVA, but it would not constitute an "SMMU bug" to not do that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some other arm64 hypervisor driver wanted to do something funky with notifiers to shadow stage 1 permissions for some reason, it would presumably be equally affected. FWIW, the VT-d spec seems to suggest that invalidation on RO->RW is only optional if the requester supports recoverable page faults, so although there's no use-case for non-PRI-based SVA at the moment, there is some potential argument that the notifier issue generalises even to x86. Thanks, Robin.
On Tue, May 30, 2023 at 02:44:11PM +0100, Robin Murphy wrote: > On 30/05/2023 1:52 pm, Jason Gunthorpe wrote: > > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote: > > > On 2023-05-30 12:54, Jason Gunthorpe wrote: > > > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote: > > > > > > > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates > > > > > > > it will continue to return read-only entries to a device even though > > > > > > > the CPU page table contains a writable entry. This leads to a > > > > > > > continually faulting device and no way of handling the fault. > > > > > > > > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should > > > > > > just have the iommu driver push an iotlb invalidation command before > > > > > > it acks it? PRI is already really slow so I'm not sure a pipelined > > > > > > invalidation is going to be a problem? Does the SMMU architecture > > > > > > permit negative caching which would suggest we need it anyhow? > > > > > > > > > > Yes, SMMU architecture (which matches the ARM architecture in regards to > > > > > TLB maintenance requirements) permits negative caching of some mapping > > > > > attributes including the read-only attribute. Hence without the flushing > > > > > we fault continuously. > > > > > > > > Sounds like a straight up SMMU bug, invalidate the cache after > > > > resolving the PRI event. > > > > > > No, if the IOPF handler calls back into the mm layer to resolve the fault, > > > and the mm layer issues an invalidation in the process of that which isn't > > > propagated back to the SMMU (as it would be if BTM were in use), logically > > > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra > > > mostly-redundant invalidations just because different CPU architectures have > > > different idiosyncracies around caching of permissions. > > > > The mm has a definition for invalidate_range that does not include all > > the invalidation points SMMU needs. This is difficult to sort out > > because this is general purpose cross arch stuff. > > > > You are right that this is worth optimizing, but right now we have a > > -rc bug that needs fixing and adding and extra SMMU invalidation is a > > straightforward -rc friendly way to address it. > > Sure; to clarify, I'm not against the overall idea of putting a hack in the > SMMU driver with a big comment that it is a hack to work around missing > notifications under SVA, but it would not constitute an "SMMU bug" to not do > that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some > other arm64 hypervisor driver wanted to do something funky with notifiers to > shadow stage 1 permissions for some reason, it would presumably be equally > affected. Okay, Alistair can you make this? > FWIW, the VT-d spec seems to suggest that invalidation on RO->RW is only > optional if the requester supports recoverable page faults, so although > there's no use-case for non-PRI-based SVA at the moment, there is some > potential argument that the notifier issue generalises even to x86. IMHO I think we messed this up at some point.. Joerg added invalidate_range just for the iommu to use, so having it be arch specific could make some sense. 
However, KVM later co-opted it to do this: commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 Author: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Date: Sat Jun 6 13:26:27 2020 +0900 KVM: x86: Fix APIC page invalidation race Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to fix inappropriate APIC page invalidation by re-introducing arch specific kvm_arch_mmu_notifier_invalidate_range() and calling it from kvm_mmu_notifier_invalidate_range_start. However, the patch left a possible race where the VMCS APIC address cache is updated *before* it is unmapped: (Invalidator) kvm_mmu_notifier_invalidate_range_start() (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD) (KVM VCPU) vcpu_enter_guest() (KVM VCPU) kvm_vcpu_reload_apic_access_page() (Invalidator) actually unmap page Because of the above race, there can be a mismatch between the host physical address stored in the APIC_ACCESS_PAGE VMCS field and the host physical address stored in the EPT entry for the APIC GPA (0xfee0000). When this happens, the processor will not trap APIC accesses, and will instead show the raw contents of the APIC-access page. Because Windows OS periodically checks for unexpected modifications to the LAPIC register, this will show up as a BSOD crash with BugCheck CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1751017. The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range() cannot guarantee that no additional references are taken to the pages in the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately, this case is supported by the MMU notifier API, as documented in include/linux/mmu_notifier.h: * If the subsystem * can't guarantee that no additional references are taken to * the pages in the range, it has to implement the * invalidate_range() notifier to remove any references taken * after invalidate_range_start(). The fix therefore is to reload the APIC-access page field in the VMCS from kvm_mmu_notifier_invalidate_range() instead of ..._range_start(). Which I think is a hacky fix. KVM already has locking for invalidate_start/end - it has to check mmu_notifier_retry_cache() with the sequence numbers/etc around when it does hva_to_pfn(). The bug is that the kvm_vcpu_reload_apic_access_page() path is ignoring this locking so it ignores in-progress range invalidations. It should spin until the invalidation clears like other places in KVM. The comment is kind of misleading because drivers shouldn't be abusing the iommu centric invalidate_range() thing to fix missing locking in start/end users. :\ So if KVM could be fixed up we could make invalidate_range defined to be an arch specific callback to synchronize the iommu TLB. Sean? Jason
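For readers who don't know KVM's internals, the locking referred to above is KVM's invalidation sequence count: fault handlers snapshot it before translating a host virtual address and recheck it under mmu_lock, retrying if an invalidation ran in between. A condensed sketch, using the field and helper names of the era of the quoted commit (kvm->mmu_notifier_seq and mmu_notifier_retry(); newer kernels spell these mmu_invalidate_seq and mmu_invalidate_retry()):

#include <linux/kvm_host.h>

/*
 * Condensed illustration of KVM's retry pattern; real fault handlers also
 * deal with errors, refcounts and the actual SPTE installation.
 */
static int example_map_gfn(struct kvm *kvm, gfn_t gfn)
{
	unsigned long mmu_seq;
	kvm_pfn_t pfn;

retry:
	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();

	pfn = gfn_to_pfn(kvm, gfn);	/* may sleep/fault, outside mmu_lock */

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq)) {
		/* A range invalidation raced with us: drop the pfn and retry. */
		spin_unlock(&kvm->mmu_lock);
		kvm_release_pfn_clean(pfn);
		goto retry;
	}
	/* ...install the mapping while still holding mmu_lock... */
	spin_unlock(&kvm->mmu_lock);
	return 0;
}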
On Tue, May 30, 2023, Jason Gunthorpe wrote: > IMHO I think we messed this up at some point.. > > Joerg added invalidate_range just for the iommu to use, so having it > be arch specific could make some sense. > > However, KVM later co-opted it to do this: > > commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 > Author: Eiichi Tsukata <eiichi.tsukata@nutanix.com> > Date: Sat Jun 6 13:26:27 2020 +0900 > > KVM: x86: Fix APIC page invalidation race ... > The fix therefore is to reload the APIC-access page field in the VMCS > from kvm_mmu_notifier_invalidate_range() instead of ..._range_start(). > > Which I think is a hacky fix. Agreed, though as you note below, the invalidate_range() description does make it seem like a somewhat reasonable thing to do. > KVM already has locking for invalidate_start/end - it has to check > mmu_notifier_retry_cache() with the sequence numbers/etc around when > it does does hva_to_pfn() > > The bug is that the kvm_vcpu_reload_apic_access_page() path is > ignoring this locking so it ignores in-progress range > invalidations. It should spin until the invalidation clears like other > places in KVM. > > The comment is kind of misleading because drivers shouldn't be abusing > the iommu centric invalidate_range() thing to fix missing locking in > start/end users. :\ > > So if KVM could be fixed up we could make invalidate_range defined to > be an arch specific callback to synchronize the iommu TLB. And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make it super obvious that they are intended for two different purposes? E.g. instead of invalidate_range(), something like invalidate_secondary_tlbs(). FWIW, PPC's OpenCAPI support (drivers/misc/ocxl/link.c) also uses invalidate_range(). Though IIUC, the use case is the same as a "traditional" IOMMU, where a device can share the CPU's page tables, so maybe the devices can be considered IOMMUs in practice, if not in name? It allows an accelerator (which could be an FPGA, ASICs, ...) to access the host memory coherently, using virtual addresses. An OpenCAPI device can also host its own memory, that can be accessed from the host. > Sean? It's doable, though definitely not 6.4 material. I have patches coded up. Assuming testing goes well, I'll post them regardless of the OCXL side of things. I've disliked KVM's one-off use of invalidate_range() for a long time, this is a good excuse to get rid of it before KVM gains more usage.
On Tue, May 30, 2023 at 02:44:40PM -0700, Sean Christopherson wrote: > > KVM already has locking for invalidate_start/end - it has to check > > mmu_notifier_retry_cache() with the sequence numbers/etc around when > > it does does hva_to_pfn() > > > > The bug is that the kvm_vcpu_reload_apic_access_page() path is > > ignoring this locking so it ignores in-progress range > > invalidations. It should spin until the invalidation clears like other > > places in KVM. > > > > The comment is kind of misleading because drivers shouldn't be abusing > > the iommu centric invalidate_range() thing to fix missing locking in > > start/end users. :\ > > > > So if KVM could be fixed up we could make invalidate_range defined to > > be an arch specific callback to synchronize the iommu TLB. > > And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make > it super obvious that they are intended for two different purposes? E.g. instead > of invalidate_range(), something like invalidate_secondary_tlbs(). Yeah, I think I would call it invalidate_arch_secondary_tlb() and document it as being an arch specific set of invalidations that match the architected TLB maintenance requirements. And maybe we can check it more carefully to make it be called in less places. Like I'm not sure it is right to call it from invalidate_range_end under this new definition.. > FWIW, PPC's OpenCAPI support (drivers/misc/ocxl/link.c) also uses invalidate_range(). > Though IIUC, the use case is the same as a "traditional" IOMMU, where a device can > share the CPU's page tables, so maybe the devices can be considered IOMMUs in practice, > if not in name? OpenCAPI is an IOMMU HW for sure. PPC just doesn't have integration with the drivers/iommu infrastructure. > I have patches coded up. Assuming testing goes well, I'll post them regardless > of the OCXL side of things. I've disliked KVM's one-off use of invalidate_range() > for a long time, this is a good excuse to get rid of it before KVM gains more usage. Nice! Thanks, Jason
Jason Gunthorpe <jgg@nvidia.com> writes: > On Tue, May 30, 2023 at 02:44:11PM +0100, Robin Murphy wrote: >> On 30/05/2023 1:52 pm, Jason Gunthorpe wrote: >> > On Tue, May 30, 2023 at 01:14:41PM +0100, Robin Murphy wrote: >> > > On 2023-05-30 12:54, Jason Gunthorpe wrote: >> > > > On Tue, May 30, 2023 at 06:05:41PM +1000, Alistair Popple wrote: >> > > > > >> > > > > > > As no notification is sent and the SMMU does not snoop TLB invalidates >> > > > > > > it will continue to return read-only entries to a device even though >> > > > > > > the CPU page table contains a writable entry. This leads to a >> > > > > > > continually faulting device and no way of handling the fault. >> > > > > > >> > > > > > Doesn't the fault generate a PRI/etc? If we get a PRI maybe we should >> > > > > > just have the iommu driver push an iotlb invalidation command before >> > > > > > it acks it? PRI is already really slow so I'm not sure a pipelined >> > > > > > invalidation is going to be a problem? Does the SMMU architecture >> > > > > > permit negative caching which would suggest we need it anyhow? >> > > > > >> > > > > Yes, SMMU architecture (which matches the ARM architecture in regards to >> > > > > TLB maintenance requirements) permits negative caching of some mapping >> > > > > attributes including the read-only attribute. Hence without the flushing >> > > > > we fault continuously. >> > > > >> > > > Sounds like a straight up SMMU bug, invalidate the cache after >> > > > resolving the PRI event. >> > > >> > > No, if the IOPF handler calls back into the mm layer to resolve the fault, >> > > and the mm layer issues an invalidation in the process of that which isn't >> > > propagated back to the SMMU (as it would be if BTM were in use), logically >> > > that's the mm layer's failing. The SMMU driver shouldn't have to issue extra >> > > mostly-redundant invalidations just because different CPU architectures have >> > > different idiosyncracies around caching of permissions. >> > >> > The mm has a definition for invalidate_range that does not include all >> > the invalidation points SMMU needs. This is difficult to sort out >> > because this is general purpose cross arch stuff. >> > >> > You are right that this is worth optimizing, but right now we have a >> > -rc bug that needs fixing and adding and extra SMMU invalidation is a >> > straightforward -rc friendly way to address it. >> >> Sure; to clarify, I'm not against the overall idea of putting a hack in the >> SMMU driver with a big comment that it is a hack to work around missing >> notifications under SVA, but it would not constitute an "SMMU bug" to not do >> that. SMMU is just another VMSAv8-compatible MMU - if, say, KVM or some >> other arm64 hypervisor driver wanted to do something funky with notifiers to >> shadow stage 1 permissions for some reason, it would presumably be equally >> affected. > > Okay, Alistair can you make this? Right, I agree this isn't a bug in SMMU. I could add the hack to the SMMU driver, but it doesn't address my issue because we're using SVA without PRI. So I'd much rather update the MM to keep SVA IOMMU TLBs in sync. So I'd rather keep the invalidate in ptep_set_access_flags(). Would renaming invalidate_range() to invalidate_arch_secondary_tlb() along with clearing up the documentation make that more acceptable, at least in the short term? 
> On Tue, May 30, 2023 at 02:44:40PM -0700, Sean Christopherson wrote: >> > KVM already has locking for invalidate_start/end - it has to check >> > mmu_notifier_retry_cache() with the sequence numbers/etc around when >> > it does does hva_to_pfn() >> > >> > The bug is that the kvm_vcpu_reload_apic_access_page() path is >> > ignoring this locking so it ignores in-progress range >> > invalidations. It should spin until the invalidation clears like other >> > places in KVM. >> > >> > The comment is kind of misleading because drivers shouldn't be abusing >> > the iommu centric invalidate_range() thing to fix missing locking in >> > start/end users. :\ >> > >> > So if KVM could be fixed up we could make invalidate_range defined to >> > be an arch specific callback to synchronize the iommu TLB. >> >> And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make >> it super obvious that they are intended for two different purposes? E.g. instead >> of invalidate_range(), something like invalidate_secondary_tlbs(). > > Yeah, I think I would call it invalidate_arch_secondary_tlb() and > document it as being an arch specific set of invalidations that match > the architected TLB maintenance requrements. And maybe we can check it > more carefully to make it be called in less places. Like I'm not sure > it is right to call it from invalidate_range_end under this new > definition.. I'd be happy to look at that, although it sounds like Sean already is. >> FWIW, PPC's OpenCAPI support (drivers/misc/ocxl/link.c) also uses invalidate_range(). >> Though IIUC, the use case is the same as a "traditional" IOMMU, where a device can >> share the CPU's page tables, so maybe the devices can be considered IOMMUs in practice, >> if not in name? > > OpenCAPI is an IOMMU HW for sure. PPC just doesn't have integration > with the drivers/iommu infrastructure. Yep it sure is. I worked on that a fair bit when it was first being brought up. It doesn't suffer this problem because it follows the PPC MMU architecture which doesn't require TLB invalidates for RO/RW upgrades. It's a pity it was never integrated with the rest of the driver iommu infrastructure though. >> I have patches coded up. Assuming testing goes well, I'll post them regardless >> of the OCXL side of things. I've disliked KVM's one-off use of invalidate_range() >> for a long time, this is a good excuse to get rid of it before KVM gains more usage. Feel free to CC me, I'd be happy to review them and can probably help with the OCXL side of things. > Nice! > > Thanks, > Jason
On Wed, May 31, 2023 at 10:30:48AM +1000, Alistair Popple wrote: > So I'd rather keep the invalidate in ptep_set_access_flags(). Would > renaming invalidate_range() to invalidate_arch_secondary_tlb() along > with clearing up the documentation make that more acceptable, at least > in the short term? Then we need to go through removing kvm first I think. Jason
Jason Gunthorpe <jgg@nvidia.com> writes: > On Wed, May 31, 2023 at 10:30:48AM +1000, Alistair Popple wrote: > >> So I'd rather keep the invalidate in ptep_set_access_flags(). Would >> renaming invalidate_range() to invalidate_arch_secondary_tlb() along >> with clearing up the documentation make that more acceptable, at least >> in the short term? > > Then we need to go through removing kvm first I think. Why? I don't think we need to hold up a fix for something that is an issue today so we can rework a fix for an unrelated problem. Strongly agree the API/interface/documentation could be better but neither this nor the KVM fix are abusing the API based on how it's currently documented IMHO. So I think improving the API is a separate problem. Happy to help with that, but don't see why it has to happen first given KVM usage was acceptable and still presumably works even though its implementation isn't something we like now. >> And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make >> it super obvious that they are intended for two different purposes? E.g. instead >> of invalidate_range(), something like invalidate_secondary_tlbs(). > > Yeah, I think I would call it invalidate_arch_secondary_tlb() and > document it as being an arch specific set of invalidations that match > the architected TLB maintenance requrements. And maybe we can check it > more carefully to make it be called in less places. Like I'm not sure > it is right to call it from invalidate_range_end under this new > definition.. I will look at this in more depth, but this comment reminded me there is already an issue with calling .invalidate_range() from invalidate_range_end(). We have seen slow downs when unmapping unused ranges because unmap_vmas() will call .invalidate_range() via .invalidate_range_end() flooding the SMMU with invalidates even though zap_pte_range() skipped it because the PTEs were pte_none. - Alistair
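For reference, the coupling described in the last paragraph comes from the notifier end path itself (the same code the patch later in this thread reworks): unless a caller uses the _only_end() variant, invalidate_range_end() re-invokes ->invalidate_range(), so end notifications can reach the SMMU even when the CPU side had nothing to flush. Trimmed from mm/mmu_notifier.c of that era, with the SRCU and non-blocking bookkeeping omitted:

/* Trimmed sketch of mm/mmu_notifier.c's end handler (pre-rework). */
static void mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subs,
				    struct mmu_notifier_range *range,
				    bool only_end)
{
	struct mmu_notifier *sub;

	hlist_for_each_entry_rcu(sub, &subs->list, hlist) {
		/*
		 * The end path also performs the secondary-TLB invalidation
		 * unless the caller used ..._invalidate_range_only_end(),
		 * which is only safe when the flush already happened under
		 * the page table lock.
		 */
		if (!only_end && sub->ops->invalidate_range)
			sub->ops->invalidate_range(sub, range->mm,
						   range->start, range->end);
		if (sub->ops->invalidate_range_end)
			sub->ops->invalidate_range_end(sub, range);
	}
}

The unmap_vmas() path described above reaches this with only_end == false, which is where the redundant SMMU invalidates come from.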
On Wed, May 31, 2023 at 12:46:06PM +1000, Alistair Popple wrote: > > Jason Gunthorpe <jgg@nvidia.com> writes: > > > On Wed, May 31, 2023 at 10:30:48AM +1000, Alistair Popple wrote: > > > >> So I'd rather keep the invalidate in ptep_set_access_flags(). Would >> renaming invalidate_range() to invalidate_arch_secondary_tlb() along >> with clearing up the documentation make that more acceptable, at least >> in the short term? > > > > Then we need to go through removing kvm first I think. > > Why? I don't think we need to hold up a fix for something that is an > issue today so we can rework a fix for an unrelated problem. I'm nervous about affecting KVM's weird usage if we go in and start making changes. Getting rid of it first is much safer. > > Yeah, I think I would call it invalidate_arch_secondary_tlb() and > > document it as being an arch specific set of invalidations that match > > the architected TLB maintenance requrements. And maybe we can check it > > more carefully to make it be called in less places. Like I'm not sure > > it is right to call it from invalidate_range_end under this new > > definition.. > > I will look at this in more depth, but this comment reminded me there is > already an issue with calling .invalidate_range() from > invalidate_range_end(). We have seen slow downs when unmapping unused > ranges because unmap_vmas() will call .invalidate_range() via > .invalidate_range_end() flooding the SMMU with invalidates even though > zap_pte_range() skipped it because the PTEs were pte_none. Yes, if the new API is specifically about synchronizing an architected TLB then really the call to the op should be done near the architecture's TLB flush points, and not higher in the MM. ie any flush of the CPU tlb should mirror 1:1 to a flush of the IOMMU TLB, no broadening or narrowing. It is a very clean definition and we can leap directly to it once we get kvm out of the way. Jason
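Concretely, the 1:1 rule would mean hooking the notifier at the architecture's own TLB flush helpers rather than higher up in generic mm code. A sketch of that shape — the wrapper name is made up for illustration, and arm64's real flush_tlb_page() does not currently do this:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>

/*
 * Illustrative wrapper: every CPU TLB flush is mirrored by exactly one
 * secondary (IOMMU) TLB invalidation over the same range, with no
 * broadening or narrowing done by generic MM code.
 */
static inline void example_flush_tlb_page(struct vm_area_struct *vma,
					  unsigned long addr)
{
	unsigned long start = addr & PAGE_MASK;

	flush_tlb_page(vma, addr);			/* CPU TLB */
	mmu_notifier_invalidate_range(vma->vm_mm,	/* secondary TLBs */
				      start, start + PAGE_SIZE);
}

The arm64 hunks at the end of this thread apply the same pairing at the ptep_set_access_flags() call sites rather than inside the flush helper itself.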
Jason Gunthorpe <jgg@nvidia.com> writes: > On Wed, May 31, 2023 at 12:46:06PM +1000, Alistair Popple wrote: >> >> Jason Gunthorpe <jgg@nvidia.com> writes: >> >> > On Wed, May 31, 2023 at 10:30:48AM +1000, Alistair Popple wrote: >> > >> >> So I'd rather keep the invalidate in ptep_set_access_flags(). Would >> >> renaming invalidate_range() to invalidate_arch_secondary_tlb() along >> >> with clearing up the documentation make that more acceptable, at least >> >> in the short term? >> > >> > Then we need to go through removing kvm first I think. >> >> Why? I don't think we need to hold up a fix for something that is an >> issue today so we can rework a fix for an unrelated problem. > > I'm nervous about affecting KVM's weird usage if we go in and start > making changes. Getting rid of it first is much safer Fair enough. In this case though I think we're safe because we won't be affecting KVM's usage of it - my change only affects ARM64 and KVM only really uses this on x86 via the arch-specific kvm_arch_mmu_notifier_invalidate_range() definition. >> > Yeah, I think I would call it invalidate_arch_secondary_tlb() and >> > document it as being an arch specific set of invalidations that match >> > the architected TLB maintenance requrements. And maybe we can check it >> > more carefully to make it be called in less places. Like I'm not sure >> > it is right to call it from invalidate_range_end under this new >> > definition.. >> >> I will look at this in more depth, but this comment reminded me there is >> already an issue with calling .invalidate_range() from >> invalidate_range_end(). We have seen slow downs when unmapping unused >> ranges because unmap_vmas() will call .invalidate_range() via >> .invalidate_range_end() flooding the SMMU with invalidates even though >> zap_pte_range() skipped it because the PTEs were pte_none. > > Yes, if the new API is specifically about synchronizing an architected > TLB then really the call to the op should be done near the > architectures TLB flush points, and not higher in the MM. > > ie any flush of the CPU tlb should mirror 1:1 to a flush of the IOMMU > TLB, no broadinging or narrowing. > > It is a very clean defintion and we can leap directly to it once we > get kvm out of the way. Yes, no argument there. > Jason
Alistair Popple <apopple@nvidia.com> writes: > Alistair Popple <apopple@nvidia.com> writes: > >>> On Tue, May 30, 2023 at 02:44:40PM -0700, Sean Christopherson wrote: >>>> > KVM already has locking for invalidate_start/end - it has to check >>>> > mmu_notifier_retry_cache() with the sequence numbers/etc around when >>>> > it does does hva_to_pfn() >>>> > >>>> > The bug is that the kvm_vcpu_reload_apic_access_page() path is >>>> > ignoring this locking so it ignores in-progress range >>>> > invalidations. It should spin until the invalidation clears like other >>>> > places in KVM. >>>> > >>>> > The comment is kind of misleading because drivers shouldn't be abusing >>>> > the iommu centric invalidate_range() thing to fix missing locking in >>>> > start/end users. :\ >>>> > >>>> > So if KVM could be fixed up we could make invalidate_range defined to >>>> > be an arch specific callback to synchronize the iommu TLB. >>>> >>>> And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make >>>> it super obvious that they are intended for two different purposes? E.g. instead >>>> of invalidate_range(), something like invalidate_secondary_tlbs(). >>> >>> Yeah, I think I would call it invalidate_arch_secondary_tlb() and >>> document it as being an arch specific set of invalidations that match >>> the architected TLB maintenance requrements. And maybe we can check it >>> more carefully to make it be called in less places. Like I'm not sure >>> it is right to call it from invalidate_range_end under this new >>> definition.. >> >> I'd be happy to look at that, although it sounds like Sean already is. Thanks Sean for getting the KVM fix posted so quickly. I'm looking into doing the rename now. Do we want to do more than a simple rename and tidy up of callers though? What I'm thinking is introducing something like an IOMMU/TLB specific variant of notifiers (eg. struct tlb_notifier) which has the invalidate_secondary_tlbs() callback in say struct tlb_notifier_ops rather than leaving that in the mmu_notifier_ops. Implementation wise we'd reuse most of the mmu_notifier code, but it would help make the two different uses of notifiers clearer. Thoughts? - Alistair
Alistair Popple <apopple@nvidia.com> writes: > Alistair Popple <apopple@nvidia.com> writes: > >> Alistair Popple <apopple@nvidia.com> writes: >> >>>> On Tue, May 30, 2023 at 02:44:40PM -0700, Sean Christopherson wrote: >>>>> > KVM already has locking for invalidate_start/end - it has to check >>>>> > mmu_notifier_retry_cache() with the sequence numbers/etc around when >>>>> > it does does hva_to_pfn() >>>>> > >>>>> > The bug is that the kvm_vcpu_reload_apic_access_page() path is >>>>> > ignoring this locking so it ignores in-progress range >>>>> > invalidations. It should spin until the invalidation clears like other >>>>> > places in KVM. >>>>> > >>>>> > The comment is kind of misleading because drivers shouldn't be abusing >>>>> > the iommu centric invalidate_range() thing to fix missing locking in >>>>> > start/end users. :\ >>>>> > >>>>> > So if KVM could be fixed up we could make invalidate_range defined to >>>>> > be an arch specific callback to synchronize the iommu TLB. >>>>> >>>>> And maybe rename invalidate_range() and/or invalidate_range_{start,end}() to make >>>>> it super obvious that they are intended for two different purposes? E.g. instead >>>>> of invalidate_range(), something like invalidate_secondary_tlbs(). >>>> >>>> Yeah, I think I would call it invalidate_arch_secondary_tlb() and >>>> document it as being an arch specific set of invalidations that match >>>> the architected TLB maintenance requrements. And maybe we can check it >>>> more carefully to make it be called in less places. Like I'm not sure >>>> it is right to call it from invalidate_range_end under this new >>>> definition.. >>> >>> I'd be happy to look at that, although it sounds like Sean already is. > > Thanks Sean for getting the KVM fix posted so quickly. I'm looking into > doing the rename now. > > Do we want to do more than a simple rename and tidy up of callers > though? What I'm thinking is introducing something like an IOMMU/TLB > specific variant of notifiers (eg. struct tlb_notifier) which has the > invalidate_secondary_tlbs() callback in say struct tlb_notifier_ops > rather than leaving that in the mmu_notifier_ops. > > Implementation wise we'd reuse most of the mmu_notifier code, but it > would help make the two different uses of notifiers clearer. Thoughts? So something like the below incomplete patch (against v6.2) which would introduce a new struct tlb_notifier and associated ops. The change isn't huge, but does result in some churn and another layer of indirection in mmu_notifier.c. Otherwise we can just rename the callback. 
--- diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index a5a63b1c947e..c300cd435609 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -14,7 +14,7 @@ #include "../../io-pgtable-arm.h" struct arm_smmu_mmu_notifier { - struct mmu_notifier mn; + struct tlb_notifier mn; struct arm_smmu_ctx_desc *cd; bool cleared; refcount_t refs; @@ -186,7 +186,7 @@ static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd) } } -static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn, +static void arm_smmu_mm_invalidate_range(struct tlb_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end) { @@ -207,7 +207,7 @@ static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn, arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size); } -static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm) +static void arm_smmu_mm_release(struct tlb_notifier *mn, struct mm_struct *mm) { struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn); struct arm_smmu_domain *smmu_domain = smmu_mn->domain; @@ -231,15 +231,15 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm) mutex_unlock(&sva_lock); } -static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn) +static void arm_smmu_mmu_notifier_free(struct tlb_notifier *mn) { kfree(mn_to_smmu(mn)); } -static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = { - .invalidate_range = arm_smmu_mm_invalidate_range, - .release = arm_smmu_mm_release, - .free_notifier = arm_smmu_mmu_notifier_free, +static const struct tlb_notifier_ops arm_smmu_tlb_notifier_ops = { + .invalidate_secondary_tlbs = arm_smmu_mm_invalidate_range, + .release = arm_smmu_mm_release, + .free_notifier = arm_smmu_mmu_notifier_free, }; /* Allocate or get existing MMU notifier for this {domain, mm} pair */ @@ -252,7 +252,7 @@ arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain, struct arm_smmu_mmu_notifier *smmu_mn; list_for_each_entry(smmu_mn, &smmu_domain->mmu_notifiers, list) { - if (smmu_mn->mn.mm == mm) { + if (smmu_mn->mn.mm_notifier_chain.mm == mm) { refcount_inc(&smmu_mn->refs); return smmu_mn; } @@ -271,9 +271,9 @@ arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain, refcount_set(&smmu_mn->refs, 1); smmu_mn->cd = cd; smmu_mn->domain = smmu_domain; - smmu_mn->mn.ops = &arm_smmu_mmu_notifier_ops; + smmu_mn->mn.ops = &arm_smmu_tlb_notifier_ops; - ret = mmu_notifier_register(&smmu_mn->mn, mm); + ret = tlb_notifier_register(&smmu_mn->mn, mm); if (ret) { kfree(smmu_mn); goto err_free_cd; @@ -288,7 +288,7 @@ arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain, err_put_notifier: /* Frees smmu_mn */ - mmu_notifier_put(&smmu_mn->mn); + tlb_notifier_put(&smmu_mn->mn); err_free_cd: arm_smmu_free_shared_cd(cd); return ERR_PTR(ret); @@ -296,7 +296,7 @@ arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain, static void arm_smmu_mmu_notifier_put(struct arm_smmu_mmu_notifier *smmu_mn) { - struct mm_struct *mm = smmu_mn->mn.mm; + struct mm_struct *mm = smmu_mn->mn.mm_notifier_chain.mm; struct arm_smmu_ctx_desc *cd = smmu_mn->cd; struct arm_smmu_domain *smmu_domain = smmu_mn->domain; @@ -316,7 +316,7 @@ static void arm_smmu_mmu_notifier_put(struct arm_smmu_mmu_notifier *smmu_mn) } /* Frees smmu_mn */ - mmu_notifier_put(&smmu_mn->mn); + tlb_notifier_put(&smmu_mn->mn); arm_smmu_free_shared_cd(cd); } diff --git a/include/linux/mmu_notifier.h 
b/include/linux/mmu_notifier.h index d6c06e140277..157571497b28 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -13,6 +13,7 @@ struct mmu_notifier_subscriptions; struct mmu_notifier; struct mmu_notifier_range; struct mmu_interval_notifier; +struct mm_notifier_chain; /** * enum mmu_notifier_event - reason for the mmu notifier callback @@ -61,6 +62,18 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +struct tlb_notifier; +struct tlb_notifier_ops { + void (*invalidate_secondary_tlbs)(struct tlb_notifier *subscription, + struct mm_struct *mm, + unsigned long start, + unsigned long end); + + void (*free_notifier)(struct tlb_notifier *subscription); + void (*release)(struct tlb_notifier *subscription, + struct mm_struct *mm); +}; + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -186,29 +199,6 @@ struct mmu_notifier_ops { void (*invalidate_range_end)(struct mmu_notifier *subscription, const struct mmu_notifier_range *range); - /* - * invalidate_range() is either called between - * invalidate_range_start() and invalidate_range_end() when the - * VM has to free pages that where unmapped, but before the - * pages are actually freed, or outside of _start()/_end() when - * a (remote) TLB is necessary. - * - * If invalidate_range() is used to manage a non-CPU TLB with - * shared page-tables, it not necessary to implement the - * invalidate_range_start()/end() notifiers, as - * invalidate_range() already catches the points in time when an - * external TLB range needs to be flushed. For more in depth - * discussion on this see Documentation/mm/mmu_notifier.rst - * - * Note that this function might be called with just a sub-range - * of what was passed to invalidate_range_start()/end(), if - * called between those functions. - */ - void (*invalidate_range)(struct mmu_notifier *subscription, - struct mm_struct *mm, - unsigned long start, - unsigned long end); - /* * These callbacks are used with the get/put interface to manage the * lifetime of the mmu_notifier memory. alloc_notifier() returns a new @@ -234,14 +224,23 @@ struct mmu_notifier_ops { * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem). * 3. 
No other concurrent thread can access the list (release) */ -struct mmu_notifier { +struct mm_notifier_chain { struct hlist_node hlist; - const struct mmu_notifier_ops *ops; struct mm_struct *mm; struct rcu_head rcu; unsigned int users; }; +struct mmu_notifier { + const struct mmu_notifier_ops *ops; + struct mm_notifier_chain mm_notifier_chain; +}; + +struct tlb_notifier { + const struct tlb_notifier_ops *ops; + struct mm_notifier_chain mm_notifier_chain; +}; + /** * struct mmu_interval_notifier_ops * @invalidate: Upon return the caller must stop using any SPTEs within this @@ -283,6 +282,10 @@ static inline int mm_has_notifiers(struct mm_struct *mm) return unlikely(mm->notifier_subscriptions); } +int tlb_notifier_register(struct tlb_notifier *subscription, + struct mm_struct *mm); +void tlb_notifier_put(struct tlb_notifier *subscription); + struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops, struct mm_struct *mm); static inline struct mmu_notifier * diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index f45ff1b7626a..cdc3a373a225 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -47,6 +47,16 @@ struct mmu_notifier_subscriptions { struct hlist_head deferred_list; }; +struct mmu_notifier *mmu_notifier_from_chain(struct mm_notifier_chain *chain) +{ + return container_of(chain, struct mmu_notifier, mm_notifier_chain); +} + +struct tlb_notifier *tlb_notifier_from_chain(struct mm_notifier_chain *chain) +{ + return container_of(chain, struct tlb_notifier, mm_notifier_chain); +} + /* * This is a collision-retry read-side/write-side 'lock', a lot like a * seqcount, however this allows multiple write-sides to hold it at @@ -299,7 +309,7 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions, static void mn_hlist_release(struct mmu_notifier_subscriptions *subscriptions, struct mm_struct *mm) { - struct mmu_notifier *subscription; + struct mm_notifier_chain *chain; int id; /* @@ -307,8 +317,10 @@ static void mn_hlist_release(struct mmu_notifier_subscriptions *subscriptions, * ->release returns. */ id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist, - srcu_read_lock_held(&srcu)) + hlist_for_each_entry_rcu(chain, &subscriptions->list, hlist, + srcu_read_lock_held(&srcu)) { + struct mmu_notifier *subscription = + mmu_notifier_from_chain(chain); /* * If ->release runs before mmu_notifier_unregister it must be * handled, as it's the only way for the driver to flush all @@ -317,18 +329,19 @@ static void mn_hlist_release(struct mmu_notifier_subscriptions *subscriptions, */ if (subscription->ops->release) subscription->ops->release(subscription, mm); + } spin_lock(&subscriptions->lock); while (unlikely(!hlist_empty(&subscriptions->list))) { - subscription = hlist_entry(subscriptions->list.first, - struct mmu_notifier, hlist); + chain = hlist_entry(subscriptions->list.first, + struct mm_notifier_chain, hlist); /* * We arrived before mmu_notifier_unregister so * mmu_notifier_unregister will do nothing other than to wait * for ->release to finish and for mmu_notifier_unregister to * return. 
*/ - hlist_del_init_rcu(&subscription->hlist); + hlist_del_init_rcu(&chain->hlist); } spin_unlock(&subscriptions->lock); srcu_read_unlock(&srcu, id); @@ -366,13 +379,15 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long start, unsigned long end) { - struct mmu_notifier *subscription; + struct mm_notifier_chain *chain; int young = 0, id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_rcu(chain, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { + struct mmu_notifier *subscription = + mmu_notifier_from_chain(chain); if (subscription->ops->clear_flush_young) young |= subscription->ops->clear_flush_young( subscription, mm, start, end); @@ -386,13 +401,15 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, unsigned long start, unsigned long end) { - struct mmu_notifier *subscription; + struct mm_notifier_chain *chain; int young = 0, id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_rcu(chain, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { + struct mmu_notifier *subscription = + mmu_notifier_from_chain(chain); if (subscription->ops->clear_young) young |= subscription->ops->clear_young(subscription, mm, start, end); @@ -405,13 +422,15 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address) { - struct mmu_notifier *subscription; + struct mm_notifier_chain *chain; int young = 0, id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_rcu(chain, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { + struct mmu_notifier *subscription = + mmu_notifier_from_chain(chain); if (subscription->ops->test_young) { young = subscription->ops->test_young(subscription, mm, address); @@ -427,13 +446,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm, void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte) { - struct mmu_notifier *subscription; + struct mm_notifier_chain *chain; int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_rcu(chain, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { + struct mmu_notifier *subscription = + mmu_notifier_from_chain(chain); if (subscription->ops->change_pte) subscription->ops->change_pte(subscription, mm, address, pte); @@ -476,13 +497,15 @@ static int mn_hlist_invalidate_range_start( struct mmu_notifier_subscriptions *subscriptions, struct mmu_notifier_range *range) { - struct mmu_notifier *subscription; + struct mm_notifier_chain *chain; int ret = 0; int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist, + hlist_for_each_entry_rcu(chain, &subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { + struct mmu_notifier *subscription = + mmu_notifier_from_chain(chain); const struct mmu_notifier_ops *ops = subscription->ops; if (ops->invalidate_range_start) { @@ -519,8 +542,10 @@ static int mn_hlist_invalidate_range_start( * notifiers and one or more failed start, any that succeeded * start are expecting their end to be called. Do so now. 
 	 */
-	hlist_for_each_entry_rcu(subscription, &subscriptions->list,
+	hlist_for_each_entry_rcu(chain, &subscriptions->list,
 				 hlist, srcu_read_lock_held(&srcu)) {
+		struct mmu_notifier *subscription =
+			mmu_notifier_from_chain(chain);
 		if (!subscription->ops->invalidate_range_end)
 			continue;
@@ -553,35 +578,20 @@ static void
 mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
 			struct mmu_notifier_range *range, bool only_end)
 {
-	struct mmu_notifier *subscription;
+	struct mm_notifier_chain *chain;
 	int id;

 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
+	hlist_for_each_entry_rcu(chain, &subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		/*
-		 * Call invalidate_range here too to avoid the need for the
-		 * subsystem of having to register an invalidate_range_end
-		 * call-back when there is invalidate_range already. Usually a
-		 * subsystem registers either invalidate_range_start()/end() or
-		 * invalidate_range(), so this will be no additional overhead
-		 * (besides the pointer check).
-		 *
-		 * We skip call to invalidate_range() if we know it is safe ie
-		 * call site use mmu_notifier_invalidate_range_only_end() which
-		 * is safe to do when we know that a call to invalidate_range()
-		 * already happen under page table lock.
-		 */
-		if (!only_end && subscription->ops->invalidate_range)
-			subscription->ops->invalidate_range(subscription,
-							    range->mm,
-							    range->start,
-							    range->end);
-		if (subscription->ops->invalidate_range_end) {
+		struct mmu_notifier *subscription =
+			mmu_notifier_from_chain(chain);
+		const struct mmu_notifier_ops *ops = subscription->ops;
+
+		if (ops->invalidate_range_end) {
 			if (!mmu_notifier_range_blockable(range))
 				non_block_start();
-			subscription->ops->invalidate_range_end(subscription,
-								range);
+			ops->invalidate_range_end(subscription, range);
 			if (!mmu_notifier_range_blockable(range))
 				non_block_end();
 		}
@@ -607,27 +617,24 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				     unsigned long start, unsigned long end)
 {
-	struct mmu_notifier *subscription;
+	struct mm_notifier_chain *chain;
 	int id;

 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(subscription,
+	hlist_for_each_entry_rcu(chain,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->invalidate_range)
-			subscription->ops->invalidate_range(subscription, mm,
-							    start, end);
+		struct tlb_notifier *subscription =
+			tlb_notifier_from_chain(chain);
+		if (subscription->ops->invalidate_secondary_tlbs)
+			subscription->ops->invalidate_secondary_tlbs(subscription,
+								     mm, start, end);
 	}
 	srcu_read_unlock(&srcu, id);
 }

-/*
- * Same as mmu_notifier_register but here the caller must hold the mmap_lock in
- * write mode. A NULL mn signals the notifier is being registered for itree
- * mode.
- */
-int __mmu_notifier_register(struct mmu_notifier *subscription,
-			    struct mm_struct *mm)
+int __mm_notifier_chain_register(struct mm_notifier_chain *chain,
+				 struct mm_struct *mm)
 {
 	struct mmu_notifier_subscriptions *subscriptions = NULL;
 	int ret;
@@ -677,14 +684,14 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	if (subscriptions)
 		smp_store_release(&mm->notifier_subscriptions, subscriptions);

-	if (subscription) {
+	if (chain) {
 		/* Pairs with the mmdrop in mmu_notifier_unregister_* */
 		mmgrab(mm);
-		subscription->mm = mm;
-		subscription->users = 1;
+		chain->mm = mm;
+		chain->users = 1;

 		spin_lock(&mm->notifier_subscriptions->lock);
-		hlist_add_head_rcu(&subscription->hlist,
+		hlist_add_head_rcu(&chain->hlist,
 				   &mm->notifier_subscriptions->list);
 		spin_unlock(&mm->notifier_subscriptions->lock);
 	} else
@@ -698,6 +705,18 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	kfree(subscriptions);
 	return ret;
 }
+
+/*
+ * Same as mmu_notifier_register but here the caller must hold the mmap_lock in
+ * write mode. A NULL mn signals the notifier is being registered for itree
+ * mode.
+ */
+int __mmu_notifier_register(struct mmu_notifier *subscription,
+			    struct mm_struct *mm)
+{
+	return __mm_notifier_chain_register(&subscription->mm_notifier_chain,
+					    mm);
+}
 EXPORT_SYMBOL_GPL(__mmu_notifier_register);

 /**
@@ -731,20 +750,34 @@ int mmu_notifier_register(struct mmu_notifier *subscription,
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);

+int tlb_notifier_register(struct tlb_notifier *subscription,
+			  struct mm_struct *mm)
+{
+	int ret;
+
+	mmap_write_lock(mm);
+	ret = __mm_notifier_chain_register(&subscription->mm_notifier_chain, mm);
+	mmap_write_unlock(mm);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tlb_notifier_register);
+
 static struct mmu_notifier *
 find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
 {
-	struct mmu_notifier *subscription;
+	struct mm_notifier_chain *chain;

 	spin_lock(&mm->notifier_subscriptions->lock);
-	hlist_for_each_entry_rcu(subscription,
+	hlist_for_each_entry_rcu(chain,
 				 &mm->notifier_subscriptions->list, hlist,
 				 lockdep_is_held(&mm->notifier_subscriptions->lock)) {
+		struct mmu_notifier *subscription =
+			mmu_notifier_from_chain(chain);
 		if (subscription->ops != ops)
 			continue;

-		if (likely(subscription->users != UINT_MAX))
-			subscription->users++;
+		if (likely(chain->users != UINT_MAX))
+			chain->users++;
 		else
 			subscription = ERR_PTR(-EOVERFLOW);
 		spin_unlock(&mm->notifier_subscriptions->lock);
@@ -822,7 +855,7 @@ void mmu_notifier_unregister(struct mmu_notifier *subscription,
 {
 	BUG_ON(atomic_read(&mm->mm_count) <= 0);

-	if (!hlist_unhashed(&subscription->hlist)) {
+	if (!hlist_unhashed(&subscription->mm_notifier_chain.hlist)) {
 		/*
 		 * SRCU here will force exit_mmap to wait for ->release to
 		 * finish before freeing the pages.
@@ -843,7 +876,7 @@ void mmu_notifier_unregister(struct mmu_notifier *subscription,
 		 * Can not use list_del_rcu() since __mmu_notifier_release
 		 * can delete it before we hold the lock.
 		 */
-		hlist_del_init_rcu(&subscription->hlist);
+		hlist_del_init_rcu(&subscription->mm_notifier_chain.hlist);
 		spin_unlock(&mm->notifier_subscriptions->lock);
 	}

@@ -861,15 +894,34 @@ EXPORT_SYMBOL_GPL(mmu_notifier_unregister);

 static void mmu_notifier_free_rcu(struct rcu_head *rcu)
 {
-	struct mmu_notifier *subscription =
-		container_of(rcu, struct mmu_notifier, rcu);
-	struct mm_struct *mm = subscription->mm;
+	struct mm_notifier_chain *chain =
+		container_of(rcu, struct mm_notifier_chain, rcu);
+	struct mm_struct *mm = chain->mm;
+	struct mmu_notifier *subscription = mmu_notifier_from_chain(chain);

 	subscription->ops->free_notifier(subscription);
 	/* Pairs with the get in __mmu_notifier_register() */
 	mmdrop(mm);
 }

+void mm_notifier_chain_put(struct mm_notifier_chain *chain)
+{
+	struct mm_struct *mm = chain->mm;
+
+	spin_lock(&mm->notifier_subscriptions->lock);
+	if (WARN_ON(!chain->users) ||
+	    --chain->users)
+		goto out_unlock;
+	hlist_del_init_rcu(&chain->hlist);
+	spin_unlock(&mm->notifier_subscriptions->lock);
+
+	call_srcu(&srcu, &chain->rcu, mmu_notifier_free_rcu);
+	return;
+
+out_unlock:
+	spin_unlock(&mm->notifier_subscriptions->lock);
+}
+
 /**
  * mmu_notifier_put - Release the reference on the notifier
  * @subscription: The notifier to act on
@@ -894,22 +946,16 @@ static void mmu_notifier_free_rcu(struct rcu_head *rcu)
  */
 void mmu_notifier_put(struct mmu_notifier *subscription)
 {
-	struct mm_struct *mm = subscription->mm;
-
-	spin_lock(&mm->notifier_subscriptions->lock);
-	if (WARN_ON(!subscription->users) || --subscription->users)
-		goto out_unlock;
-	hlist_del_init_rcu(&subscription->hlist);
-	spin_unlock(&mm->notifier_subscriptions->lock);
-
-	call_srcu(&srcu, &subscription->rcu, mmu_notifier_free_rcu);
-	return;
-
-out_unlock:
-	spin_unlock(&mm->notifier_subscriptions->lock);
+	mm_notifier_chain_put(&subscription->mm_notifier_chain);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_put);

+void tlb_notifier_put(struct tlb_notifier *subscription)
+{
+	mm_notifier_chain_put(&subscription->mm_notifier_chain);
+}
+EXPORT_SYMBOL_GPL(tlb_notifier_put);
+
 static int __mmu_interval_notifier_insert(
 	struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
 	struct mmu_notifier_subscriptions *subscriptions, unsigned long start,
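For illustration only, a minimal sketch of how a driver might hook into the tlb_notifier interface sketched in the diff above. Only tlb_notifier_register(), tlb_notifier_put() and the .invalidate_secondary_tlbs() call site appear in the diff; the struct tlb_notifier_ops layout and the my_dev_* names below are assumptions made for the sketch, not part of the proposed API.

#include <linux/mmu_notifier.h>

/* Hypothetical device-side invalidation, e.g. SMMU ATC/TLB commands. */
static void my_dev_flush_iotlb(unsigned long iova, unsigned long size)
{
	/* Device-specific IOTLB/ATC invalidation would go here. */
}

/* Called for CPU TLB invalidations, including permission upgrades. */
static void my_dev_invalidate_secondary_tlbs(struct tlb_notifier *subscription,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
{
	my_dev_flush_iotlb(start, end - start);
}

/* Assumed ops table; only the callback name is taken from the diff. */
static const struct tlb_notifier_ops my_dev_tlb_ops = {
	.invalidate_secondary_tlbs = my_dev_invalidate_secondary_tlbs,
};

static struct tlb_notifier my_dev_notifier = {
	.ops = &my_dev_tlb_ops,
};

/* Bind/unbind an address space, e.g. from an SVA bind/unbind path. */
static int my_dev_bind_mm(struct mm_struct *mm)
{
	return tlb_notifier_register(&my_dev_notifier, mm);
}

static void my_dev_unbind_mm(void)
{
	/* Drops the chain reference; any free callback is not shown above. */
	tlb_notifier_put(&my_dev_notifier);
}

This is only meant to show the registration flow; a real driver would embed the tlb_notifier in its per-bond state and use container_of() in the callback.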
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index cb21ccd..1ee45a8 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -25,6 +25,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>

 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -225,8 +226,11 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	} while (pteval != old_pteval);

 	/* Invalidate a stale read-only entry */
-	if (dirty)
+	if (dirty) {
 		flush_tlb_page(vma, address);
+		mmu_notifier_invalidate_range(vma->vm_mm, address & PAGE_MASK,
+					      (address & PAGE_MASK) + PAGE_SIZE);
+	}
 	return 1;
 }

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 95364e8..677f0d1 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -14,6 +14,7 @@
 #include <linux/pagemap.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
+#include <linux/mmu_notifier.h>
 #include <asm/mman.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -487,6 +488,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,

 	orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);

+	/*
+	 * Make sure any cached read-only entries are removed from
+	 * secondary TLBs.
+	 */
+	if (dirty)
+		mmu_notifier_invalidate_range(mm, addr,
+					      addr + (pgsize * ncontig));
+
 	/* Make sure we don't lose the dirty or young state */
 	if (pte_dirty(orig_pte))
 		pte = pte_mkdirty(pte);
ARM64 requires TLB invalidates when upgrading pte permission from
read-only to read-write. However mmu_notifiers assume upgrades do not
need notifications and none are sent. This causes problems when a
secondary TLB such as implemented by an ARM SMMU doesn't support
broadcast TLB maintenance (BTM) and caches a read-only PTE.

As no notification is sent and the SMMU does not snoop TLB invalidates
it will continue to return read-only entries to a device even though
the CPU page table contains a writable entry. This leads to a
continually faulting device and no way of handling the fault.

The ARM SMMU driver already registers for mmu notifier events to keep
any secondary TLB synchronised. Therefore sending a notifier event on
permission upgrade fixes the problem.

Rather than adding notifier calls to generic, architecture-independent
code, where they may cause performance regressions on architectures
that don't require them, add the call to the architecture-specific
ptep_set_access_flags(), where the CPU TLB is invalidated.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---

A version of this fix was previously posted here:

https://lore.kernel.org/linux-mm/ZGxg+I8FWz3YqBMk@infradead.org/T/

That fix updated generic, architecture-independent code by adding a new
mmu notifier range event that would allow filtering by
architectures/drivers that didn't require flushing for upgrades. This
was done because calling notifiers from architecture-specific code
requires calling the notifier while holding the ptl. It wasn't
immediately obvious that this was safe, but review comments and git
history suggest it must be, hence the updated approach here.
---
 arch/arm64/mm/fault.c       | 6 +++++-
 arch/arm64/mm/hugetlbpage.c | 9 +++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)
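For context, the receiving side this patch relies on already exists: any mmu notifier that implements the .invalidate_range() callback will now also be called when ptep_set_access_flags() upgrades a read-only entry. Below is a minimal sketch of such a user, assuming the current invalidate_range() signature; the my_tlb_* names are hypothetical placeholders, not the actual SMMU driver code.

#include <linux/mmu_notifier.h>

/* Hypothetical device-side flush; a real driver would push IOTLB/ATC
 * invalidation commands to its hardware here. */
static void my_tlb_flush_range(unsigned long start, unsigned long end)
{
}

/* Fires for CPU TLB invalidations, now including read-only to
 * read-write upgrades on arm64 with this patch applied. */
static void my_tlb_invalidate_range(struct mmu_notifier *subscription,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	my_tlb_flush_range(start, end);
}

static const struct mmu_notifier_ops my_tlb_ops = {
	.invalidate_range = my_tlb_invalidate_range,
};

static struct mmu_notifier my_tlb_notifier = {
	.ops = &my_tlb_ops,
};

/* Registered once per bound address space, e.g. at SVA bind time. */
static int my_tlb_bind(struct mm_struct *mm)
{
	return mmu_notifier_register(&my_tlb_notifier, mm);
}

Because the new notification is issued under the ptl in the hunks above, the callback must not sleep, which is consistent with the existing documentation for the invalidate_range() hook.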