Message ID: 20190827131818.14724-1-will@kernel.org
Series:     Fix TLB invalidation on arm64
On Tue, Aug 27, 2019 at 02:18:12PM +0100, Will Deacon wrote:
> Hi all,

Hi Will,

For the series:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> [+linux-arch since the end of this may be applicable to other architectures]
>
> Commit 24fe1b0efad4fcdd ("arm64: Remove unnecessary ISBs from
> set_{pte,pmd,pud}") removed ISB instructions immediately following updates
> to the page table, on the grounds that they are not required by the
> architecture and a DSB alone is sufficient to ensure that subsequent data
> accesses use the new translation:
>
> DDI0487E_a, B2-128:
>
>   | ... no instruction that appears in program order after the DSB
>   | instruction can alter any state of the system or perform any part of
>   | its functionality until the DSB completes other than:
>   |
>   | * Being fetched from memory and decoded
>   | * Reading the general-purpose, SIMD and floating-point,
>   |   Special-purpose, or System registers that are directly or
>   |   indirectly read without causing side-effects.
>
> However, the same document also states the following:
>
> DDI0487E_a, B2-125:
>
>   | DMB and DSB instructions affect reads and writes to the memory system
>   | generated by Load/Store instructions and data or unified cache
>   | maintenance instructions being executed by the PE. Instruction
>   | fetches or accesses caused by a hardware translation table access are
>   | not explicit accesses.
>
> which appears to claim that the DSB alone is insufficient. Unfortunately,
> some CPU designers have followed the second clause above, whereas in Linux
> we've been relying on the first. This means that our mapping sequence:
>
>         MOV     X0, <valid pte>
>         STR     X0, [Xptep]     // Store new PTE to page table
>         DSB     ISHST
>         LDR     X1, [X2]        // Translates using the new PTE
>
> can actually raise a translation fault on the load instruction because the
> translation can be performed speculatively before the page table update and
> then marked as "faulting" by the CPU. For user PTEs, this is ok because we
> can handle the spurious fault, but for kernel PTEs and intermediate table
> entries this results in a panic().
>
> We can fix this by reverting 24fe1b0efad4fcdd, but the fun doesn't stop
> there. If we consider the unmap case, then a similar constraint applies to
> ordering subsequent memory accesses after the completion of the TLB
> invalidation, so we also need to add an ISB instruction to
> __flush_tlb_kernel_pgtable(). For user addresses, the exception return
> provides the necessary context synchronisation.
>
> This then raises an interesting question: if an ISB is required after a
> TLBI instruction to prevent speculative translation of subsequent
> instructions, how is this speculation prevented on concurrent CPUs that
> receive the broadcast TLB invalidation message? Sending and completing a
> broadcast TLB invalidation message does not imply execution of an ISB on
> the remote CPU, however it /does/ require that the remote CPU will no
> longer make use of any old translations, because otherwise we wouldn't be
> able to guarantee that an unmapped page could no longer be modified. In
> this regard, receiving a TLB invalidation is in some ways stronger than
> sending one (where you need the ISB).
>
> So far, so good, but the final piece of the puzzle isn't quite so rosy.
>
> *** Other architecture maintainers -- start here! ***
>
> In the case that one CPU maps a page and then sets a flag to tell another
> CPU:
>
>         CPU 0
>         -----
>
>         MOV     X0, <valid pte>
>         STR     X0, [Xptep]     // Store new PTE to page table
>         DSB     ISHST
>         ISB
>         MOV     X1, #1
>         STR     X1, [Xflag]     // Set the flag
>
>         CPU 1
>         -----
>
> loop:   LDAR    X0, [Xflag]     // Poll flag with Acquire semantics
>         CBZ     X0, loop
>         LDR     X1, [X2]        // Translates using the new PTE
>
> then the final load on CPU 1 can raise a translation fault for the same
> reasons as mentioned at the start of this description. In reality, code
> such as:
>
>         CPU 0                           CPU 1
>         -----                           -----
>         spin_lock(&lock);               spin_lock(&lock);
>         *ptr = vmalloc(size);           if (*ptr)
>         spin_unlock(&lock);                     foo = **ptr;
>                                         spin_unlock(&lock);
>
> will not trigger the fault because there is an address dependency on
> CPU1 which prevents the speculative translation. However, more exotic
> code where the virtual address is known ahead of time, such as:
>
>         CPU 0                           CPU 1
>         -----                           -----
>         spin_lock(&lock);               spin_lock(&lock);
>         set_fixmap(0, paddr, prot);     if (mapped)
>         mapped = true;                          foo = *fix_to_virt(0);
>         spin_unlock(&lock);             spin_unlock(&lock);
>
> could fault. This can be avoided by any of:
>
>         * Introducing broadcast TLB maintenance on the map path
>         * Adding a DSB;ISB sequence after checking a flag which indicates
>           that a virtual address is now mapped
>         * Handling the spurious fault
>
> Given that we have never observed a problem in the concurrent case under
> Linux, and future revisions of the architecture are being tightened so
> that translation table walks are effectively ordered in the same way as
> explicit memory accesses, we no longer treat spurious kernel faults as
> fatal if the page table indicates that the access was valid.
>
> Anyway, this patch series attempts to implement some of this and I plan
> to queue it for 5.4.
>
> Will
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> --->8
>
> Will Deacon (6):
>   Revert "arm64: Remove unnecessary ISBs from set_{pte,pmd,pud}"
>   arm64: tlb: Ensure we execute an ISB following walk cache invalidation
>   arm64: mm: Add ISB instruction to set_pgd()
>   arm64: sysreg: Add some field definitions for PAR_EL1
>   arm64: mm: Ignore spurious translation faults taken from the kernel
>   arm64: kvm: Replace hardcoded '1' with SYS_PAR_EL1_F
>
>  arch/arm64/include/asm/pgtable.h  | 13 ++++++++++---
>  arch/arm64/include/asm/sysreg.h   |  3 +++
>  arch/arm64/include/asm/tlbflush.h |  1 +
>  arch/arm64/kvm/hyp/switch.c       |  2 +-
>  arch/arm64/mm/fault.c             | 33 +++++++++++++++++++++++++++++++++
>  5 files changed, 48 insertions(+), 4 deletions(-)
>
> --
> 2.11.0
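[To make the effect of the revert concrete: after patch 1, the arm64 map
path pairs the DSB with an ISB for valid kernel mappings. The following is
a simplified sketch rather than the literal patch; pte_valid_not_user()
and the dsb()/isb() macros are the arm64 kernel's, but the exact guards in
the tree may differ.]

    /*
     * Sketch of set_pte() with the ISB restored. Kernel mappings need the
     * extra ISB because nothing else will synchronise the context before
     * the new translation is used; user mappings get that for free from
     * the exception return.
     */
    static inline void set_pte(pte_t *ptep, pte_t pte)
    {
            WRITE_ONCE(*ptep, pte);         /* Store new PTE to page table */

            if (pte_valid_not_user(pte)) {
                    dsb(ishst);     /* Make the PTE visible to the walker */
                    isb();          /* Discard stale speculative translations */
            }
    }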
Will Deacon's on August 27, 2019 11:18 pm:
> can actually raise a translation fault on the load instruction because the
> translation can be performed speculatively before the page table update and
> then marked as "faulting" by the CPU. For user PTEs, this is ok because we
> can handle the spurious fault, but for kernel PTEs and intermediate table
> entries this results in a panic().

powerpc sounds like it has the same coherency issue with stores vs loads
from the MMU's page table walker, and a barrier called ptesync to order
them.

> We can fix this by reverting 24fe1b0efad4fcdd, but the fun doesn't stop
> there. If we consider the unmap case, then a similar constraint applies to
> ordering subsequent memory accesses after the completion of the TLB
> invalidation, so we also need to add an ISB instruction to
> __flush_tlb_kernel_pgtable(). For user addresses, the exception return
> provides the necessary context synchronisation.
>
> This then raises an interesting question: if an ISB is required after a
> TLBI instruction to prevent speculative translation of subsequent
> instructions, how is this speculation prevented on concurrent CPUs that
> receive the broadcast TLB invalidation message? Sending and completing a
> broadcast TLB invalidation message does not imply execution of an ISB on
> the remote CPU, however it /does/ require that the remote CPU will no
> longer make use of any old translations, because otherwise we wouldn't be
> able to guarantee that an unmapped page could no longer be modified. In
> this regard, receiving a TLB invalidation is in some ways stronger than
> sending one (where you need the ISB).

Similar with powerpc's tlbie: the sender requires extra barriers!

> So far, so good, but the final piece of the puzzle isn't quite so rosy.
>
> *** Other architecture maintainers -- start here! ***
>
> In the case that one CPU maps a page and then sets a flag to tell another
> CPU:
>
>         CPU 0
>         -----
>
>         MOV     X0, <valid pte>
>         STR     X0, [Xptep]     // Store new PTE to page table
>         DSB     ISHST
>         ISB
>         MOV     X1, #1
>         STR     X1, [Xflag]     // Set the flag
>
>         CPU 1
>         -----
>
> loop:   LDAR    X0, [Xflag]     // Poll flag with Acquire semantics
>         CBZ     X0, loop
>         LDR     X1, [X2]        // Translates using the new PTE
>
> then the final load on CPU 1 can raise a translation fault for the same
> reasons as mentioned at the start of this description.

powerpc's ptesync instruction is defined to order MMU memory accesses on
all other CPUs. ptesync does not go out to the fabric though. How does it
work then?

Because the MMU coherency problem (at least the one we have) is not that
the load will begin to "partially" execute ahead of the store, enough to
kick off a table walk that goes ahead of the store, but not so much that
it violates the regular CPU barriers. It's just that the loads from the
MMU don't participate in the LSU pipeline: they don't snoop the store
queues and aren't inserted into load queues, so the regular memory barrier
instructions won't see stores from other threads causing ordering
violations.

In your first example, if powerpc just has a normal memory barrier there
instead of a ptesync, it could all execute completely non-speculatively
and in-order but still cause a fault, because the table walker's loads
didn't see the store in the store queue.

From the other side of the fabric you have no such problem. The table
walker is cache coherent apart from the local stores, so we don't need a
special barrier on the other side. That's why ptesync doesn't broadcast.

I would be surprised if ARM's issue is different, but interested to hear
if it is.

> In reality, code such as:
>
>         CPU 0                           CPU 1
>         -----                           -----
>         spin_lock(&lock);               spin_lock(&lock);
>         *ptr = vmalloc(size);           if (*ptr)
>         spin_unlock(&lock);                     foo = **ptr;
>                                         spin_unlock(&lock);
>
> will not trigger the fault because there is an address dependency on
> CPU1 which prevents the speculative translation. However, more exotic
> code where the virtual address is known ahead of time, such as:
>
>         CPU 0                           CPU 1
>         -----                           -----
>         spin_lock(&lock);               spin_lock(&lock);
>         set_fixmap(0, paddr, prot);     if (mapped)
>         mapped = true;                          foo = *fix_to_virt(0);
>         spin_unlock(&lock);             spin_unlock(&lock);
>
> could fault.

This is kind of a different issue, or part of a wider one at least.
Consider speculative execution more generally: any branch mispredict can
send us off to crazy town, executing instructions using nonsense register
values. CPU0 does not have to be in the picture, nor any kernel page table
modification at all; CPU1 alone will be doing speculative loads wildly all
over the kernel address space and trying to access pages with no pte.

Yet we don't have to flush the TLB when creating a new kernel mapping, and
we don't get spurious kernel faults. The page table walker won't install
negative entries, at least not "architectural" ones, i.e., entries that
cause faults and require flushing. My guess is ARM is similar, or you
would have seen bigger problems by now?

If you have CPU0 doing a ro->rw upgrade on a kernel PTE, then it may be
possible that another CPU1 would speculatively install a ro TLB entry and
then take a spurious fault on it when attempting to store to it. But no
amount of barriers would help, because CPU1 could have picked up that TLB
entry any time in the past.

Thanks,
Nick
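[Nick's store-queue point, restated in code form. This is an illustration
under the assumptions he describes, not actual powerpc kernel code; the
function name is hypothetical.]

    /*
     * Map-side fence sketch for powerpc. A plain "sync" orders the PTE
     * store with respect to other CPUs' loads and stores, but the table
     * walker's loads don't snoop the store queue, so the walker could
     * still miss the new PTE. "ptesync" additionally pushes the store out
     * to the point where the walker is coherent, which is why no barrier
     * is needed on the receiving side and ptesync doesn't broadcast.
     */
    static inline void map_side_fence(pte_t *ptep, pte_t pte)
    {
            WRITE_ONCE(*ptep, pte);                 /* Store new PTE */
            asm volatile("ptesync" : : : "memory"); /* not just "sync" */
    }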
Hi Nick,

On Wed, Aug 28, 2019 at 10:35:24AM +1000, Nicholas Piggin wrote:
> Will Deacon's on August 27, 2019 11:18 pm:
> > So far, so good, but the final piece of the puzzle isn't quite so rosy.
> >
> > *** Other architecture maintainers -- start here! ***
> >
> > In the case that one CPU maps a page and then sets a flag to tell
> > another CPU:
> >
> >         CPU 0
> >         -----
> >
> >         MOV     X0, <valid pte>
> >         STR     X0, [Xptep]     // Store new PTE to page table
> >         DSB     ISHST
> >         ISB
> >         MOV     X1, #1
> >         STR     X1, [Xflag]     // Set the flag
> >
> >         CPU 1
> >         -----
> >
> > loop:   LDAR    X0, [Xflag]     // Poll flag with Acquire semantics
> >         CBZ     X0, loop
> >         LDR     X1, [X2]        // Translates using the new PTE
> >
> > then the final load on CPU 1 can raise a translation fault for the same
> > reasons as mentioned at the start of this description.
>
> powerpc's ptesync instruction is defined to order MMU memory accesses on
> all other CPUs. ptesync does not go out to the fabric though. How does
> it work then?
>
> Because the MMU coherency problem (at least the one we have) is not that
> the load will begin to "partially" execute ahead of the store, enough to
> kick off a table walk that goes ahead of the store, but not so much that
> it violates the regular CPU barriers. It's just that the loads from the
> MMU don't participate in the LSU pipeline: they don't snoop the store
> queues and aren't inserted into load queues, so the regular memory
> barrier instructions won't see stores from other threads causing
> ordering violations.
>
> In your first example, if powerpc just has a normal memory barrier there
> instead of a ptesync, it could all execute completely non-speculatively
> and in-order but still cause a fault, because the table walker's loads
> didn't see the store in the store queue.

Ah, so I think this is where our DSB comes in, as opposed to our usual
DMB. DMB provides ordering guarantees and is generally the only barrier
instruction you need for shared memory communication. DSB, on the other
hand, has additional properties such as making page-table updates visible
to the table walker and completing pending TLB invalidation.

So in practice, DSB is likely to drain the store buffer to ensure that
pending page-table writes are visible at L1, which is coherent with all
CPUs (and their page-table walkers).

> From the other side of the fabric you have no such problem. The table
> walker is cache coherent apart from the local stores, so we don't need a
> special barrier on the other side. That's why ptesync doesn't broadcast.

Curious: but do you need to do anything extra to take into account
instruction fetch on remote CPUs if you're mapping an executable page?
We added an IPI to flush_icache_range() in 3b8c9f1cdfc5 to handle this,
because our broadcast I-cache maintenance doesn't force a pipeline flush
for remote CPUs (and may even execute as a NOP on recent cores).

> I would be surprised if ARM's issue is different, but interested to
> hear if it is.

If you take the four scenarios of Map/Unmap for the UP/SMP cases:

        Map+UP          // Some sort of fence instruction (DSB;ISB/ptesync)
        Map+SMP         // Same as above
        Unmap+UP        // Local TLB invalidation
        Unmap+SMP       // Broadcast TLB invalidation

then the most interesting case is Map+SMP, where one CPU transitions a PTE
from invalid to valid and executes its DSB;ISB/PTESYNC sequence without
affecting the remote CPU. That's what I'm trying to get at in my example
below:

> >         CPU 0                           CPU 1
> >         -----                           -----
> >         spin_lock(&lock);               spin_lock(&lock);
> >         set_fixmap(0, paddr, prot);     if (mapped)
> >         mapped = true;                          foo = *fix_to_virt(0);
> >         spin_unlock(&lock);             spin_unlock(&lock);
> >
> > could fault.
>
> This is kind of a different issue, or part of a wider one at least.
> Consider speculative execution more generally: any branch mispredict can
> send us off to crazy town, executing instructions using nonsense
> register values. CPU0 does not have to be in the picture, nor any kernel
> page table modification at all; CPU1 alone will be doing speculative
> loads wildly all over the kernel address space and trying to access
> pages with no pte.
>
> Yet we don't have to flush the TLB when creating a new kernel mapping,
> and we don't get spurious kernel faults. The page table walker won't
> install negative entries, at least not "architectural" ones, i.e.,
> entries that cause faults and require flushing. My guess is ARM is
> similar, or you would have seen bigger problems by now?

Right, we don't allow negative (invalid) entries to be cached in the TLB,
but CPUs can effectively cache the result of a translation for a
load/store instruction whilst that instruction flows down the pipe after
its virtual address is known. /That/ caching is not restricted to valid
translations.

For example, if we just take a simple message passing example where we
have two global variables: a 'mapped' flag (initialised to zero) and a
pointer (initialised to point at a page that is not yet mapped):

[please excuse the mess I've made of your assembly language]

        CPU 0

        // Set PTE which maps the page pointed to by pointer
        stw     r5, 0(r4)
        ptesync
        lwsync

        // Set 'mapped' flag to 1
        li      r9, 1
        stb     r9, 0(r3)


        CPU 1

        // Poll for 'mapped' flag to be set
loop:   lbz     r9, 0(r3)
        lwsync
        cmpdi   cr7, r9, 0
        beq     cr7, loop

        // Dereference pointer
        lwz     r4, 0(r5)


So in this example, I think it's surprising that CPU 1's dereference of
'pointer' can fault. The way this happens on arm64 is that CPU 1 can
translate the 'pointer' dereference before the load of the 'mapped' flag
has returned its data. The walker may come back with a fault, but then if
the flag data later comes back as 1, the fault will be taken when the lwz
commits. In other words, translation table walks can occur out-of-order
with respect to the accesses they are translating for, even in the
presence of memory barriers.

In practice, I think this kind of code sequence is unusual and triggering
the problem relies on CPU 1 knowing the virtual address it's going to
dereference before it's actually mapped. However, this could conceivably
happen with the fixmap and possibly also if the page-table itself was
being written concurrently using cmpxchg(), in which case you might use
the actual pte value in the same way as the 'mapped' flag above.

But yes, adding the spurious fault handler is belt and braces, which is
why I've kept a WARN_RATELIMIT() in there if it ever triggers.

> If you have CPU0 doing a ro->rw upgrade on a kernel PTE, then it may be
> possible that another CPU1 would speculatively install a ro TLB entry
> and then take a spurious fault on it when attempting to store to it. But
> no amount of barriers would help, because CPU1 could have picked up that
> TLB entry any time in the past.

Transitioning from ro->rw requires TLB invalidation, and so that falls
into the Unmap+SMP combination which I'm not worried about because it
doesn't pose a problem for us.

Will
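[The reader-side fix that the cover letter lists as an option ("a DSB;ISB
sequence after checking a flag") would look roughly like this on arm64. A
hedged sketch of the fixmap example: 'mapped', 'foo' and the fixmap slot
are the hypothetical names from the example above; smp_load_acquire(),
cpu_relax(), dsb()/isb() and fix_to_virt() are the kernel's.]

    static bool mapped;         /* set by CPU 0 after set_fixmap() */
    static int foo;

    void cpu1_reader(void)
    {
            /* Poll the flag with acquire semantics (LDAR) */
            while (!smp_load_acquire(&mapped))
                    cpu_relax();

            /*
             * The acquire load orders the data accesses, but not the
             * *translation* of the access below: the walk may already
             * have been performed, and marked as faulting, before
             * 'mapped' was observed. DSB;ISB discards that stale
             * speculative translation.
             */
            dsb(ish);
            isb();

            foo = *(int *)fix_to_virt(0);   /* now translates reliably */
    }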
Will Deacon's on August 29, 2019 2:12 am:
> Hi Nick,
>
> On Wed, Aug 28, 2019 at 10:35:24AM +1000, Nicholas Piggin wrote:
>> Will Deacon's on August 27, 2019 11:18 pm:
>> > So far, so good, but the final piece of the puzzle isn't quite so rosy.
>> >
>> > *** Other architecture maintainers -- start here! ***
>> >
>> > In the case that one CPU maps a page and then sets a flag to tell
>> > another CPU:
>> >
>> >         CPU 0
>> >         -----
>> >
>> >         MOV     X0, <valid pte>
>> >         STR     X0, [Xptep]     // Store new PTE to page table
>> >         DSB     ISHST
>> >         ISB
>> >         MOV     X1, #1
>> >         STR     X1, [Xflag]     // Set the flag
>> >
>> >         CPU 1
>> >         -----
>> >
>> > loop:   LDAR    X0, [Xflag]     // Poll flag with Acquire semantics
>> >         CBZ     X0, loop
>> >         LDR     X1, [X2]        // Translates using the new PTE
>> >
>> > then the final load on CPU 1 can raise a translation fault for the
>> > same reasons as mentioned at the start of this description.
>>
>> powerpc's ptesync instruction is defined to order MMU memory accesses on
>> all other CPUs. ptesync does not go out to the fabric though. How does
>> it work then?
>>
>> Because the MMU coherency problem (at least the one we have) is not that
>> the load will begin to "partially" execute ahead of the store, enough to
>> kick off a table walk that goes ahead of the store, but not so much that
>> it violates the regular CPU barriers. It's just that the loads from the
>> MMU don't participate in the LSU pipeline: they don't snoop the store
>> queues and aren't inserted into load queues, so the regular memory
>> barrier instructions won't see stores from other threads causing
>> ordering violations.
>>
>> In your first example, if powerpc just has a normal memory barrier there
>> instead of a ptesync, it could all execute completely non-speculatively
>> and in-order but still cause a fault, because the table walker's loads
>> didn't see the store in the store queue.
>
> Ah, so I think this is where our DSB comes in, as opposed to our usual
> DMB. DMB provides ordering guarantees and is generally the only barrier
> instruction you need for shared memory communication. DSB, on the other
> hand, has additional properties such as making page-table updates visible
> to the table walker and completing pending TLB invalidation.
>
> So in practice, DSB is likely to drain the store buffer to ensure that
> pending page-table writes are visible at L1, which is coherent with all
> CPUs (and their page-table walkers).
>
>> From the other side of the fabric you have no such problem. The table
>> walker is cache coherent apart from the local stores, so we don't need a
>> special barrier on the other side. That's why ptesync doesn't broadcast.
>
> Curious: but do you need to do anything extra to take into account
> instruction fetch on remote CPUs if you're mapping an executable page?
> We added an IPI to flush_icache_range() in 3b8c9f1cdfc5 to handle this,
> because our broadcast I-cache maintenance doesn't force a pipeline flush
> for remote CPUs (and may even execute as a NOP on recent cores).

Ah, I think the tlbie does not force re-fetch indeed. We may need
something like that as well.

What do you do on the user side? Require threads to ISB themselves?

>> I would be surprised if ARM's issue is different, but interested to
>> hear if it is.
>
> If you take the four scenarios of Map/Unmap for the UP/SMP cases:
>
>         Map+UP          // Some sort of fence instruction (DSB;ISB/ptesync)
>         Map+SMP         // Same as above
>         Unmap+UP        // Local TLB invalidation
>         Unmap+SMP       // Broadcast TLB invalidation
>
> then the most interesting case is Map+SMP, where one CPU transitions a
> PTE from invalid to valid and executes its DSB;ISB/PTESYNC sequence
> without affecting the remote CPU. That's what I'm trying to get at in my
> example below:
>
>> >         CPU 0                           CPU 1
>> >         -----                           -----
>> >         spin_lock(&lock);               spin_lock(&lock);
>> >         set_fixmap(0, paddr, prot);     if (mapped)
>> >         mapped = true;                          foo = *fix_to_virt(0);
>> >         spin_unlock(&lock);             spin_unlock(&lock);
>> >
>> > could fault.
>>
>> This is kind of a different issue, or part of a wider one at least.
>> Consider speculative execution more generally: any branch mispredict can
>> send us off to crazy town, executing instructions using nonsense
>> register values. CPU0 does not have to be in the picture, nor any kernel
>> page table modification at all; CPU1 alone will be doing speculative
>> loads wildly all over the kernel address space and trying to access
>> pages with no pte.
>>
>> Yet we don't have to flush the TLB when creating a new kernel mapping,
>> and we don't get spurious kernel faults. The page table walker won't
>> install negative entries, at least not "architectural" ones, i.e.,
>> entries that cause faults and require flushing. My guess is ARM is
>> similar, or you would have seen bigger problems by now?
>
> Right, we don't allow negative (invalid) entries to be cached in the TLB,
> but CPUs can effectively cache the result of a translation for a
> load/store instruction whilst that instruction flows down the pipe after
> its virtual address is known. /That/ caching is not restricted to valid
> translations.

Ah, I misunderstood you, sorry. Yeah, that is interesting; I don't think
that is explicitly prohibited in the power ISA either. I believe CPU1
would have to do a ptesync to avoid the fault there.

> For example, if we just take a simple message passing example where we
> have two global variables: a 'mapped' flag (initialised to zero) and a
> pointer (initialised to point at a page that is not yet mapped):
>
> [please excuse the mess I've made of your assembly language]
>
>         CPU 0
>
>         // Set PTE which maps the page pointed to by pointer
>         stw     r5, 0(r4)
>         ptesync
>         lwsync
>
>         // Set 'mapped' flag to 1
>         li      r9, 1
>         stb     r9, 0(r3)
>
>
>         CPU 1
>
>         // Poll for 'mapped' flag to be set
> loop:   lbz     r9, 0(r3)
>         lwsync
>         cmpdi   cr7, r9, 0
>         beq     cr7, loop
>
>         // Dereference pointer
>         lwz     r4, 0(r5)
>
>
> So in this example, I think it's surprising that CPU 1's dereference of
> 'pointer' can fault. The way this happens on arm64 is that CPU 1 can
> translate the 'pointer' dereference before the load of the 'mapped' flag
> has returned its data. The walker may come back with a fault, but then if
> the flag data later comes back as 1, the fault will be taken when the lwz
> commits. In other words, translation table walks can occur out-of-order
> with respect to the accesses they are translating for, even in the
> presence of memory barriers.
>
> In practice, I think this kind of code sequence is unusual and triggering
> the problem relies on CPU 1 knowing the virtual address it's going to
> dereference before it's actually mapped. However, this could conceivably
> happen with the fixmap and possibly also if the page-table itself was
> being written concurrently using cmpxchg(), in which case you might use
> the actual pte value in the same way as the 'mapped' flag above.
>
> But yes, adding the spurious fault handler is belt and braces, which is
> why I've kept a WARN_RATELIMIT() in there if it ever triggers.

This could be a theoretical problem for powerpc too, I think. Maybe.
I'll ask around; I might not be understanding the architecture or Linux
code properly.

Thanks,
Nick
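[For reference, the "belt and braces" handler boils down to re-walking the
faulting address and checking whether the fault reproduces. The sketch
below shows the idea using AT S1E1R and PAR_EL1.F, the field patch 4 of
the series names; the in-tree helper in arch/arm64/mm/fault.c is more
careful about fault status codes and IRQ masking than this simplified
version.]

    /*
     * Sketch: decide whether a kernel translation fault was spurious by
     * asking the MMU to translate the address again. If the walk succeeds
     * now (PAR_EL1.F == 0), the original fault was stale speculation and
     * can be ignored (with a rate-limited warning in the real code).
     */
    static bool fault_was_spurious(unsigned long addr)
    {
            u64 par;

            asm volatile("at s1e1r, %0" : : "r" (addr));
            isb();
            par = read_sysreg(par_el1);

            return !(par & SYS_PAR_EL1_F);
    }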
On Fri, Aug 30, 2019 at 12:08:52AM +1000, Nicholas Piggin wrote:
> Will Deacon's on August 29, 2019 2:12 am:
> > On Wed, Aug 28, 2019 at 10:35:24AM +1000, Nicholas Piggin wrote:
> >> From the other side of the fabric you have no such problem. The table
> >> walker is cache coherent apart from the local stores, so we don't need
> >> a special barrier on the other side. That's why ptesync doesn't
> >> broadcast.
> >
> > Curious: but do you need to do anything extra to take into account
> > instruction fetch on remote CPUs if you're mapping an executable page?
> > We added an IPI to flush_icache_range() in 3b8c9f1cdfc5 to handle this,
> > because our broadcast I-cache maintenance doesn't force a pipeline flush
> > for remote CPUs (and may even execute as a NOP on recent cores).
>
> Ah, I think the tlbie does not force re-fetch indeed. We may need
> something like that as well.
>
> What do you do on the user side? Require threads to ISB themselves?

I think they'd probably have to use sys_membarrier() with
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, yes.

Will
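[For completeness, here is roughly what that looks like from userspace: a
minimal sketch using the membarrier(2) commands Will names. A process must
register before it can issue the SYNC_CORE barrier; there is no libc
wrapper, so the syscall is invoked directly.]

    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int membarrier(int cmd, unsigned int flags)
    {
            return syscall(__NR_membarrier, cmd, flags);
    }

    int main(void)
    {
            /* Opt in once per process before using the SYNC_CORE command. */
            if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0))
                    return 1;

            /* ... patch code / publish a new executable mapping here ... */

            /*
             * All other threads of this process execute a core-serialising
             * instruction (an ISB on arm64) before returning to user
             * space, so none of them can keep running stale instructions.
             */
            if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0))
                    return 1;

            return 0;
    }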