Message ID: 20210131001132.3368247-1-namit@vmware.com
Series: TLB batching consolidation and enhancements
On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
>    ongoing deferred TLB flush, and flushing the entire range eventually
>    (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. Setting a flag that a deferred TLB flush is pending, and flushing
>    later (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?
> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>>
>> 1. Using mmu_gather (e.g., zap_page_range()).
>>
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
>>    ongoing deferred TLB flush, and flushing the entire range eventually
>>    (e.g., change_protection_range()).
>>
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>
>> 4. Batching per-table flushes (move_ptes()).
>>
>> 5. Setting a flag that a deferred TLB flush is pending, and flushing
>>    later (try_to_unmap_one() on x86).
>
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.
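For reference, the batched-unmap mechanism in question (scheme 5) has
roughly the following shape. This is a simplified sketch loosely based on
mm/rmap.c and the x86 arch_tlbbatch_* helpers around v5.11; it is
abbreviated and not the exact kernel code.

/*
 * Simplified sketch of the batched-unmap flush (scheme 5), loosely
 * following mm/rmap.c around v5.11; abbreviated, not the exact code.
 */

/* Reclaim path: instead of flushing per page, remember that a flush is
 * owed for this mm and which CPUs may hold stale entries. */
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
{
        struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

        arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);  /* records mm_cpumask(mm) */
        tlb_ubc->flush_required = true;

        /*
         * Single per-mm flag that other flush paths consult; this is the
         * coarse flag the cover letter wants to replace with a cpumask.
         */
        mm->tlb_flush_batched = true;

        if (writable)
                tlb_ubc->writable = true;
}

/* Flushed before the reclaimed pages are actually freed and reused. */
void try_to_unmap_flush(void)
{
        struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

        if (!tlb_ubc->flush_required)
                return;

        arch_tlbbatch_flush(&tlb_ubc->arch);  /* one shootdown for the batch */
        tlb_ubc->flush_required = false;
        tlb_ubc->writable = false;
}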
Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
>    ongoing deferred TLB flush, and flushing the entire range eventually
>    (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. Setting a flag that a deferred TLB flush is pending, and flushing
>    later (try_to_unmap_one() on x86).
>
> It seems that (1)-(4) can be consolidated. In addition, it seems that
> (5) is racy. It also seems there can be many redundant TLB flushes, and
> potentially TLB-shootdown storms, for instance during batched
> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
> defers TLB flushes.
>
> More aggressive TLB batching may be possible, but this patch-set does
> not add such batching. The proposed changes would enable such batching
> at a later time.
>
> Admittedly, I do not understand how things are not broken today, which
> makes me hesitant to add further batching before getting things in
> order. For instance, why is it ok for zap_pte_range() to batch
> dirty-PTE flushes for each page-table (but not at greater granularity)?
> Can't ClearPageDirty() be called before the flush, causing writes after
> ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from
cleaning the page. Or am I misunderstanding the question?

I'll go through the patches a bit more closely when they all come
through. Sparc and powerpc of course need the arch lazy mode to get
per-page/pte information for operations that are not freeing pages,
which is what mmu gather is designed for.

I wouldn't mind using a similar API so it's less of a black box when
reading generic code, but it might not quite fit the mmu gather API
exactly (most of these paths don't want a full mmu_gather on stack).

> This patch-set therefore performs the following changes:
>
> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>    instead of {inc|dec}_tlb_flush_pending().
>
> 2. Avoid TLB flushes if PTE permission is not demoted.
>
> 3. Clean up mmu_gather to be less arch-dependent.
>
> 4. Use the mm's generations to track at finer granularity, either
>    per-VMA or per page-table, whether a pending mmu_gather operation is
>    outstanding. This should make it possible to avoid some TLB flushes
>    when KSM or memory reclamation takes place while another operation
>    such as munmap() or mprotect() is running.
>
> 5. Change the try_to_unmap_one() flushing scheme, as the current one
>    seems broken: track in a bitmap which CPUs have outstanding TLB
>    flushes instead of using a single flag.

Putting fixes first, and cleanups and independent patches (like #2) next
would help with getting stuff merged and backported.

Thanks,
Nick
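To make the ordering concrete: in zap_pte_range() (mm/memory.c) the dirty
bit is transferred with set_page_dirty() and the forced TLB flush is
issued before the page-table lock is dropped, and page_mkclean() must take
that same lock before it can clean the page. Below is a heavily
abbreviated sketch of that function; swap entries, huge pages and the
page-batching details are omitted, so treat it as an illustration rather
than the literal code.

/* Heavily abbreviated sketch of zap_pte_range(); swap entries, huge
 * pages and page batching are omitted. */
static unsigned long zap_pte_range_sketch(struct mmu_gather *tlb,
                                          struct vm_area_struct *vma,
                                          pmd_t *pmd, unsigned long addr,
                                          unsigned long end)
{
        struct mm_struct *mm = tlb->mm;
        int force_flush = 0;
        spinlock_t *ptl;
        pte_t *start_pte, *pte;

        start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        do {
                pte_t ptent = *pte;
                struct page *page;

                if (pte_none(ptent) || !pte_present(ptent))
                        continue;

                page = vm_normal_page(vma, addr, ptent);
                ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
                tlb_remove_tlb_entry(tlb, pte, addr);   /* queue the flush */
                if (!page)
                        continue;

                if (pte_dirty(ptent)) {
                        force_flush = 1;        /* flush must not slip past the lock */
                        set_page_dirty(page);
                }
                page_remove_rmap(page, false);
        } while (pte++, addr += PAGE_SIZE, addr != end);

        /* Do the TLB flush before dropping the ptl: page_mkclean() takes
         * the same lock, so it cannot observe a "clean" page while some
         * CPU still holds a stale writable TLB entry. */
        if (force_flush)
                tlb_flush_mmu_tlbonly(tlb);
        pte_unmap_unlock(start_pte, ptl);

        return addr;
}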
> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit <namit@vmware.com>
>>
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>>
>> 1. Using mmu_gather (e.g., zap_page_range()).
>>
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
>>    ongoing deferred TLB flush, and flushing the entire range eventually
>>    (e.g., change_protection_range()).
>>
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>
>> 4. Batching per-table flushes (move_ptes()).
>>
>> 5. Setting a flag that a deferred TLB flush is pending, and flushing
>>    later (try_to_unmap_one() on x86).
>>
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>>
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> at a later time.
>>
>> Admittedly, I do not understand how things are not broken today, which
>> makes me hesitant to add further batching before getting things in
>> order. For instance, why is it ok for zap_pte_range() to batch
>> dirty-PTE flushes for each page-table (but not at greater granularity)?
>> Can't ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
>
> Because it's holding the page table lock which stops page_mkclean from
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now
understand my confusion: I forgot that the reverse mapping is removed
after the PTE is zapped.

Makes me wonder whether it is ok to defer the TLB flush to
tlb_finish_mmu(), by performing set_page_dirty() for the batched pages
when needed in tlb_finish_mmu() [ i.e., by marking for each batched page
whether set_page_dirty() should be issued for that page while collecting
them ].

> I'll go through the patches a bit more closely when they all come
> through. Sparc and powerpc of course need the arch lazy mode to get
> per-page/pte information for operations that are not freeing pages,
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new
PTE where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also
holds the pages.

>> This patch-set therefore performs the following changes:
>>
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>    instead of {inc|dec}_tlb_flush_pending().
>>
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>>
>> 3. Clean up mmu_gather to be less arch-dependent.
>>
>> 4. Use the mm's generations to track at finer granularity, either
>>    per-VMA or per page-table, whether a pending mmu_gather operation is
>>    outstanding. This should make it possible to avoid some TLB flushes
>>    when KSM or memory reclamation takes place while another operation
>>    such as munmap() or mprotect() is running.
>>
>> 5. Change the try_to_unmap_one() flushing scheme, as the current one
>>    seems broken: track in a bitmap which CPUs have outstanding TLB
>>    flushes instead of using a single flag.
>
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to create, so I did not include
them in the “fixes” section. I will restructure the patch-set according
to the feedback.

Thanks,
Nadav
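As an illustration only, and not the actual patch, the direction described
in point 5 (together with the "mm: make mm_cpumask() volatile" and
"lib/cpumask: introduce cpumask_atomic_or()" patches in the series) could
look roughly like the sketch below. The field name tlb_flush_batched_cpus
is made up for this sketch, the cpumask_atomic_or() signature is assumed,
and the ordering/atomicity details the real patches must handle are
omitted.

/*
 * Illustration of the idea in point 5 only -- NOT the actual patch.
 * A per-mm cpumask (hypothetical field tlb_flush_batched_cpus) records
 * which CPUs may still hold stale entries from a batched unmap, instead
 * of a single true/false flag for the whole mm.
 */
static void note_batched_unmap(struct mm_struct *mm)
{
        /* cpumask_atomic_or() is introduced by this series; assumed to
         * OR the source mask into the destination atomically. */
        cpumask_atomic_or(&mm->tlb_flush_batched_cpus, mm_cpumask(mm));
}

static void flush_batched_pending(struct mm_struct *mm)
{
        if (cpumask_empty(&mm->tlb_flush_batched_cpus))
                return;

        flush_tlb_mm(mm);       /* could be narrowed to the recorded CPUs */
        cpumask_clear(&mm->tlb_flush_batched_cpus);

        /* The real change must order the clear against concurrent
         * note_batched_unmap() callers; that is elided here. */
}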
> On Jan 30, 2021, at 11:57 PM, Nadav Amit <namit@vmware.com> wrote:
>
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>>
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>>
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
>>>    ongoing deferred TLB flush, and flushing the entire range eventually
>>>    (e.g., change_protection_range()).
>>>
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>>
>>> 4. Batching per-table flushes (move_ptes()).
>>>
>>> 5. Setting a flag that a deferred TLB flush is pending, and flushing
>>>    later (try_to_unmap_one() on x86).
>>>
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>>
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> at a later time.
>>>
>>> Admittedly, I do not understand how things are not broken today, which
>>> makes me hesitant to add further batching before getting things in
>>> order. For instance, why is it ok for zap_pte_range() to batch
>>> dirty-PTE flushes for each page-table (but not at greater granularity)?
>>> Can't ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>>
>> Because it's holding the page table lock which stops page_mkclean from
>> cleaning the page. Or am I misunderstanding the question?
>
> Thanks. I understood this part. Looking again at the code, I now
> understand my confusion: I forgot that the reverse mapping is removed
> after the PTE is zapped.
>
> Makes me wonder whether it is ok to defer the TLB flush to
> tlb_finish_mmu(), by performing set_page_dirty() for the batched pages
> when needed in tlb_finish_mmu() [ i.e., by marking for each batched page
> whether set_page_dirty() should be issued for that page while collecting
> them ].

Correcting myself (I hope): no, we cannot do so, since the buffers might
be removed from the page at that point.
On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> > I'll go through the patches a bit more closely when they all come
> > through. Sparc and powerpc of course need the arch lazy mode to get
> > per-page/pte information for operations that are not freeing pages,
> > which is what mmu gather is designed for.
>
> IIUC you mean any PTE change requires a TLB flush. Even setting up a new
> PTE where no previous PTE was set, right?

These are the HASH architectures. Their hardware doesn't walk the
page-tables, but it consults a hash-table to resolve page translations.

They _MUST_ flush the entries under the PTL to avoid ever seeing
conflicting information, which will make them really unhappy.

They can do this because they have TLBI broadcast.

There are a few more details, I'm sure, but those seem to have slipped my
mind.
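For reference, scheme 3 is the pattern below, condensed from
change_pte_range() in mm/mprotect.c (THP, migration entries and statistics
are omitted, so this is a sketch rather than the literal code). On the
hash architectures, arch_leave_lazy_mmu_mode() is where the batched
hash-table updates get pushed out, and it runs while the PTL is still
held, matching what Peter describes above.

/* Condensed sketch of change_pte_range() in mm/mprotect.c; THP,
 * migration entries and statistics are omitted. */
static unsigned long change_pte_range_sketch(struct vm_area_struct *vma,
                                             pmd_t *pmd, unsigned long addr,
                                             unsigned long end,
                                             pgprot_t newprot)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long pages = 0;
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        arch_enter_lazy_mmu_mode();             /* start batching updates */
        do {
                pte_t oldpte = *pte;

                if (pte_present(oldpte)) {
                        pte_t ptent;

                        ptent = ptep_modify_prot_start(vma, addr, pte);
                        ptent = pte_modify(ptent, newprot);
                        ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
                        pages++;
                }
        } while (pte++, addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();             /* flush the batch, still under ptl */
        pte_unmap_unlock(pte - 1, ptl);

        return pages;
}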
Excerpts from Peter Zijlstra's message of February 1, 2021 10:44 pm:
> On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
>> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
>> > I'll go through the patches a bit more closely when they all come
>> > through. Sparc and powerpc of course need the arch lazy mode to get
>> > per-page/pte information for operations that are not freeing pages,
>> > which is what mmu gather is designed for.
>>
>> IIUC you mean any PTE change requires a TLB flush. Even setting up a new
>> PTE where no previous PTE was set, right?

In cases of increasing permissiveness of access, yes, it may want to
update the "TLB" (read: hash table) to avoid taking hash-table faults.

But whatever the reason for the flush, there may have to be more data
carried than just the virtual address range and/or physical pages. If you
clear out the PTE then you have no guarantee of actually being able to go
back and address the in-memory or in-hardware translation structures to
update them, depending on what exact scheme is used (powerpc probably
could if all page sizes were the same, but THP or 64k/4k sub-pages would
throw a spanner in those works).

> These are the HASH architectures. Their hardware doesn't walk the
> page-tables, but it consults a hash-table to resolve page translations.

Yeah, it's very cool in a masochistic way.

I actually don't know if it's worth doing a big rework of it, as much as
I'd like to, rather than just keeping it in place and eventually
dismantling some of the go-fast hooks from core code if we can one day
deprecate it in favour of the much easier radix mode.

The whole thing is like a big steam train: years ago Paul and Ben and
Anton and co got the boiler stoked up and set all the valves just right,
so it runs unbelievably well for what it's actually doing, but look at it
the wrong way and the whole thing could blow up (at least that's what it
feels like to me, probably because I don't know the code that well).

Sparc could probably do the same; not sure about Xen. I don't suppose
vmware is intending to add any kind of paravirt mode related to this
stuff?

Thanks,
Nick
From: Nadav Amit <namit@vmware.com>

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
   ongoing deferred TLB flush, and flushing the entire range eventually
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. Setting a flag that a deferred TLB flush is pending, and flushing
   later (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
at a later time.

Admittedly, I do not understand how things are not broken today, which
makes me hesitant to add further batching before getting things in
order. For instance, why is it ok for zap_pte_range() to batch dirty-PTE
flushes for each page-table (but not at greater granularity)? Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?

This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
   instead of {inc|dec}_tlb_flush_pending().

2. Avoid TLB flushes if PTE permission is not demoted.

3. Clean up mmu_gather to be less arch-dependent.

4. Use the mm's generations to track at finer granularity, either
   per-VMA or per page-table, whether a pending mmu_gather operation is
   outstanding. This should make it possible to avoid some TLB flushes
   when KSM or memory reclamation takes place while another operation
   such as munmap() or mprotect() is running.

5. Change the try_to_unmap_one() flushing scheme, as the current one
   seems broken: track in a bitmap which CPUs have outstanding TLB
   flushes instead of using a single flag.

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were only very lightly tested. I am looking forward to your
feedback regarding the overall approach, and on whether to split the
series into multiple patch-sets.
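For context on change (1): scheme 2, as it exists today in
change_protection_range() (mm/mprotect.c), has roughly the shape below.
This is a condensed sketch, not the literal code; the page-table walk is
folded into a hypothetical change_protection_walk() helper for brevity.
Concurrent code (e.g. the copy-on-write path) checks
mm_tlb_flush_pending() to learn that a flush is still outstanding.

/* Condensed sketch of today's scheme 2; not the literal kernel code. */

/* Hypothetical stand-in for the pgd/p4d/pud/pmd walk. */
static unsigned long change_protection_walk(struct vm_area_struct *vma,
                                            unsigned long start,
                                            unsigned long end,
                                            pgprot_t newprot);

static unsigned long change_protection_range_sketch(struct vm_area_struct *vma,
                                                     unsigned long start,
                                                     unsigned long end,
                                                     pgprot_t newprot)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long pages;

        inc_tlb_flush_pending(mm);      /* advertise the deferred flush */
        pages = change_protection_walk(vma, start, end, newprot);

        /* One flush for the whole range, only if something changed. */
        if (pages)
                flush_tlb_range(vma, start, end);
        dec_tlb_flush_pending(mm);

        return pages;
}

Change (1) replaces this counter-based scheme with an mmu_gather, which is
also what the finer-grained tracking in change (4) builds on.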
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: x86@kernel.org
Cc: Yu Zhao <yuzhao@google.com>

Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h         |   4 +-
 arch/arm/include/asm/pgtable.h        |   4 +-
 arch/arm64/include/asm/pgtable.h      |   4 +-
 arch/csky/Kconfig                     |   1 +
 arch/csky/include/asm/tlb.h           |  12 --
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/include/asm/tlb.h        |   2 -
 arch/s390/Kconfig                     |   1 +
 arch/s390/include/asm/tlb.h           |   3 -
 arch/sparc/Kconfig                    |   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h       |   2 -
 arch/sparc/mm/init_64.c               |   2 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/hyperv/mmu.c                 |   2 +-
 arch/x86/include/asm/mmu.h            |  10 -
 arch/x86/include/asm/mmu_context.h    |   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h        |  24 +--
 arch/x86/include/asm/tlb.h            |  21 +-
 arch/x86/include/asm/tlbbatch.h       |  15 --
 arch/x86/include/asm/tlbflush.h       |  61 ++++--
 arch/x86/mm/tlb.c                     |  52 +++--
 arch/x86/xen/mmu_pv.c                 |   2 +-
 drivers/firmware/efi/efi.c            |   1 +
 fs/proc/task_mmu.c                    |  29 ++-
 include/asm-generic/bitops/find.h     |   8 +-
 include/asm-generic/tlb.h             | 291 +++++++++++++++++++++-----
 include/linux/bitmap.h                |  21 +-
 include/linux/cpumask.h               |  40 ++--
 include/linux/huge_mm.h               |   3 +-
 include/linux/mm.h                    |  29 ++-
 include/linux/mm_types.h              | 166 ++++++++++-----
 include/linux/mm_types_task.h         |  13 --
 include/linux/pgtable.h               |   2 +-
 include/linux/smp.h                   |   6 +-
 init/Kconfig                          |  21 ++
 kernel/fork.c                         |   2 +
 kernel/smp.c                          |   8 +-
 lib/bitmap.c                          |  33 ++-
 lib/cpumask.c                         |   8 +-
 lib/find_bit.c                        |  10 +-
 mm/huge_memory.c                      |   6 +-
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  16 --
 mm/ksm.c                              |   2 +-
 mm/madvise.c                          |   6 +-
 mm/mapping_dirty_helpers.c            |  52 +++--
 mm/memory.c                           |   2 +
 mm/mmap.c                             |   1 +
 mm/mmu_gather.c                       |  59 +++++-
 mm/mprotect.c                         |  55 ++---
 mm/mremap.c                           |   2 +-
 mm/pgtable-generic.c                  |   2 +-
 mm/rmap.c                             |  42 ++--
 mm/vmscan.c                           |   1 +
 56 files changed, 803 insertions(+), 374 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h