Message ID | 20250122232716.1321171-1-roman.gushchin@linux.dev (mailing list archive) |
---|---|
State | New |
Series | [v2] mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() |
On Wed, 22 Jan 2025, Roman Gushchin wrote: > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") > added a forced tlbflush to tlb_vma_end(), which is required to avoid a > race between munmap() and unmap_mapping_range(). However it added some > overhead to other paths where tlb_vma_end() is used, but vmas are not > removed, e.g. madvise(MADV_DONTNEED). > > Fix this by moving the tlb flush out of tlb_end_vma() into > free_pgtables(), somewhat similar to the stable version of the > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush > for PFNMAP mappings before unlink_file_vma()"). > > Note, that if tlb->fullmm is set, no flush is required, as the whole > mm is about to be destroyed. > > v2: > - moved vma_pfn flag handling into tlb.h (by Peter Z.) > - added comments (by Peter Z.) > - fixed the vma_pfn flag setting (by Hugh D.) > > Suggested-by: Jann Horn <jannh@google.com> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Will Deacon <will@kernel.org> > Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Nick Piggin <npiggin@gmail.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: linux-arch@vger.kernel.org > Cc: linux-mm@kvack.org > --- > include/asm-generic/tlb.h | 41 ++++++++++++++++++++++++++------------- > mm/memory.c | 7 +++++++ > 2 files changed, 35 insertions(+), 13 deletions(-) > > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 709830274b75..fbe31f49a5af 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -449,7 +449,14 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) > */ > tlb->vma_huge = is_vm_hugetlb_page(vma); > tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); > - tlb->vma_pfn = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > + > + /* > + * vma_pfn is checked and cleared by tlb_flush_mmu_pfnmap() > + * for a set of vma's, so it should be set if at least one vma > + * has VM_PFNMAP or VM_MIXEDMAP flags set. > + */ > + if (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) > + tlb->vma_pfn = 1; Okay, but struct mmu_gather is usually on the caller's stack, containing junk initially, and there's nothing in this patch yet to initialize tlb->vma_pfn to 0. __tlb_reset_range() needs to do that, doesn't it? With some adjustment to its comment about not resetting mmu_gather::vma_* fields. Or would it be better to get around that by renaming vma_pfn to, er, something else - I'd have to understand the essence of Jann's race better to come up with the right name - untracked_mappings? would that be right? I still haven't grasped the essence of that race. (Panic attack: where is, for example, tlb->need_flush_all initialized to 0? Ah, over in mm/mmu_gather.c, __tlb_gather_mmu(). Phew.) And if __tlb_reset_range() resets tlb->vma_pfn to 0, then that has the side-effect that any TLB flush cancels the vma_pfn state: which is a desirable side-effect, isn't it? avoiding the possibility of doing an unnecessary extra TLB flush in free_pgtables(), as I criticized before. 
Hugh > } > > static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > @@ -466,6 +473,22 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > __tlb_reset_range(tlb); > } > > +static inline void tlb_flush_mmu_pfnmap(struct mmu_gather *tlb) > +{ > + /* > + * VM_PFNMAP and VM_MIXEDMAP maps are fragile because the core mm > + * doesn't track the page mapcount -- there might not be page-frames > + * for these PFNs after all. Force flush TLBs for such ranges to avoid > + * munmap() vs unmap_mapping_range() races. > + * Ensure we have no stale TLB entries by the time this mapping is > + * removed from the rmap. > + */ > + if (unlikely(!tlb->fullmm && tlb->vma_pfn)) { > + tlb_flush_mmu_tlbonly(tlb); > + tlb->vma_pfn = 0; > + } > +} > + > static inline void tlb_remove_page_size(struct mmu_gather *tlb, > struct page *page, int page_size) > { > @@ -549,22 +572,14 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct * > > static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > { > - if (tlb->fullmm) > + if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) > return; > > /* > - * VM_PFNMAP is more fragile because the core mm will not track the > - * page mapcount -- there might not be page-frames for these PFNs after > - * all. Force flush TLBs for such ranges to avoid munmap() vs > - * unmap_mapping_range() races. > + * Do a TLB flush and reset the range at VMA boundaries; this avoids > + * the ranges growing with the unused space between consecutive VMAs. > */ > - if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) { > - /* > - * Do a TLB flush and reset the range at VMA boundaries; this avoids > - * the ranges growing with the unused space between consecutive VMAs. > - */ > - tlb_flush_mmu_tlbonly(tlb); > - } > + tlb_flush_mmu_tlbonly(tlb); > } > > /* > diff --git a/mm/memory.c b/mm/memory.c > index 398c031be9ba..c2a9effb2e32 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > { > struct unlink_vma_file_batch vb; > > + /* > + * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here: > + * force flush TLBs for such ranges to avoid munmap() vs > + * unmap_mapping_range() races. > + */ > + tlb_flush_mmu_pfnmap(tlb); > + > do { > unsigned long addr = vma->vm_start; > struct vm_area_struct *next; > -- > 2.48.1.262.g85cc9f2d1e-goog
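For reference, the adjustment Hugh is asking about -- making sure vma_pfn starts out zeroed and is cleared again whenever the gather range is reset -- would look roughly like the sketch below. This is only a sketch: the rest of __tlb_reset_range() is elided and the comment wording is illustrative, not taken from any posted patch.

```c
static inline void __tlb_reset_range(struct mmu_gather *tlb)
{
	/* ... existing resetting of tlb->start/end, freed_tables, etc. ... */

	/*
	 * Do not reset mmu_gather::vma_* fields here, we do not
	 * call into tlb_start_vma() again to set them if there is an
	 * intermediate flush.
	 *
	 * vma_pfn is the exception: it only matters while a TLB
	 * invalidation for a PFNMAP/MIXEDMAP range is still pending,
	 * so any flush that resets the range can also clear it.
	 */
	tlb->vma_pfn = 0;
}
```

As Hugh notes, clearing it here covers the on-stack, uninitialized case (every gather starts with a range reset from __tlb_gather_mmu()) and also means an intermediate flush cancels the vma_pfn state, avoiding a redundant extra flush later in free_pgtables().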
On Wed, 22 Jan 2025, Hugh Dickins wrote:
> On Wed, 22 Jan 2025, Roman Gushchin wrote:
> 
> > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas")
> > added a forced tlbflush to tlb_vma_end(), which is required to avoid a
> > race between munmap() and unmap_mapping_range(). However it added some
> > overhead to other paths where tlb_vma_end() is used, but vmas are not
> > removed, e.g. madvise(MADV_DONTNEED).
> > 
> > Fix this by moving the tlb flush out of tlb_end_vma() into
> > free_pgtables(), somewhat similar to the stable version of the
> > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush
> > for PFNMAP mappings before unlink_file_vma()").
> > 
> > Note, that if tlb->fullmm is set, no flush is required, as the whole
> > mm is about to be destroyed.
> > 
> > v2:
> > - moved vma_pfn flag handling into tlb.h (by Peter Z.)
> > - added comments (by Peter Z.)
> > - fixed the vma_pfn flag setting (by Hugh D.)

And in v3, that changelog should be after the ---, not in the commit.
On Wed, Jan 22, 2025 at 11:42:13PM -0800, Hugh Dickins wrote: > On Wed, 22 Jan 2025, Roman Gushchin wrote: > > > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") > > added a forced tlbflush to tlb_vma_end(), which is required to avoid a > > race between munmap() and unmap_mapping_range(). However it added some > > overhead to other paths where tlb_vma_end() is used, but vmas are not > > removed, e.g. madvise(MADV_DONTNEED). > > > > Fix this by moving the tlb flush out of tlb_end_vma() into > > free_pgtables(), somewhat similar to the stable version of the > > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush > > for PFNMAP mappings before unlink_file_vma()"). > > > > Note, that if tlb->fullmm is set, no flush is required, as the whole > > mm is about to be destroyed. > > > > v2: > > - moved vma_pfn flag handling into tlb.h (by Peter Z.) > > - added comments (by Peter Z.) > > - fixed the vma_pfn flag setting (by Hugh D.) > > > > Suggested-by: Jann Horn <jannh@google.com> > > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Will Deacon <will@kernel.org> > > Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Nick Piggin <npiggin@gmail.com> > > Cc: Hugh Dickins <hughd@google.com> > > Cc: linux-arch@vger.kernel.org > > Cc: linux-mm@kvack.org > > --- > > include/asm-generic/tlb.h | 41 ++++++++++++++++++++++++++------------- > > mm/memory.c | 7 +++++++ > > 2 files changed, 35 insertions(+), 13 deletions(-) > > > > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > > index 709830274b75..fbe31f49a5af 100644 > > --- a/include/asm-generic/tlb.h > > +++ b/include/asm-generic/tlb.h > > @@ -449,7 +449,14 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) > > */ > > tlb->vma_huge = is_vm_hugetlb_page(vma); > > tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); > > - tlb->vma_pfn = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > > + > > + /* > > + * vma_pfn is checked and cleared by tlb_flush_mmu_pfnmap() > > + * for a set of vma's, so it should be set if at least one vma > > + * has VM_PFNMAP or VM_MIXEDMAP flags set. > > + */ > > + if (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) > > + tlb->vma_pfn = 1; > > Okay, but struct mmu_gather is usually on the caller's stack, > containing junk initially, and there's nothing in this patch > yet to initialize tlb->vma_pfn to 0. > > __tlb_reset_range() needs to do that, doesn't it? With some adjustment > to its comment about not resetting mmu_gather::vma_* fields. Or would > it be better to get around that by renaming vma_pfn to, er, something > else - I'd have to understand the essence of Jann's race better to > come up with the right name - untracked_mappings? would that be right? > I still haven't grasped the essence of that race. Yeah, this is a really good point. > > (Panic attack: where is, for example, tlb->need_flush_all initialized > to 0? Ah, over in mm/mmu_gather.c, __tlb_gather_mmu(). Phew.) > > And if __tlb_reset_range() resets tlb->vma_pfn to 0, then that has the > side-effect that any TLB flush cancels the vma_pfn state: which is a > desirable side-effect, isn't it? avoiding the possibility of doing an > unnecessary extra TLB flush in free_pgtables(), as I criticized before. Good point. Just sent out v3 with your suggestions. Thank you for your help here! Roman
On Wed, Jan 22, 2025 at 11:27:16PM +0000, Roman Gushchin wrote: > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") > added a forced tlbflush to tlb_vma_end(), which is required to avoid a > race between munmap() and unmap_mapping_range(). However it added some > overhead to other paths where tlb_vma_end() is used, but vmas are not > removed, e.g. madvise(MADV_DONTNEED). > > Fix this by moving the tlb flush out of tlb_end_vma() into > free_pgtables(), somewhat similar to the stable version of the > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush > for PFNMAP mappings before unlink_file_vma()"). > diff --git a/mm/memory.c b/mm/memory.c > index 398c031be9ba..c2a9effb2e32 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > { > struct unlink_vma_file_batch vb; > > + /* > + * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here: > + * force flush TLBs for such ranges to avoid munmap() vs > + * unmap_mapping_range() races. > + */ > + tlb_flush_mmu_pfnmap(tlb); > + > do { > unsigned long addr = vma->vm_start; > struct vm_area_struct *next; So I'm not sure I muc like this name, it is fairly random and does very much not convey the reason we're calling this. Anyway, going back to reading the original commit (because this changelog isn't helping me much), the problem appears to be that unlinking the vma will make unmap_mapping_range() skip things (no vma, nothing to do, etc) return early and bad things happen. So am I right in thinking we need this tlb flush before all those unlink_{anon,file}_vma*() calls, and that the whole free_pgd_range() that goes and invalidates the page-tables is too late? So how about we do something like so instead? --- diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 709830274b75..481a24f2b839 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -58,6 +58,11 @@ * Defaults to flushing at tlb_end_vma() to reset the range; helps when * there's large holes between the VMAs. * + * - tlb_free_vma() + * + * tlb_free_vma() marks the start of unlinking the vma and freeing + * page-tables. + * * - tlb_remove_table() * * tlb_remove_table() is the basic primitive to free page-table directories @@ -384,7 +389,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) * Do not reset mmu_gather::vma_* fields here, we do not * call into tlb_start_vma() again to set them if there is an * intermediate flush. + * + * Except for vma_pfn, that only cares if there's pending TLBI. */ + tlb->vma_pfn = 0; } #ifdef CONFIG_MMU_GATHER_NO_RANGE @@ -449,7 +457,12 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) */ tlb->vma_huge = is_vm_hugetlb_page(vma); tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); - tlb->vma_pfn = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); + + /* + * Track if there's at least one VM_PFNMAP/VM_MIXEDMAP vma + * in the tracked range, see tlb_free_vma(). 
+ */ + tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); } static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) @@ -548,23 +561,39 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct * } static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) +{ + if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) + return; + + /* + * Do a TLB flush and reset the range at VMA boundaries; this avoids + * the ranges growing with the unused space between consecutive VMAs, + * but also the mmu_gather::vma_* flags from tlb_start_vma() rely on + * this. + */ + tlb_flush_mmu_tlbonly(tlb); +} + +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) { if (tlb->fullmm) return; /* * VM_PFNMAP is more fragile because the core mm will not track the - * page mapcount -- there might not be page-frames for these PFNs after - * all. Force flush TLBs for such ranges to avoid munmap() vs - * unmap_mapping_range() races. + * page mapcount -- there might not be page-frames for these PFNs + * after all. + * + * Specifically() there is a race between munmap() and + * unmap_mapping_range(), where munmap() will unlink the VMA, such + * that unmap_mapping_range() will no longer observe the VMA and + * no-op, without observing the TLBI, returning prematurely. + * + * So if we're about to unlink such a VMA, and we have pending + * TLBI for such a vma, flush things now. */ - if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) { - /* - * Do a TLB flush and reset the range at VMA boundaries; this avoids - * the ranges growing with the unused space between consecutive VMAs. - */ + if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn) tlb_flush_mmu_tlbonly(tlb); - } } /* diff --git a/mm/memory.c b/mm/memory.c index 398c031be9ba..41b037803fd9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -365,6 +365,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, { struct unlink_vma_file_batch vb; + tlb_free_vma(tlb, vma); + do { unsigned long addr = vma->vm_start; struct vm_area_struct *next;
On Thu, Jan 23, 2025 at 10:45:31PM +0100, Peter Zijlstra wrote: > On Wed, Jan 22, 2025 at 11:27:16PM +0000, Roman Gushchin wrote: > > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") > > added a forced tlbflush to tlb_vma_end(), which is required to avoid a > > race between munmap() and unmap_mapping_range(). However it added some > > overhead to other paths where tlb_vma_end() is used, but vmas are not > > removed, e.g. madvise(MADV_DONTNEED). > > > > Fix this by moving the tlb flush out of tlb_end_vma() into > > free_pgtables(), somewhat similar to the stable version of the > > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush > > for PFNMAP mappings before unlink_file_vma()"). > > > > diff --git a/mm/memory.c b/mm/memory.c > > index 398c031be9ba..c2a9effb2e32 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > > { > > struct unlink_vma_file_batch vb; > > > > + /* > > + * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here: > > + * force flush TLBs for such ranges to avoid munmap() vs > > + * unmap_mapping_range() races. > > + */ > > + tlb_flush_mmu_pfnmap(tlb); > > + > > do { > > unsigned long addr = vma->vm_start; > > struct vm_area_struct *next; > > So I'm not sure I muc like this name, it is fairly random and does very > much not convey the reason we're calling this. > > Anyway, going back to reading the original commit (because this > changelog isn't helping me much), the problem appears to be that > unlinking the vma will make unmap_mapping_range() skip things (no vma, > nothing to do, etc) return early and bad things happen. > > So am I right in thinking we need this tlb flush before all those > unlink_{anon,file}_vma*() calls, and that the whole free_pgd_range() > that goes and invalidates the page-tables is too late? > > So how about we do something like so instead? Overall looks good to me, except one question (below). > > --- > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 709830274b75..481a24f2b839 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -58,6 +58,11 @@ > * Defaults to flushing at tlb_end_vma() to reset the range; helps when > * there's large holes between the VMAs. > * > + * - tlb_free_vma() > + * > + * tlb_free_vma() marks the start of unlinking the vma and freeing > + * page-tables. > + * > * - tlb_remove_table() > * > * tlb_remove_table() is the basic primitive to free page-table directories > @@ -384,7 +389,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) > * Do not reset mmu_gather::vma_* fields here, we do not > * call into tlb_start_vma() again to set them if there is an > * intermediate flush. > + * > + * Except for vma_pfn, that only cares if there's pending TLBI. > */ > + tlb->vma_pfn = 0; > } > > #ifdef CONFIG_MMU_GATHER_NO_RANGE > @@ -449,7 +457,12 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) > */ > tlb->vma_huge = is_vm_hugetlb_page(vma); > tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); > - tlb->vma_pfn = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > + > + /* > + * Track if there's at least one VM_PFNMAP/VM_MIXEDMAP vma > + * in the tracked range, see tlb_free_vma(). 
> + */ > + tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > } > > static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > @@ -548,23 +561,39 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct * > } > > static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > +{ > + if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) > + return; > + > + /* > + * Do a TLB flush and reset the range at VMA boundaries; this avoids > + * the ranges growing with the unused space between consecutive VMAs, > + * but also the mmu_gather::vma_* flags from tlb_start_vma() rely on > + * this. > + */ > + tlb_flush_mmu_tlbonly(tlb); > +} > + > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > { > if (tlb->fullmm) > return; > > /* > * VM_PFNMAP is more fragile because the core mm will not track the > - * page mapcount -- there might not be page-frames for these PFNs after > - * all. Force flush TLBs for such ranges to avoid munmap() vs > - * unmap_mapping_range() races. > + * page mapcount -- there might not be page-frames for these PFNs > + * after all. > + * > + * Specifically() there is a race between munmap() and > + * unmap_mapping_range(), where munmap() will unlink the VMA, such > + * that unmap_mapping_range() will no longer observe the VMA and > + * no-op, without observing the TLBI, returning prematurely. > + * > + * So if we're about to unlink such a VMA, and we have pending > + * TLBI for such a vma, flush things now. > */ > - if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) { > - /* > - * Do a TLB flush and reset the range at VMA boundaries; this avoids > - * the ranges growing with the unused space between consecutive VMAs. > - */ > + if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn) > tlb_flush_mmu_tlbonly(tlb); > - } Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here? In free_pgtables() we're iterating over multiple vma's. What if the first has no VM_PFNMAP set, but some other do? Idk if it's even possible, but it's not obvious that it's not possible either. Btw, are you going to handle it yourself or you want me to send a v4 based on your version? Thank you!
On Thu, 23 Jan 2025, Roman Gushchin wrote: > On Thu, Jan 23, 2025 at 10:45:31PM +0100, Peter Zijlstra wrote: > > On Wed, Jan 22, 2025 at 11:27:16PM +0000, Roman Gushchin wrote: > > > Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") > > > added a forced tlbflush to tlb_vma_end(), which is required to avoid a > > > race between munmap() and unmap_mapping_range(). However it added some > > > overhead to other paths where tlb_vma_end() is used, but vmas are not > > > removed, e.g. madvise(MADV_DONTNEED). > > > > > > Fix this by moving the tlb flush out of tlb_end_vma() into > > > free_pgtables(), somewhat similar to the stable version of the > > > original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush > > > for PFNMAP mappings before unlink_file_vma()"). > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 398c031be9ba..c2a9effb2e32 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > > > { > > > struct unlink_vma_file_batch vb; > > > > > > + /* > > > + * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here: > > > + * force flush TLBs for such ranges to avoid munmap() vs > > > + * unmap_mapping_range() races. > > > + */ > > > + tlb_flush_mmu_pfnmap(tlb); > > > + > > > do { > > > unsigned long addr = vma->vm_start; > > > struct vm_area_struct *next; > > > > So I'm not sure I muc like this name, it is fairly random and does very > > much not convey the reason we're calling this. > > > > Anyway, going back to reading the original commit (because this > > changelog isn't helping me much), the problem appears to be that > > unlinking the vma will make unmap_mapping_range() skip things (no vma, > > nothing to do, etc) return early and bad things happen. The changelog of commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas") has not helped me either. Nor could I locate any discussion (Jann, Linus, Peter, Will?) that led up to it. I can see that a racing unmap_mapping_range() may complete before the munmap() has done its TLB flush, but (a) so what? and (b) how does VM_PFNMAP or page_mapcount affect that? and (c) did Linus's later delay_rmap changes make any difference to the story here. Jann, please spell it out for us: I think I'm not the only one who fails to understand the race in question. At present I'm hovering between thinking there was no bug to be fixed in the first place, or that a tlb_flush_mmu_tlbonly() has to be done before unlinking vmas in all cases (or in all file cases?). I'm probably wrong in both directions, but cannot see it without help. > > > > So am I right in thinking we need this tlb flush before all those > > unlink_{anon,file}_vma*() calls, and that the whole free_pgd_range() > > that goes and invalidates the page-tables is too late? > > > > So how about we do something like so instead? > > Overall looks good to me, except one question (below). To me, Peter's patch looks much like yours, except wth different names and comments, plus the "vma" error you point out below. > > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > > { > > if (tlb->fullmm) > > return; > > > > /* > > * VM_PFNMAP is more fragile because the core mm will not track the > > - * page mapcount -- there might not be page-frames for these PFNs after > > - * all. Force flush TLBs for such ranges to avoid munmap() vs > > - * unmap_mapping_range() races. 
> > + * page mapcount -- there might not be page-frames for these PFNs > > + * after all. > > + * > > + * Specifically() there is a race between munmap() and > > + * unmap_mapping_range(), where munmap() will unlink the VMA, such > > + * that unmap_mapping_range() will no longer observe the VMA and > > + * no-op, without observing the TLBI, returning prematurely. > > + * > > + * So if we're about to unlink such a VMA, and we have pending > > + * TLBI for such a vma, flush things now. > > */ > > - if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) { > > - /* > > - * Do a TLB flush and reset the range at VMA boundaries; this avoids > > - * the ranges growing with the unused space between consecutive VMAs. > > - */ > > + if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn) > > tlb_flush_mmu_tlbonly(tlb); > > - } > > Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here? > > In free_pgtables() we're iterating over multiple vma's. What if the first has > no VM_PFNMAP set, but some other do? Idk if it's even possible, but it's not > obvious that it's not possible either. Yes, of course that is possible: the "vma" to tlb_free_vma() is a mistake (hmm, and there's no "free" in it either). Hugh
On Thu, Jan 23, 2025 at 11:12:33PM +0000, Roman Gushchin wrote:

> > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
> > {
> > 	if (tlb->fullmm)
> > 		return;
> > 
> > 	/*
> > 	 * VM_PFNMAP is more fragile because the core mm will not track the
> > +	 * page mapcount -- there might not be page-frames for these PFNs
> > +	 * after all.
> > +	 *
> > +	 * Specifically() there is a race between munmap() and
> > +	 * unmap_mapping_range(), where munmap() will unlink the VMA, such
> > +	 * that unmap_mapping_range() will no longer observe the VMA and
> > +	 * no-op, without observing the TLBI, returning prematurely.
> > +	 *
> > +	 * So if we're about to unlink such a VMA, and we have pending
> > +	 * TLBI for such a vma, flush things now.
> > 	 */
> > +	if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn)
> > 		tlb_flush_mmu_tlbonly(tlb);
> 
> Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here?

No need, but an opportunity.

> In free_pgtables() we're iterating over multiple vma's. What if the first has
> no VM_PFNMAP set, but some other do? Idk if it's even possible, but it's not
> obvious that it's not possible either.

If we only need to flush PFN entries before unlinking PFN VMAs, then:

 - if there are no PFNs pending (vma_pfn), we don't need to flush;
 - if no PFN vma is being freed (vm_flags), we don't need to flush.

Notably, if an earlier flush has already issued the TLBI, there is no need
to issue one again, but also, if we end up not actually freeing the PFN
vma, we also don't care.
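Expressed as a predicate, the two bullets above (plus the existing fullmm early-out) combine into something like the following. tlb_needs_pfn_flush() is a hypothetical helper name used purely for illustration; it is not part of either posted patch:

```c
/*
 * Illustrative only: flush iff a PFNMAP/MIXEDMAP unmap is still pending
 * in this gather (tlb->vma_pfn) AND the vma about to be unlinked is
 * itself PFNMAP/MIXEDMAP. If either is false, any relevant TLB entries
 * were already flushed or cannot belong to this vma.
 */
static inline bool tlb_needs_pfn_flush(struct mmu_gather *tlb,
				       struct vm_area_struct *vma)
{
	return !tlb->fullmm && tlb->vma_pfn &&
	       (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP));
}
```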
On Thu, Jan 23, 2025 at 08:42:36PM -0800, Hugh Dickins wrote:

> The changelog of commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush
> VM_PFNMAP vmas") has not helped me either. Nor could I locate any
> discussion (Jann, Linus, Peter, Will?) that led up to it.

Hmm, that was probably on security -- I should have those mails around
somewhere, I'll see if I can dig them up.

> To me, Peter's patch looks much like yours, except wth different
> names and comments, plus the "vma" error you point out below.

Yes, 3 differences:

 - naming;
 - the extra check;
 - the vma_pfn clearing condition.

Under the assumption that this is all about those PFNs, the argument
(as also outlined in the email to Roman just now) is that you only need
to flush if both: you have pending TLBI for PFN and are indeed about to
unlink a PFN vma.

If we've flushed the relevant PFNs earlier, for whatever reason,
batching, or the arch has !MERGE_VMAS or whatever, then we do not need
to flush again. So clearing vma_pfn in __tlb_reset_range() is the right
place.

Similarly, if we don't ever actually free/unlink the PFN vma, we also
don't care.
On Fri, Jan 24, 2025 at 09:31:39AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2025 at 08:42:36PM -0800, Hugh Dickins wrote:
> > The changelog of commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush
> > VM_PFNMAP vmas") has not helped me either. Nor could I locate any
> > discussion (Jann, Linus, Peter, Will?) that led up to it.
> 
> Hmm, that was probably on security -- I should have those mails around
> somewhere, I'll see if I can dig them up.

Hugh, I've bounced you a copy of Jann's original report on the issue.

Subject: unmap_mapping_range() race with munmap() on VM_PFNMAP mappings leads to stale TLB entry
On Fri, Jan 24, 2025 at 09:22:50AM +0100, Peter Zijlstra wrote: > On Thu, Jan 23, 2025 at 11:12:33PM +0000, Roman Gushchin wrote: > > > > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > > > { > > > if (tlb->fullmm) > > > return; > > > > > > /* > > > * VM_PFNMAP is more fragile because the core mm will not track the > > > + * page mapcount -- there might not be page-frames for these PFNs > > > + * after all. > > > + * > > > + * Specifically() there is a race between munmap() and > > > + * unmap_mapping_range(), where munmap() will unlink the VMA, such > > > + * that unmap_mapping_range() will no longer observe the VMA and > > > + * no-op, without observing the TLBI, returning prematurely. > > > + * > > > + * So if we're about to unlink such a VMA, and we have pending > > > + * TLBI for such a vma, flush things now. > > > */ > > > + if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn) > > > tlb_flush_mmu_tlbonly(tlb); > > > > Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here? > > No need, but an opportunity. > > > In free_pgtables() we're iterating over multiple vma's. What if the first has > > no VM_PFNMAP set, but some other do? Idk if it's even possible, but it's not > > obvious that it's not possible either. > > If we only need to flush PFN entries before unlinking PFN VMAs, then: > > - if there are no PFNs pending (vma_pfn), we don't need to flush; > - if no PFN vma is being freed (vm_flags), we don't need to flush. Right, but if I understand the code correctly, more than one vma can be freed by a single free_pgtables() invocation. Should we then check each vma's flags in the while loop in free_pgtables()? But then we're back to where we're now with multiple flushes. Do I misunderstand this? Thanks
On Fri, 24 Jan 2025, Peter Zijlstra wrote:
> On Thu, Jan 23, 2025 at 08:42:36PM -0800, Hugh Dickins wrote:
> > The changelog of commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush
> > VM_PFNMAP vmas") has not helped me either. Nor could I locate any
> > discussion (Jann, Linus, Peter, Will?) that led up to it.
> 
> Hmm, that was probably on security -- I should have those mails around
> somewhere, I'll see if I can dig them up.

That was very helpful, thank you: I've gone through a lot of confusion,
but feeling more confident about it all today.

> > To me, Peter's patch looks much like yours, except wth different
> > names and comments, plus the "vma" error you point out below.
> 
> Yes, 3 differences:
> 
>  - naming;
>  - the extra check;
>  - the vma_pfn clearing condition.
> 
> Under the assumption that this is all about those PFNs, the argument
> (as also outlined in the email to Roman just now) is that you only need
> to flush if both: you have pending TLBI for PFN and are indeed about to
> unlink a PFN vma.
> 
> If we've flushed the relevant PFNs earlier, for whatever reason,
> batching, or the arch has !MERGE_VMAS or whatever, then we do not need
> to flush again. So clearing vma_pfn in __tlb_reset_range() is the right
> place.

Yes, Roman moved to clearing vma_pfn in __tlb_reset_range() in his v3:
we are all in agreement on that.

> Similarly, if we don't ever actually free/unlink the PFN vma, we also
> don't care.

I cannot think of a case in which we arrive at free_pgtables(), but do not
unlink the vma(s) which caused vma_pfn to be set. If there is such a case,
it's not worth optimizing for; and wrong to check just the first vma in the
list (don't look only at the stable commit 895428ee124a which Roman cited -
it had to be fixed by 891f03f688de afterwards).

Personally, I prefer code inline in free_pgtables() which shows what's
going on, as Roman did in v1, rather than struggling to devise a
self-explanatory function name for something over there in tlb.h.

But I may be in a minority on that, and his tlb_flush_mmu_pfnmap()
is much more to the point than tlb_free_vma().

Hugh
On Sun, Jan 26, 2025 at 06:34:48PM -0800, Hugh Dickins wrote:
> > If we've flushed the relevant PFNs earlier, for whatever reason,
> > batching, or the arch has !MERGE_VMAS or whatever, then we do not need
> > to flush again. So clearing vma_pfn in __tlb_reset_range() is the right
> > place.
> 
> Yes, Roman moved to clearing vma_pfn in __tlb_reset_range() in his v3:
> we are all in agreement on that.

Ah, I had not seen v3 yet.

> > Similarly, if we don't ever actually free/unlink the PFN vma, we also
> > don't care.
> 
> I cannot think of a case in which we arrive at free_pgtables(), but do not
> unlink the vma(s) which caused vma_pfn to be set. If there is such a case,
> it's not worth optimizing for;

Yeah, I suppose it doesn't happen. But I figured why assume stuff.

> and wrong to check just the first vma in the
> list (don't look only at the stable commit 895428ee124a which Roman cited -
> it had to be fixed by 891f03f688de afterwards).

Duh, yeah, so tlb_free_vma() wants to be inside the vma loop of
free_pgtables().

> Personally, I prefer code inline in free_pgtables() which shows what's
> going on, as Roman did in v1, rather than struggling to devise a
> self-explanatory function name for something over there in tlb.h.
> 
> But I may be in a minority on that, and his tlb_flush_mmu_pfnmap()
> is much more to the point than tlb_free_vma().

I prefer a function over the in-line thing such that mmu-gather is more
or less self-contained. So my concern is/was maintainability of all this;
tlb_flush_mmu_pfnmap() tells me nothing about when this function should
be called. Otoh tlb_free_vma() tells me this should be called when we're
freeing VMAs -- much harder to misplace etc. If we have hooks placed at,
and named after, natural events in the lifetime of things, placement is
'obvious'.

Another possible name might be tlb_free_pgtables(), indicating we're
about to start freeing pagetables -- but it would need to assert
!tlb->freed_tables, and I'm not sure this is a constraint worth imposing.
It would bring pain if someone wanted to mix freeing pages and
page-tables. And we already have vma-based hooks, so I much prefer adding
one more of those. This is about funky VMAs after all.
On Sat, Jan 25, 2025 at 01:23:54AM +0000, Roman Gushchin wrote:
> On Fri, Jan 24, 2025 at 09:22:50AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2025 at 11:12:33PM +0000, Roman Gushchin wrote:
> > 
> > > > +static inline void tlb_free_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
> > > > {
> > > > 	if (tlb->fullmm)
> > > > 		return;
> > > > 
> > > > 	/*
> > > > 	 * VM_PFNMAP is more fragile because the core mm will not track the
> > > > +	 * page mapcount -- there might not be page-frames for these PFNs
> > > > +	 * after all.
> > > > +	 *
> > > > +	 * Specifically() there is a race between munmap() and
> > > > +	 * unmap_mapping_range(), where munmap() will unlink the VMA, such
> > > > +	 * that unmap_mapping_range() will no longer observe the VMA and
> > > > +	 * no-op, without observing the TLBI, returning prematurely.
> > > > +	 *
> > > > +	 * So if we're about to unlink such a VMA, and we have pending
> > > > +	 * TLBI for such a vma, flush things now.
> > > > 	 */
> > > > +	if ((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && tlb->vma_pfn)
> > > > 		tlb_flush_mmu_tlbonly(tlb);
> > > 
> > > Why do we need to re-check vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP) here?
> > 
> > No need, but an opportunity.
> > 
> > > In free_pgtables() we're iterating over multiple vma's. What if the first has
> > > no VM_PFNMAP set, but some other do? Idk if it's even possible, but it's not
> > > obvious that it's not possible either.
> > 
> > If we only need to flush PFN entries before unlinking PFN VMAs, then:
> > 
> >  - if there are no PFNs pending (vma_pfn), we don't need to flush;
> >  - if no PFN vma is being freed (vm_flags), we don't need to flush.
> 
> Right, but if I understand the code correctly, more than one vma can be
> freed by a single free_pgtables() invocation. Should we then check
> each vma's flags in the while loop in free_pgtables()? But then
> we're back to where we're now with multiple flushes.

Right, I misplaced it -- it should be in the vma loop.

> Do I misunderstand this?

I'm not sure how this would cause more flushes; notably it will not
cause flushes where no page-tables are dropped, eg. MADV, which was why
you started all this IIUC.
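Taking that correction into account, the hook would sit inside the per-vma loop, roughly as sketched below. The sketch assumes Peter's tlb_free_vma() naming, heavily abbreviates the real free_pgtables() body (unlinking, hugetlb handling, free_pgd_range() and batch handling are reduced to comments), and the parameter list beyond what the quoted hunk shows is reconstructed, not quoted:

```c
void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
		   struct vm_area_struct *vma, unsigned long floor,
		   unsigned long ceiling, bool mm_wr_locked)
{
	do {
		struct vm_area_struct *next = mas_find(mas, ceiling - 1);

		/*
		 * Flush any pending TLBI covering PFNMAP/MIXEDMAP ranges
		 * before this vma is unlinked from the rmap, so that a
		 * concurrent unmap_mapping_range() cannot return while
		 * stale TLB entries for already-freed page-frames remain.
		 */
		tlb_free_vma(tlb, vma);

		/* ... unlink_anon_vmas() / unlink_file_vma_batch_add() ... */
		/* ... free_pgd_range() for the covered page-table range ... */

		vma = next;
	} while (vma);
}
```

Placed here, the flush is still skipped entirely on paths such as madvise(MADV_DONTNEED) that never reach free_pgtables(), which is the overhead the series set out to remove.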
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 709830274b75..fbe31f49a5af 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -449,7 +449,14 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma)
 	 */
 	tlb->vma_huge = is_vm_hugetlb_page(vma);
 	tlb->vma_exec = !!(vma->vm_flags & VM_EXEC);
-	tlb->vma_pfn = !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP));
+
+	/*
+	 * vma_pfn is checked and cleared by tlb_flush_mmu_pfnmap()
+	 * for a set of vma's, so it should be set if at least one vma
+	 * has VM_PFNMAP or VM_MIXEDMAP flags set.
+	 */
+	if (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+		tlb->vma_pfn = 1;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -466,6 +473,22 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	__tlb_reset_range(tlb);
 }
 
+static inline void tlb_flush_mmu_pfnmap(struct mmu_gather *tlb)
+{
+	/*
+	 * VM_PFNMAP and VM_MIXEDMAP maps are fragile because the core mm
+	 * doesn't track the page mapcount -- there might not be page-frames
+	 * for these PFNs after all. Force flush TLBs for such ranges to avoid
+	 * munmap() vs unmap_mapping_range() races.
+	 * Ensure we have no stale TLB entries by the time this mapping is
+	 * removed from the rmap.
+	 */
+	if (unlikely(!tlb->fullmm && tlb->vma_pfn)) {
+		tlb_flush_mmu_tlbonly(tlb);
+		tlb->vma_pfn = 0;
+	}
+}
+
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
@@ -549,22 +572,14 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	if (tlb->fullmm)
+	if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS))
 		return;
 
 	/*
-	 * VM_PFNMAP is more fragile because the core mm will not track the
-	 * page mapcount -- there might not be page-frames for these PFNs after
-	 * all. Force flush TLBs for such ranges to avoid munmap() vs
-	 * unmap_mapping_range() races.
+	 * Do a TLB flush and reset the range at VMA boundaries; this avoids
+	 * the ranges growing with the unused space between consecutive VMAs.
 	 */
-	if (tlb->vma_pfn || !IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS)) {
-		/*
-		 * Do a TLB flush and reset the range at VMA boundaries; this avoids
-		 * the ranges growing with the unused space between consecutive VMAs.
-		 */
-		tlb_flush_mmu_tlbonly(tlb);
-	}
+	tlb_flush_mmu_tlbonly(tlb);
 }
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 398c031be9ba..c2a9effb2e32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -365,6 +365,13 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 {
 	struct unlink_vma_file_batch vb;
 
+	/*
+	 * VM_PFNMAP and VM_MIXEDMAP maps require a special handling here:
+	 * force flush TLBs for such ranges to avoid munmap() vs
+	 * unmap_mapping_range() races.
+	 */
+	tlb_flush_mmu_pfnmap(tlb);
+
 	do {
 		unsigned long addr = vma->vm_start;
 		struct vm_area_struct *next;
Commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas")
added a forced tlbflush to tlb_end_vma(), which is required to avoid a
race between munmap() and unmap_mapping_range(). However, it added some
overhead to other paths where tlb_end_vma() is used, but vmas are not
removed, e.g. madvise(MADV_DONTNEED).

Fix this by moving the tlb flush out of tlb_end_vma() into
free_pgtables(), somewhat similar to the stable version of the
original commit: e.g. stable commit 895428ee124a ("mm: Force TLB flush
for PFNMAP mappings before unlink_file_vma()").

Note that if tlb->fullmm is set, no flush is required, as the whole
mm is about to be destroyed.

v2:
- moved vma_pfn flag handling into tlb.h (by Peter Z.)
- added comments (by Peter Z.)
- fixed the vma_pfn flag setting (by Hugh D.)

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/asm-generic/tlb.h | 41 ++++++++++++++++++++++++++-------------
 mm/memory.c               |  7 +++++++
 2 files changed, 35 insertions(+), 13 deletions(-)