Message ID | 1557264889-109594-1-git-send-email-yang.shi@linux.alibaba.com |
---|---|
State | New, archived |
Series | mm: mmu_gather: remove __tlb_reset_range() for force flush |
Hi all, [+Peter] Apologies for the delay; I'm attending a conference this week so it's tricky to keep up with email. On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote: > A few new fields were added to mmu_gather to make TLB flush smarter for > huge page by telling what level of page table is changed. > > __tlb_reset_range() is used to reset all these page table state to > unchanged, which is called by TLB flush for parallel mapping changes for > the same range under non-exclusive lock (i.e. read mmap_sem). Before > commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in > munmap"), MADV_DONTNEED is the only one who may do page zapping in > parallel and it doesn't remove page tables. But, the forementioned commit > may do munmap() under read mmap_sem and free page tables. This causes a > bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the > wrong page table state to architecture specific TLB flush operations. Yikes. Is it actually safe to run free_pgtables() concurrently for a given mm? > So, removing __tlb_reset_range() sounds sane. This may cause more TLB > flush for MADV_DONTNEED, but it should be not called very often, hence > the impact should be negligible. > > The original proposed fix came from Jan Stancek who mainly debugged this > issue, I just wrapped up everything together. I'm still paging the nested flush logic back in, but I have some comments on the patch below. > [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f > > Reported-by: Jan Stancek <jstancek@redhat.com> > Tested-by: Jan Stancek <jstancek@redhat.com> > Cc: Will Deacon <will.deacon@arm.com> > Cc: stable@vger.kernel.org > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > Signed-off-by: Jan Stancek <jstancek@redhat.com> > --- > mm/mmu_gather.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 99740e1..9fd5272 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > * flush by batching, a thread has stable TLB entry can fail to flush Urgh, we should rewrite this comment while we're here so that it makes sense... > * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > * forcefully if we detect parallel PTE batching threads. > + * > + * munmap() may change mapping under non-excluse lock and also free > + * page tables. Do not call __tlb_reset_range() for it. > */ > - if (mm_tlb_flush_nested(tlb->mm)) { > - __tlb_reset_range(tlb); > + if (mm_tlb_flush_nested(tlb->mm)) > __tlb_adjust_range(tlb, start, end - start); > - } I don't think we can elide the call __tlb_reset_range() entirely, since I think we do want to clear the freed_pXX bits to ensure that we walk the range with the smallest mapping granule that we have. Otherwise couldn't we have a problem if we hit a PMD that had been cleared, but the TLB invalidation for the PTEs that used to be linked below it was still pending? Perhaps we should just set fullmm if we see that here's a concurrent unmapper rather than do a worst-case range invalidation. Do you have a feeling for often the mm_tlb_flush_nested() triggers in practice? Will
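[Editor's note: a minimal sketch of the alternative Will floats at the end of his reply (treat a detected concurrent unmapper as a full-mm flush instead of keeping the per-level hints accurate). This is not the fix that was applied; it only reuses the existing struct mmu_gather fields as of v5.1.]

```c
/*
 * Sketch only, not a tested patch: on detecting a parallel PTE batching
 * thread, give up on the per-level accounting and ask the architecture
 * for the widest invalidation it knows, via the existing fullmm flag.
 */
if (mm_tlb_flush_nested(tlb->mm)) {
	/*
	 * Another thread may have freed page tables in this range under
	 * the read-side mmap_sem, so our local freed_tables/cleared_*
	 * hints can be stale.
	 */
	tlb->fullmm = 1;
	__tlb_adjust_range(tlb, start, end - start);
}
```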
On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote: > Hi all, [+Peter] Right, mm/mmu_gather.c has a MAINTAINERS entry; use it. Also added Nadav and Minchan who've poked at this issue before. And Mel, because he loves these things :-) > Apologies for the delay; I'm attending a conference this week so it's tricky > to keep up with email. > > On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote: > > A few new fields were added to mmu_gather to make TLB flush smarter for > > huge page by telling what level of page table is changed. > > > > __tlb_reset_range() is used to reset all these page table state to > > unchanged, which is called by TLB flush for parallel mapping changes for > > the same range under non-exclusive lock (i.e. read mmap_sem). Before > > commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in > > munmap"), MADV_DONTNEED is the only one who may do page zapping in > > parallel and it doesn't remove page tables. But, the forementioned commit > > may do munmap() under read mmap_sem and free page tables. This causes a > > bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the Please don't _EVER_ refer to external sources to describe the actual bug a patch is fixing. That is the primary purpose of the Changelog. Worse, the email you reference does _NOT_ describe the actual problem. Nor do you. > > wrong page table state to architecture specific TLB flush operations. > > Yikes. Is it actually safe to run free_pgtables() concurrently for a given > mm? Yeah.. sorta.. it's been a source of 'interesting' things. This really isn't the first issue here. Also, change_protection_range() is 'fun' too. > > So, removing __tlb_reset_range() sounds sane. This may cause more TLB > > flush for MADV_DONTNEED, but it should be not called very often, hence > > the impact should be negligible. > > > > The original proposed fix came from Jan Stancek who mainly debugged this > > issue, I just wrapped up everything together. > > I'm still paging the nested flush logic back in, but I have some comments on > the patch below. > > > [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f > > > > Reported-by: Jan Stancek <jstancek@redhat.com> > > Tested-by: Jan Stancek <jstancek@redhat.com> > > Cc: Will Deacon <will.deacon@arm.com> > > Cc: stable@vger.kernel.org > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > > Signed-off-by: Jan Stancek <jstancek@redhat.com> > > --- > > mm/mmu_gather.c | 7 ++++--- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > > index 99740e1..9fd5272 100644 > > --- a/mm/mmu_gather.c > > +++ b/mm/mmu_gather.c > > @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > > * flush by batching, a thread has stable TLB entry can fail to flush > > Urgh, we should rewrite this comment while we're here so that it makes sense... Yeah, that's atrocious. We should put the actual race in there. > > * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > > * forcefully if we detect parallel PTE batching threads. > > + * > > + * munmap() may change mapping under non-excluse lock and also free > > + * page tables. Do not call __tlb_reset_range() for it. 
> > */ > > - if (mm_tlb_flush_nested(tlb->mm)) { > > - __tlb_reset_range(tlb); > > + if (mm_tlb_flush_nested(tlb->mm)) > > __tlb_adjust_range(tlb, start, end - start); > > - } > > I don't think we can elide the call __tlb_reset_range() entirely, since I > think we do want to clear the freed_pXX bits to ensure that we walk the > range with the smallest mapping granule that we have. Otherwise couldn't we > have a problem if we hit a PMD that had been cleared, but the TLB > invalidation for the PTEs that used to be linked below it was still pending? That's tlb->cleared_p*, and yes agreed. That is, right until some architecture has level dependent TLBI instructions, at which point we'll need to have them all set instead of cleared. > Perhaps we should just set fullmm if we see that here's a concurrent > unmapper rather than do a worst-case range invalidation. Do you have a feeling > for often the mm_tlb_flush_nested() triggers in practice? Quite a bit for certain workloads I imagine, that was the whole point of doing it. Anyway; am I correct in understanding that the actual problem is that we've cleared freed_tables and the ARM64 tlb_flush() will then not invalidate the cache and badness happens? Because so far nobody has actually provided a coherent description of the actual problem we're trying to solve. But I'm thinking something like the below ought to do. diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 99740e1dd273..fe768f8d612e 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) { /* - * If there are parallel threads are doing PTE changes on same range - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB - * flush by batching, a thread has stable TLB entry can fail to flush - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB - * forcefully if we detect parallel PTE batching threads. + * Sensible comment goes here.. */ - if (mm_tlb_flush_nested(tlb->mm)) { - __tlb_reset_range(tlb); - __tlb_adjust_range(tlb, start, end - start); + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { + /* + * Since we're can't tell what we actually should have + * flushed flush everything in the given range. + */ + tlb->start = start; + tlb->end = end; + tlb->freed_tables = 1; + tlb->cleared_ptes = 1; + tlb->cleared_pmds = 1; + tlb->cleared_puds = 1; + tlb->cleared_p4ds = 1; } tlb_flush_mmu(tlb);
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote: > That's tlb->cleared_p*, and yes agreed. That is, right until some > architecture has level dependent TLBI instructions, at which point we'll > need to have them all set instead of cleared. > Anyway; am I correct in understanding that the actual problem is that > we've cleared freed_tables and the ARM64 tlb_flush() will then not > invalidate the cache and badness happens? > > Because so far nobody has actually provided a coherent description of > the actual problem we're trying to solve. But I'm thinking something > like the below ought to do. There's another 'fun' issue I think. For architectures like ARM that have range invalidation and care about VM_EXEC for I$ invalidation, the below doesn't quite work right either. I suspect we also have to force: tlb->vma_exec = 1. And I don't think there's an architecture that cares, but depending on details I can construct cases where any setting of tlb->vm_hugetlb is wrong, so that is _awesome_. But I suspect the sane thing for now is to force it 0. > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 99740e1dd273..fe768f8d612e 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > unsigned long start, unsigned long end) > { > /* > - * If there are parallel threads are doing PTE changes on same range > - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > - * flush by batching, a thread has stable TLB entry can fail to flush > - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > - * forcefully if we detect parallel PTE batching threads. > + * Sensible comment goes here.. > */ > - if (mm_tlb_flush_nested(tlb->mm)) { > - __tlb_reset_range(tlb); > - __tlb_adjust_range(tlb, start, end - start); > + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > + /* > + * Since we're can't tell what we actually should have > + * flushed flush everything in the given range. > + */ > + tlb->start = start; > + tlb->end = end; > + tlb->freed_tables = 1; > + tlb->cleared_ptes = 1; > + tlb->cleared_pmds = 1; > + tlb->cleared_puds = 1; > + tlb->cleared_p4ds = 1; > } > > tlb_flush_mmu(tlb);
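[Editor's note: a sketch of the extra forcing Peter describes for range-invalidating architectures that consult VM_EXEC/VM_HUGETLB (e.g. arm64). It assumes the tlb_finish_mmu() hunk from the diff above and uses the vma_exec/vma_huge bits that struct mmu_gather already carries; it is not a tested patch.]

```c
/*
 * Sketch, on top of the hunk above: also pessimise the per-VMA hints so
 * a range-based arch flush cannot skip I$ maintenance, and never assumes
 * a hugetlb-only range.
 */
if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
	/* ... start/end and freed_tables/cleared_* forced as in the diff ... */
	tlb->vma_exec = 1;	/* the racing unmap may have covered exec pages */
	tlb->vma_huge = 0;	/* and may not have been a hugetlb mapping */
}
```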
> > I don't think we can elide the call __tlb_reset_range() entirely, since I > > think we do want to clear the freed_pXX bits to ensure that we walk the > > range with the smallest mapping granule that we have. Otherwise couldn't we > > have a problem if we hit a PMD that had been cleared, but the TLB > > invalidation for the PTEs that used to be linked below it was still > > pending? > > That's tlb->cleared_p*, and yes agreed. That is, right until some > architecture has level dependent TLBI instructions, at which point we'll > need to have them all set instead of cleared. > > > Perhaps we should just set fullmm if we see that here's a concurrent > > unmapper rather than do a worst-case range invalidation. Do you have a > > feeling > > for often the mm_tlb_flush_nested() triggers in practice? > > Quite a bit for certain workloads I imagine, that was the whole point of > doing it. > > Anyway; am I correct in understanding that the actual problem is that > we've cleared freed_tables and the ARM64 tlb_flush() will then not > invalidate the cache and badness happens? That is my understanding, only last level is flushed, which is not enough. > > Because so far nobody has actually provided a coherent description of > the actual problem we're trying to solve. But I'm thinking something > like the below ought to do. I applied it (and fixed small typo: s/tlb->full_mm/tlb->fullmm/). It fixes the problem for me. > > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 99740e1dd273..fe768f8d612e 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > unsigned long start, unsigned long end) > { > /* > - * If there are parallel threads are doing PTE changes on same range > - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > - * flush by batching, a thread has stable TLB entry can fail to flush > - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > - * forcefully if we detect parallel PTE batching threads. > + * Sensible comment goes here.. > */ > - if (mm_tlb_flush_nested(tlb->mm)) { > - __tlb_reset_range(tlb); > - __tlb_adjust_range(tlb, start, end - start); > + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > + /* > + * Since we're can't tell what we actually should have > + * flushed flush everything in the given range. > + */ > + tlb->start = start; > + tlb->end = end; > + tlb->freed_tables = 1; > + tlb->cleared_ptes = 1; > + tlb->cleared_pmds = 1; > + tlb->cleared_puds = 1; > + tlb->cleared_p4ds = 1; > } > > tlb_flush_mmu(tlb); >
> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote: >> Hi all, [+Peter] > > Right, mm/mmu_gather.c has a MAINTAINERS entry; use it. > > Also added Nadav and Minchan who've poked at this issue before. And Mel, > because he loves these things :-) > >> Apologies for the delay; I'm attending a conference this week so it's tricky >> to keep up with email. >> >> On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote: >>> A few new fields were added to mmu_gather to make TLB flush smarter for >>> huge page by telling what level of page table is changed. >>> >>> __tlb_reset_range() is used to reset all these page table state to >>> unchanged, which is called by TLB flush for parallel mapping changes for >>> the same range under non-exclusive lock (i.e. read mmap_sem). Before >>> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in >>> munmap"), MADV_DONTNEED is the only one who may do page zapping in >>> parallel and it doesn't remove page tables. But, the forementioned commit >>> may do munmap() under read mmap_sem and free page tables. This causes a >>> bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the > > Please don't _EVER_ refer to external sources to describe the actual bug > a patch is fixing. That is the primary purpose of the Changelog. > > Worse, the email you reference does _NOT_ describe the actual problem. > Nor do you. > >>> wrong page table state to architecture specific TLB flush operations. >> >> Yikes. Is it actually safe to run free_pgtables() concurrently for a given >> mm? > > Yeah.. sorta.. it's been a source of 'interesting' things. This really > isn't the first issue here. > > Also, change_protection_range() is 'fun' too. > >>> So, removing __tlb_reset_range() sounds sane. This may cause more TLB >>> flush for MADV_DONTNEED, but it should be not called very often, hence >>> the impact should be negligible. >>> >>> The original proposed fix came from Jan Stancek who mainly debugged this >>> issue, I just wrapped up everything together. >> >> I'm still paging the nested flush logic back in, but I have some comments on >> the patch below. >> >>> [1] https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-mm%2F342bf1fd-f1bf-ed62-1127-e911b5032274%40linux.alibaba.com%2FT%2F%23m7a2ab6c878d5a256560650e56189cfae4e73217f&data=02%7C01%7Cnamit%40vmware.com%7C7be2f2b29b654aba7de308d6d46a7b93%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636929951176903247&sdata=gGptCMeb9vW4jXUnG53amgvrv8TB9F52JYBHmPeHFvs%3D&reserved=0 >>> >>> Reported-by: Jan Stancek <jstancek@redhat.com> >>> Tested-by: Jan Stancek <jstancek@redhat.com> >>> Cc: Will Deacon <will.deacon@arm.com> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >>> Signed-off-by: Jan Stancek <jstancek@redhat.com> >>> --- >>> mm/mmu_gather.c | 7 ++++--- >>> 1 file changed, 4 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c >>> index 99740e1..9fd5272 100644 >>> --- a/mm/mmu_gather.c >>> +++ b/mm/mmu_gather.c >>> @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb, >>> * flush by batching, a thread has stable TLB entry can fail to flush >> >> Urgh, we should rewrite this comment while we're here so that it makes sense... > > Yeah, that's atrocious. We should put the actual race in there. 
> >>> * the TLB by observing pte_none|!pte_dirty, for example so flush TLB >>> * forcefully if we detect parallel PTE batching threads. >>> + * >>> + * munmap() may change mapping under non-excluse lock and also free >>> + * page tables. Do not call __tlb_reset_range() for it. >>> */ >>> - if (mm_tlb_flush_nested(tlb->mm)) { >>> - __tlb_reset_range(tlb); >>> + if (mm_tlb_flush_nested(tlb->mm)) >>> __tlb_adjust_range(tlb, start, end - start); >>> - } >> >> I don't think we can elide the call __tlb_reset_range() entirely, since I >> think we do want to clear the freed_pXX bits to ensure that we walk the >> range with the smallest mapping granule that we have. Otherwise couldn't we >> have a problem if we hit a PMD that had been cleared, but the TLB >> invalidation for the PTEs that used to be linked below it was still pending? > > That's tlb->cleared_p*, and yes agreed. That is, right until some > architecture has level dependent TLBI instructions, at which point we'll > need to have them all set instead of cleared. > >> Perhaps we should just set fullmm if we see that here's a concurrent >> unmapper rather than do a worst-case range invalidation. Do you have a feeling >> for often the mm_tlb_flush_nested() triggers in practice? > > Quite a bit for certain workloads I imagine, that was the whole point of > doing it. > > Anyway; am I correct in understanding that the actual problem is that > we've cleared freed_tables and the ARM64 tlb_flush() will then not > invalidate the cache and badness happens? > > Because so far nobody has actually provided a coherent description of > the actual problem we're trying to solve. But I'm thinking something > like the below ought to do. > > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 99740e1dd273..fe768f8d612e 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > unsigned long start, unsigned long end) > { > /* > - * If there are parallel threads are doing PTE changes on same range > - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > - * flush by batching, a thread has stable TLB entry can fail to flush > - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > - * forcefully if we detect parallel PTE batching threads. > + * Sensible comment goes here.. > */ > - if (mm_tlb_flush_nested(tlb->mm)) { > - __tlb_reset_range(tlb); > - __tlb_adjust_range(tlb, start, end - start); > + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > + /* > + * Since we're can't tell what we actually should have > + * flushed flush everything in the given range. > + */ > + tlb->start = start; > + tlb->end = end; > + tlb->freed_tables = 1; > + tlb->cleared_ptes = 1; > + tlb->cleared_pmds = 1; > + tlb->cleared_puds = 1; > + tlb->cleared_p4ds = 1; > } > > tlb_flush_mmu(tlb); As a simple optimization, I think it is possible to hold multiple nesting counters in the mm, similar to tlb_flush_pending, for freed_tables, cleared_ptes, etc. The first time you set tlb->freed_tables, you also atomically increase mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use mm->tlb_flush_freed_tables instead of tlb->freed_tables.
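[Editor's note: a rough sketch of the per-mm counter idea Nadav suggests above. Nothing like this was merged; the struct, field and helper names are hypothetical, and the decrement/teardown side is omitted.]

```c
#include <linux/atomic.h>
#include <asm/tlb.h>

/* Imagined additions to mm_struct, mirroring the per-gather hints mm-wide. */
struct mm_gather_hints {			/* hypothetical */
	atomic_t tlb_flush_freed_tables;
	atomic_t tlb_flush_cleared_ptes;
};

static inline void tlb_note_freed_tables(struct mmu_gather *tlb,
					 struct mm_gather_hints *hints)
{
	if (!tlb->freed_tables) {
		tlb->freed_tables = 1;
		/* first time this gather frees tables: publish it mm-wide */
		atomic_inc(&hints->tlb_flush_freed_tables);
	}
}

/*
 * In tlb_flush_mmu(), the flush decision would then consult the mm-wide
 * counter rather than only the local bit, e.g.:
 *
 *	freed = tlb->freed_tables ||
 *		atomic_read(&hints->tlb_flush_freed_tables);
 */
```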
On 5/9/19 3:38 AM, Peter Zijlstra wrote: > On Thu, May 09, 2019 at 09:37:26AM +0100, Will Deacon wrote: >> Hi all, [+Peter] > Right, mm/mmu_gather.c has a MAINTAINERS entry; use it. Sorry for that, I didn't realize we have mmu_gather maintainers. I should run maintainer.pl. > > Also added Nadav and Minchan who've poked at this issue before. And Mel, > because he loves these things :-) > >> Apologies for the delay; I'm attending a conference this week so it's tricky >> to keep up with email. >> >> On Wed, May 08, 2019 at 05:34:49AM +0800, Yang Shi wrote: >>> A few new fields were added to mmu_gather to make TLB flush smarter for >>> huge page by telling what level of page table is changed. >>> >>> __tlb_reset_range() is used to reset all these page table state to >>> unchanged, which is called by TLB flush for parallel mapping changes for >>> the same range under non-exclusive lock (i.e. read mmap_sem). Before >>> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in >>> munmap"), MADV_DONTNEED is the only one who may do page zapping in >>> parallel and it doesn't remove page tables. But, the forementioned commit >>> may do munmap() under read mmap_sem and free page tables. This causes a >>> bug [1] reported by Jan Stancek since __tlb_reset_range() may pass the > Please don't _EVER_ refer to external sources to describe the actual bug > a patch is fixing. That is the primary purpose of the Changelog. > > Worse, the email you reference does _NOT_ describe the actual problem. > Nor do you. Sure, will articulate the real bug in the commit log. > >>> wrong page table state to architecture specific TLB flush operations. >> Yikes. Is it actually safe to run free_pgtables() concurrently for a given >> mm? > Yeah.. sorta.. it's been a source of 'interesting' things. This really > isn't the first issue here. > > Also, change_protection_range() is 'fun' too. > >>> So, removing __tlb_reset_range() sounds sane. This may cause more TLB >>> flush for MADV_DONTNEED, but it should be not called very often, hence >>> the impact should be negligible. >>> >>> The original proposed fix came from Jan Stancek who mainly debugged this >>> issue, I just wrapped up everything together. >> I'm still paging the nested flush logic back in, but I have some comments on >> the patch below. >> >>> [1] https://lore.kernel.org/linux-mm/342bf1fd-f1bf-ed62-1127-e911b5032274@linux.alibaba.com/T/#m7a2ab6c878d5a256560650e56189cfae4e73217f >>> >>> Reported-by: Jan Stancek <jstancek@redhat.com> >>> Tested-by: Jan Stancek <jstancek@redhat.com> >>> Cc: Will Deacon <will.deacon@arm.com> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >>> Signed-off-by: Jan Stancek <jstancek@redhat.com> >>> --- >>> mm/mmu_gather.c | 7 ++++--- >>> 1 file changed, 4 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c >>> index 99740e1..9fd5272 100644 >>> --- a/mm/mmu_gather.c >>> +++ b/mm/mmu_gather.c >>> @@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb, >>> * flush by batching, a thread has stable TLB entry can fail to flush >> Urgh, we should rewrite this comment while we're here so that it makes sense... > Yeah, that's atrocious. We should put the actual race in there. > >>> * the TLB by observing pte_none|!pte_dirty, for example so flush TLB >>> * forcefully if we detect parallel PTE batching threads. >>> + * >>> + * munmap() may change mapping under non-excluse lock and also free >>> + * page tables. Do not call __tlb_reset_range() for it. 
>>> */ >>> - if (mm_tlb_flush_nested(tlb->mm)) { >>> - __tlb_reset_range(tlb); >>> + if (mm_tlb_flush_nested(tlb->mm)) >>> __tlb_adjust_range(tlb, start, end - start); >>> - } >> I don't think we can elide the call __tlb_reset_range() entirely, since I >> think we do want to clear the freed_pXX bits to ensure that we walk the >> range with the smallest mapping granule that we have. Otherwise couldn't we >> have a problem if we hit a PMD that had been cleared, but the TLB >> invalidation for the PTEs that used to be linked below it was still pending? > That's tlb->cleared_p*, and yes agreed. That is, right until some > architecture has level dependent TLBI instructions, at which point we'll > need to have them all set instead of cleared. > >> Perhaps we should just set fullmm if we see that here's a concurrent >> unmapper rather than do a worst-case range invalidation. Do you have a feeling >> for often the mm_tlb_flush_nested() triggers in practice? > Quite a bit for certain workloads I imagine, that was the whole point of > doing it. > > Anyway; am I correct in understanding that the actual problem is that > we've cleared freed_tables and the ARM64 tlb_flush() will then not > invalidate the cache and badness happens? Yes. > > Because so far nobody has actually provided a coherent description of > the actual problem we're trying to solve. But I'm thinking something > like the below ought to do. Thanks for the suggestion, will do in v2. > > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 99740e1dd273..fe768f8d612e 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > unsigned long start, unsigned long end) > { > /* > - * If there are parallel threads are doing PTE changes on same range > - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > - * flush by batching, a thread has stable TLB entry can fail to flush > - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > - * forcefully if we detect parallel PTE batching threads. > + * Sensible comment goes here.. > */ > - if (mm_tlb_flush_nested(tlb->mm)) { > - __tlb_reset_range(tlb); > - __tlb_adjust_range(tlb, start, end - start); > + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > + /* > + * Since we're can't tell what we actually should have > + * flushed flush everything in the given range. > + */ > + tlb->start = start; > + tlb->end = end; > + tlb->freed_tables = 1; > + tlb->cleared_ptes = 1; > + tlb->cleared_pmds = 1; > + tlb->cleared_puds = 1; > + tlb->cleared_p4ds = 1; > } > > tlb_flush_mmu(tlb);
On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote: > > On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > > index 99740e1dd273..fe768f8d612e 100644 > > --- a/mm/mmu_gather.c > > +++ b/mm/mmu_gather.c > > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > > unsigned long start, unsigned long end) > > { > > /* > > - * If there are parallel threads are doing PTE changes on same range > > - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > > - * flush by batching, a thread has stable TLB entry can fail to flush > > - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > > - * forcefully if we detect parallel PTE batching threads. > > + * Sensible comment goes here.. > > */ > > - if (mm_tlb_flush_nested(tlb->mm)) { > > - __tlb_reset_range(tlb); > > - __tlb_adjust_range(tlb, start, end - start); > > + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > > + /* > > + * Since we're can't tell what we actually should have > > + * flushed flush everything in the given range. > > + */ > > + tlb->start = start; > > + tlb->end = end; > > + tlb->freed_tables = 1; > > + tlb->cleared_ptes = 1; > > + tlb->cleared_pmds = 1; > > + tlb->cleared_puds = 1; > > + tlb->cleared_p4ds = 1; > > } > > > > tlb_flush_mmu(tlb); > > As a simple optimization, I think it is possible to hold multiple nesting > counters in the mm, similar to tlb_flush_pending, for freed_tables, > cleared_ptes, etc. > > The first time you set tlb->freed_tables, you also atomically increase > mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use > mm->tlb_flush_freed_tables instead of tlb->freed_tables. That sounds fraught with races and expensive; I would much prefer to not go there for this arguably rare case. Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 races and doesn't see that PTE. Therefore CPU-0 sets and counts cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, it will see cleared_ptes count increased and flush that granularity, OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall miss an invalidate it should have had. This whole concurrent mmu_gather stuff is horrible. /me ponders more.... So I think the fundamental race here is this: CPU-0 CPU-1 tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2, .end=3); .end=4); ptep_get_and_clear_full(2) tlb_remove_tlb_entry(2); __tlb_remove_page(); if (pte_present(2)) // nope tlb_finish_mmu(); // continue without TLBI(2) // whoopsie tlb_finish_mmu(); tlb_flush() -> TLBI(2) And we can fix that by having tlb_finish_mmu() sync up. Never let a concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers have completed. This should not be too hard to make happen.
On 5/9/19 3:54 AM, Peter Zijlstra wrote: > On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote: > >> That's tlb->cleared_p*, and yes agreed. That is, right until some >> architecture has level dependent TLBI instructions, at which point we'll >> need to have them all set instead of cleared. >> Anyway; am I correct in understanding that the actual problem is that >> we've cleared freed_tables and the ARM64 tlb_flush() will then not >> invalidate the cache and badness happens? >> >> Because so far nobody has actually provided a coherent description of >> the actual problem we're trying to solve. But I'm thinking something >> like the below ought to do. > There's another 'fun' issue I think. For architectures like ARM that > have range invalidation and care about VM_EXEC for I$ invalidation, the > below doesn't quite work right either. > > I suspect we also have to force: tlb->vma_exec = 1. Isn't the below code in tlb_flush enough to guarantee this? ... } else if (tlb->end) { struct vm_area_struct vma = { .vm_mm = tlb->mm, .vm_flags = (tlb->vma_exec ? VM_EXEC : 0) | (tlb->vma_huge ? VM_HUGETLB : 0), }; ... > > And I don't think there's an architecture that cares, but depending on > details I can construct cases where any setting of tlb->vm_hugetlb is > wrong, so that is _awesome_. But I suspect the sane thing for now is to > force it 0. > >> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c >> index 99740e1dd273..fe768f8d612e 100644 >> --- a/mm/mmu_gather.c >> +++ b/mm/mmu_gather.c >> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, >> unsigned long start, unsigned long end) >> { >> /* >> - * If there are parallel threads are doing PTE changes on same range >> - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB >> - * flush by batching, a thread has stable TLB entry can fail to flush >> - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB >> - * forcefully if we detect parallel PTE batching threads. >> + * Sensible comment goes here.. >> */ >> - if (mm_tlb_flush_nested(tlb->mm)) { >> - __tlb_reset_range(tlb); >> - __tlb_adjust_range(tlb, start, end - start); >> + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { >> + /* >> + * Since we're can't tell what we actually should have >> + * flushed flush everything in the given range. >> + */ >> + tlb->start = start; >> + tlb->end = end; >> + tlb->freed_tables = 1; >> + tlb->cleared_ptes = 1; >> + tlb->cleared_pmds = 1; >> + tlb->cleared_puds = 1; >> + tlb->cleared_p4ds = 1; >> } >> >> tlb_flush_mmu(tlb);
On Thu, May 09, 2019 at 11:35:55AM -0700, Yang Shi wrote: > > > On 5/9/19 3:54 AM, Peter Zijlstra wrote: > > On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote: > > > > > That's tlb->cleared_p*, and yes agreed. That is, right until some > > > architecture has level dependent TLBI instructions, at which point we'll > > > need to have them all set instead of cleared. > > > Anyway; am I correct in understanding that the actual problem is that > > > we've cleared freed_tables and the ARM64 tlb_flush() will then not > > > invalidate the cache and badness happens? > > > > > > Because so far nobody has actually provided a coherent description of > > > the actual problem we're trying to solve. But I'm thinking something > > > like the below ought to do. > > There's another 'fun' issue I think. For architectures like ARM that > > have range invalidation and care about VM_EXEC for I$ invalidation, the > > below doesn't quite work right either. > > > > I suspect we also have to force: tlb->vma_exec = 1. > > Isn't the below code in tlb_flush enough to guarantee this? > > ... > } else if (tlb->end) { > struct vm_area_struct vma = { > .vm_mm = tlb->mm, > .vm_flags = (tlb->vma_exec ? VM_EXEC : 0) | > (tlb->vma_huge ? VM_HUGETLB : 0), > }; Only when vma_exec is actually set... and there is no guarantee of that in the concurrent path (the last VMA we iterate might not be executable, but the TLBI we've missed might have been). More specific, the 'fun' case is if we have no present page in the whole executable page, in that case tlb->end == 0 and we never call into the arch code, never giving it chance to flush I$.
On 5/9/19 11:24 AM, Peter Zijlstra wrote: > On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote: >>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c >>> index 99740e1dd273..fe768f8d612e 100644 >>> --- a/mm/mmu_gather.c >>> +++ b/mm/mmu_gather.c >>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, >>> unsigned long start, unsigned long end) >>> { >>> /* >>> - * If there are parallel threads are doing PTE changes on same range >>> - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB >>> - * flush by batching, a thread has stable TLB entry can fail to flush >>> - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB >>> - * forcefully if we detect parallel PTE batching threads. >>> + * Sensible comment goes here.. >>> */ >>> - if (mm_tlb_flush_nested(tlb->mm)) { >>> - __tlb_reset_range(tlb); >>> - __tlb_adjust_range(tlb, start, end - start); >>> + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { >>> + /* >>> + * Since we're can't tell what we actually should have >>> + * flushed flush everything in the given range. >>> + */ >>> + tlb->start = start; >>> + tlb->end = end; >>> + tlb->freed_tables = 1; >>> + tlb->cleared_ptes = 1; >>> + tlb->cleared_pmds = 1; >>> + tlb->cleared_puds = 1; >>> + tlb->cleared_p4ds = 1; >>> } >>> >>> tlb_flush_mmu(tlb); >> As a simple optimization, I think it is possible to hold multiple nesting >> counters in the mm, similar to tlb_flush_pending, for freed_tables, >> cleared_ptes, etc. >> >> The first time you set tlb->freed_tables, you also atomically increase >> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use >> mm->tlb_flush_freed_tables instead of tlb->freed_tables. > That sounds fraught with races and expensive; I would much prefer to not > go there for this arguably rare case. > > Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 > races and doesn't see that PTE. Therefore CPU-0 sets and counts > cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, > it will see cleared_ptes count increased and flush that granularity, > OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall > miss an invalidate it should have had. > > This whole concurrent mmu_gather stuff is horrible. > > /me ponders more.... > > So I think the fundamental race here is this: > > CPU-0 CPU-1 > > tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2, > .end=3); .end=4); > > ptep_get_and_clear_full(2) > tlb_remove_tlb_entry(2); > __tlb_remove_page(); > if (pte_present(2)) // nope > > tlb_finish_mmu(); > > // continue without TLBI(2) > // whoopsie > > tlb_finish_mmu(); > tlb_flush() -> TLBI(2) I'm not quite sure if this is the case Jan really met. But, according to his test, once correct tlb->freed_tables and tlb->cleared_* are set, his test works well. > > > And we can fix that by having tlb_finish_mmu() sync up. Never let a > concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers > have completed. Not sure if this will scale well. > > This should not be too hard to make happen.
On Thu, May 09, 2019 at 12:38:13PM +0200, Peter Zijlstra wrote: > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > index 99740e1dd273..fe768f8d612e 100644 > --- a/mm/mmu_gather.c > +++ b/mm/mmu_gather.c > @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > unsigned long start, unsigned long end) > { > /* > - * If there are parallel threads are doing PTE changes on same range > - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > - * flush by batching, a thread has stable TLB entry can fail to flush > - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > - * forcefully if we detect parallel PTE batching threads. > + * Sensible comment goes here.. > */ > - if (mm_tlb_flush_nested(tlb->mm)) { > - __tlb_reset_range(tlb); > - __tlb_adjust_range(tlb, start, end - start); > + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > + /* > + * Since we're can't tell what we actually should have > + * flushed flush everything in the given range. > + */ > + tlb->start = start; > + tlb->end = end; > + tlb->freed_tables = 1; > + tlb->cleared_ptes = 1; > + tlb->cleared_pmds = 1; > + tlb->cleared_puds = 1; > + tlb->cleared_p4ds = 1; > } > > tlb_flush_mmu(tlb); So PPC-radix has page-size dependent TLBI, but the above doesn't work for them, because they use the tlb_change_page_size() interface and don't look at tlb->cleared_p*(). Concequently, they have their own special magic :/ Nick, how about you use the tlb_change_page_size() interface to find/flush on the page-size boundaries, but otherwise use the tlb->cleared_p* flags to select which actual sizes to flush? AFAICT that should work just fine for you guys. Maybe something like so? (fwiw, there's an aweful lot of almost identical functions there) --- diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c index 6a23b9ebd2a1..efc99ef78db6 100644 --- a/arch/powerpc/mm/tlb-radix.c +++ b/arch/powerpc/mm/tlb-radix.c @@ -692,7 +692,7 @@ static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_ static inline void __radix__flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end, - bool flush_all_sizes) + bool pflush, bool hflush, bool gflush) { unsigned long pid; @@ -734,14 +734,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm, _tlbie_pid(pid, RIC_FLUSH_TLB); } } else { - bool hflush = flush_all_sizes; - bool gflush = flush_all_sizes; unsigned long hstart, hend; unsigned long gstart, gend; - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) - hflush = true; - if (hflush) { hstart = (start + PMD_SIZE - 1) & PMD_MASK; hend = end & PMD_MASK; @@ -758,7 +753,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm, asm volatile("ptesync": : :"memory"); if (local) { - __tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize); + if (pflush) + __tlbiel_va_range(start, end, pid, + page_size, mmu_virtual_psize); if (hflush) __tlbiel_va_range(hstart, hend, pid, PMD_SIZE, MMU_PAGE_2M); @@ -767,7 +764,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm, PUD_SIZE, MMU_PAGE_1G); asm volatile("ptesync": : :"memory"); } else { - __tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize); + if (pflush) + __tlbie_va_range(start, end, pid, + page_size, mmu_virtual_psize); if (hflush) __tlbie_va_range(hstart, hend, pid, PMD_SIZE, MMU_PAGE_2M); @@ -785,12 +784,17 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { + bool hflush = false; + #ifdef CONFIG_HUGETLB_PAGE if 
(is_vm_hugetlb_page(vma)) return radix__flush_hugetlb_tlb_range(vma, start, end); #endif - __radix__flush_tlb_range(vma->vm_mm, start, end, false); + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) + hflush = true; + + __radix__flush_tlb_range(vma->vm_mm, start, end, true, hflush, false); } EXPORT_SYMBOL(radix__flush_tlb_range); @@ -881,49 +885,14 @@ void radix__tlb_flush(struct mmu_gather *tlb) */ if (tlb->fullmm) { __flush_all_mm(mm, true); -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE) - } else if (mm_tlb_flush_nested(mm)) { - /* - * If there is a concurrent invalidation that is clearing ptes, - * then it's possible this invalidation will miss one of those - * cleared ptes and miss flushing the TLB. If this invalidate - * returns before the other one flushes TLBs, that can result - * in it returning while there are still valid TLBs inside the - * range to be invalidated. - * - * See mm/memory.c:tlb_finish_mmu() for more details. - * - * The solution to this is ensure the entire range is always - * flushed here. The problem for powerpc is that the flushes - * are page size specific, so this "forced flush" would not - * do the right thing if there are a mix of page sizes in - * the range to be invalidated. So use __flush_tlb_range - * which invalidates all possible page sizes in the range. - * - * PWC flush probably is not be required because the core code - * shouldn't free page tables in this path, but accounting - * for the possibility makes us a bit more robust. - * - * need_flush_all is an uncommon case because page table - * teardown should be done with exclusive locks held (but - * after locks are dropped another invalidate could come - * in), it could be optimized further if necessary. - */ - if (!tlb->need_flush_all) - __radix__flush_tlb_range(mm, start, end, true); - else - radix__flush_all_mm(mm); -#endif - } else if ( (psize = radix_get_mmu_psize(page_size)) == -1) { - if (!tlb->need_flush_all) - radix__flush_tlb_mm(mm); - else - radix__flush_all_mm(mm); } else { if (!tlb->need_flush_all) - radix__flush_tlb_range_psize(mm, start, end, psize); + __radix__flush_tlb_range(mm, start, end, + tlb->cleared_pte, + tlb->cleared_pmd, + tlb->cleared_pud); else - radix__flush_tlb_pwc_range_psize(mm, start, end, psize); + radix__flush_all_mm(mm); } tlb->need_flush_all = 0; }
----- Original Message ----- > > > On 5/9/19 11:24 AM, Peter Zijlstra wrote: > > On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote: > >>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > >>> index 99740e1dd273..fe768f8d612e 100644 > >>> --- a/mm/mmu_gather.c > >>> +++ b/mm/mmu_gather.c > >>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > >>> unsigned long start, unsigned long end) > >>> { > >>> /* > >>> - * If there are parallel threads are doing PTE changes on same range > >>> - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > >>> - * flush by batching, a thread has stable TLB entry can fail to flush > >>> - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > >>> - * forcefully if we detect parallel PTE batching threads. > >>> + * Sensible comment goes here.. > >>> */ > >>> - if (mm_tlb_flush_nested(tlb->mm)) { > >>> - __tlb_reset_range(tlb); > >>> - __tlb_adjust_range(tlb, start, end - start); > >>> + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > >>> + /* > >>> + * Since we're can't tell what we actually should have > >>> + * flushed flush everything in the given range. > >>> + */ > >>> + tlb->start = start; > >>> + tlb->end = end; > >>> + tlb->freed_tables = 1; > >>> + tlb->cleared_ptes = 1; > >>> + tlb->cleared_pmds = 1; > >>> + tlb->cleared_puds = 1; > >>> + tlb->cleared_p4ds = 1; > >>> } > >>> > >>> tlb_flush_mmu(tlb); > >> As a simple optimization, I think it is possible to hold multiple nesting > >> counters in the mm, similar to tlb_flush_pending, for freed_tables, > >> cleared_ptes, etc. > >> > >> The first time you set tlb->freed_tables, you also atomically increase > >> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use > >> mm->tlb_flush_freed_tables instead of tlb->freed_tables. > > That sounds fraught with races and expensive; I would much prefer to not > > go there for this arguably rare case. > > > > Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 > > races and doesn't see that PTE. Therefore CPU-0 sets and counts > > cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, > > it will see cleared_ptes count increased and flush that granularity, > > OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall > > miss an invalidate it should have had. > > > > This whole concurrent mmu_gather stuff is horrible. > > > > /me ponders more.... > > > > So I think the fundamental race here is this: > > > > CPU-0 CPU-1 > > > > tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2, > > .end=3); .end=4); > > > > ptep_get_and_clear_full(2) > > tlb_remove_tlb_entry(2); > > __tlb_remove_page(); > > if (pte_present(2)) // nope > > > > tlb_finish_mmu(); > > > > // continue without TLBI(2) > > // whoopsie > > > > tlb_finish_mmu(); > > tlb_flush() -> TLBI(2) > > I'm not quite sure if this is the case Jan really met. But, according to > his test, once correct tlb->freed_tables and tlb->cleared_* are set, his > test works well. 
My theory was following sequence: t1: map_write_unmap() t2: dummy() map_address = mmap() map_address[i] = 'b' munmap(map_address) downgrade_write(&mm->mmap_sem); unmap_region() tlb_gather_mmu() inc_tlb_flush_pending(tlb->mm); free_pgtables() tlb->freed_tables = 1 tlb->cleared_pmds = 1 pthread_exit() madvise(thread_stack, 8M, MADV_DONTNEED) zap_page_range() tlb_gather_mmu() inc_tlb_flush_pending(tlb->mm); tlb_finish_mmu() if (mm_tlb_flush_nested(tlb->mm)) __tlb_reset_range() tlb->freed_tables = 0 tlb->cleared_pmds = 0 __flush_tlb_range(last_level = 0) ... map_address = mmap() map_address[i] = 'b' <page fault loop> # PTE appeared valid to me, # so I suspected stale TLB entry at higher level as result of "freed_tables = 0" I'm happy to apply/run any debug patches to get more data that would help. > > > > > > > And we can fix that by having tlb_finish_mmu() sync up. Never let a > > concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers > > have completed. > > Not sure if this will scale well. > > > > > This should not be too hard to make happen. > >
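[Editor's note: a condensed userspace sketch of the racing pattern in Jan's trace above; the actual reproducer is the LTP test from the original report. One thread maps, writes and unmaps a region in a loop (munmap() can now run under the read-side mmap_sem and free page tables), while the main thread keeps creating and joining short-lived threads, whose exit path makes glibc call madvise(MADV_DONTNEED) on the dying stack, i.e. a second concurrent mmu_gather on the same mm.]

```c
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE	(16UL * 1024 * 1024)
#define ITERATIONS	100000

static void *dummy(void *arg)
{
	return arg;		/* exiting is enough: glibc madvises the stack */
}

static void *map_write_unmap(void *arg)
{
	unsigned long i;

	for (i = 0; i < ITERATIONS; i++) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			continue;
		memset(p, 'b', MAP_SIZE);	/* populate PTEs */
		munmap(p, MAP_SIZE);		/* may free page tables */
	}
	return arg;
}

int main(void)
{
	pthread_t writer, t;
	unsigned long i;

	pthread_create(&writer, NULL, map_write_unmap, NULL);
	for (i = 0; i < ITERATIONS; i++) {
		if (pthread_create(&t, NULL, dummy, NULL) == 0)
			pthread_join(t, NULL);
	}
	pthread_join(writer, NULL);
	return 0;
}
```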
[ Restoring the recipients after mistakenly pressing reply instead of reply-all ] > On May 9, 2019, at 12:11 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, May 09, 2019 at 06:50:00PM +0000, Nadav Amit wrote: >>> On May 9, 2019, at 11:24 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote: > >>>> As a simple optimization, I think it is possible to hold multiple nesting >>>> counters in the mm, similar to tlb_flush_pending, for freed_tables, >>>> cleared_ptes, etc. >>>> >>>> The first time you set tlb->freed_tables, you also atomically increase >>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use >>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables. >>> >>> That sounds fraught with races and expensive; I would much prefer to not >>> go there for this arguably rare case. >>> >>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 >>> races and doesn't see that PTE. Therefore CPU-0 sets and counts >>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, >>> it will see cleared_ptes count increased and flush that granularity, >>> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall >>> miss an invalidate it should have had. >> >> CPU-0 would send a TLB shootdown request to CPU-1 when it is done, so I >> don’t see the problem. The TLB shootdown mechanism is independent of the >> mmu_gather for the matter. > > Duh.. I still don't like those unconditional mm wide atomic counters. > >>> This whole concurrent mmu_gather stuff is horrible. >>> >>> /me ponders more.... >>> >>> So I think the fundamental race here is this: >>> >>> CPU-0 CPU-1 >>> >>> tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2, >>> .end=3); .end=4); >>> >>> ptep_get_and_clear_full(2) >>> tlb_remove_tlb_entry(2); >>> __tlb_remove_page(); >>> if (pte_present(2)) // nope >>> >>> tlb_finish_mmu(); >>> >>> // continue without TLBI(2) >>> // whoopsie >>> >>> tlb_finish_mmu(); >>> tlb_flush() -> TLBI(2) >>> >>> >>> And we can fix that by having tlb_finish_mmu() sync up. Never let a >>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers >>> have completed. >>> >>> This should not be too hard to make happen. >> >> This synchronization sounds much more expensive than what I proposed. But I >> agree that cache-lines that move from one CPU to another might become an >> issue. But I think that the scheme I suggested would minimize this overhead. > > Well, it would have a lot more unconditional atomic ops. My scheme only > waits when there is actual concurrency. Well, something has to give. I didn’t think that if the same core does the atomic op it would be too expensive. > I _think_ something like the below ought to work, but its not even been > near a compiler. The only problem is the unconditional wakeup; we can > play games to avoid that if we want to continue with this. > > Ideally we'd only do this when there's been actual overlap, but I've not > found a sensible way to detect that. > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 4ef4bbe78a1d..b70e35792d29 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) > * > * Therefore we must rely on tlb_flush_*() to guarantee order. 
> */ > - atomic_dec(&mm->tlb_flush_pending); > + if (atomic_dec_and_test(&mm->tlb_flush_pending)) { > + wake_up_var(&mm->tlb_flush_pending); > + } else { > + wait_event_var(&mm->tlb_flush_pending, > + !atomic_read_acquire(&mm->tlb_flush_pending)); > + } > } It still seems very expensive to me, at least for certain workloads (e.g., Apache with multithreaded MPM). It may be possible to avoid false-positive nesting indications (when the flushes do not overlap) by creating a new struct mmu_gather_pending, with something like: struct mmu_gather_pending { u64 start; u64 end; struct mmu_gather_pending *next; } tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending (pointing to the linked list) and find whether there is any overlap. This would still require synchronization (acquiring a lock when allocating and deallocating or something fancier).
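[Editor's note: a sketch of the overlap test Nadav describes above. struct mmu_gather_pending is his hypothetical structure; how the list is locked and how entries are added/removed around each gather is deliberately left out.]

```c
#include <linux/types.h>

struct mmu_gather_pending {		/* hypothetical, per Nadav's mail */
	unsigned long start;
	unsigned long end;
	struct mmu_gather_pending *next;
};

/*
 * tlb_finish_mmu() would walk the mm's list of pending gather ranges and
 * only fall back to the conservative flush when another range overlaps.
 */
static bool mmu_gather_range_overlaps(struct mmu_gather_pending *head,
				      unsigned long start, unsigned long end)
{
	struct mmu_gather_pending *p;

	for (p = head; p; p = p->next) {
		if (p->start < end && start < p->end)
			return true;	/* half-open ranges intersect */
	}
	return false;
}
```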
On 5/9/19 2:06 PM, Jan Stancek wrote: > ----- Original Message ----- >> >> On 5/9/19 11:24 AM, Peter Zijlstra wrote: >>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote: >>>>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c >>>>> index 99740e1dd273..fe768f8d612e 100644 >>>>> --- a/mm/mmu_gather.c >>>>> +++ b/mm/mmu_gather.c >>>>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, >>>>> unsigned long start, unsigned long end) >>>>> { >>>>> /* >>>>> - * If there are parallel threads are doing PTE changes on same range >>>>> - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB >>>>> - * flush by batching, a thread has stable TLB entry can fail to flush >>>>> - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB >>>>> - * forcefully if we detect parallel PTE batching threads. >>>>> + * Sensible comment goes here.. >>>>> */ >>>>> - if (mm_tlb_flush_nested(tlb->mm)) { >>>>> - __tlb_reset_range(tlb); >>>>> - __tlb_adjust_range(tlb, start, end - start); >>>>> + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { >>>>> + /* >>>>> + * Since we're can't tell what we actually should have >>>>> + * flushed flush everything in the given range. >>>>> + */ >>>>> + tlb->start = start; >>>>> + tlb->end = end; >>>>> + tlb->freed_tables = 1; >>>>> + tlb->cleared_ptes = 1; >>>>> + tlb->cleared_pmds = 1; >>>>> + tlb->cleared_puds = 1; >>>>> + tlb->cleared_p4ds = 1; >>>>> } >>>>> >>>>> tlb_flush_mmu(tlb); >>>> As a simple optimization, I think it is possible to hold multiple nesting >>>> counters in the mm, similar to tlb_flush_pending, for freed_tables, >>>> cleared_ptes, etc. >>>> >>>> The first time you set tlb->freed_tables, you also atomically increase >>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use >>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables. >>> That sounds fraught with races and expensive; I would much prefer to not >>> go there for this arguably rare case. >>> >>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 >>> races and doesn't see that PTE. Therefore CPU-0 sets and counts >>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, >>> it will see cleared_ptes count increased and flush that granularity, >>> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall >>> miss an invalidate it should have had. >>> >>> This whole concurrent mmu_gather stuff is horrible. >>> >>> /me ponders more.... >>> >>> So I think the fundamental race here is this: >>> >>> CPU-0 CPU-1 >>> >>> tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2, >>> .end=3); .end=4); >>> >>> ptep_get_and_clear_full(2) >>> tlb_remove_tlb_entry(2); >>> __tlb_remove_page(); >>> if (pte_present(2)) // nope >>> >>> tlb_finish_mmu(); >>> >>> // continue without TLBI(2) >>> // whoopsie >>> >>> tlb_finish_mmu(); >>> tlb_flush() -> TLBI(2) >> I'm not quite sure if this is the case Jan really met. But, according to >> his test, once correct tlb->freed_tables and tlb->cleared_* are set, his >> test works well. 
> My theory was following sequence: > > t1: map_write_unmap() t2: dummy() > > map_address = mmap() > map_address[i] = 'b' > munmap(map_address) > downgrade_write(&mm->mmap_sem); > unmap_region() > tlb_gather_mmu() > inc_tlb_flush_pending(tlb->mm); > free_pgtables() > tlb->freed_tables = 1 > tlb->cleared_pmds = 1 > > pthread_exit() > madvise(thread_stack, 8M, MADV_DONTNEED) I'm not quite familiar with the implementation detail of pthread_exit(), does pthread_exit() call MADV_DONTNEED all the time? I don't see your test call it. If so this pattern is definitely possible. > zap_page_range() > tlb_gather_mmu() > inc_tlb_flush_pending(tlb->mm); > > tlb_finish_mmu() > if (mm_tlb_flush_nested(tlb->mm)) > __tlb_reset_range() > tlb->freed_tables = 0 > tlb->cleared_pmds = 0 > __flush_tlb_range(last_level = 0) > ... > map_address = mmap() > map_address[i] = 'b' > <page fault loop> > # PTE appeared valid to me, > # so I suspected stale TLB entry at higher level as result of "freed_tables = 0" > > > I'm happy to apply/run any debug patches to get more data that would help. > >>> >>> And we can fix that by having tlb_finish_mmu() sync up. Never let a >>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers >>> have completed. >> Not sure if this will scale well. >> >>> This should not be too hard to make happen. >>
----- Original Message ----- > > > On 5/9/19 2:06 PM, Jan Stancek wrote: > > ----- Original Message ----- > >> > >> On 5/9/19 11:24 AM, Peter Zijlstra wrote: > >>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote: > >>>>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> > >>>>> wrote: > >>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > >>>>> index 99740e1dd273..fe768f8d612e 100644 > >>>>> --- a/mm/mmu_gather.c > >>>>> +++ b/mm/mmu_gather.c > >>>>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb, > >>>>> unsigned long start, unsigned long end) > >>>>> { > >>>>> /* > >>>>> - * If there are parallel threads are doing PTE changes on same range > >>>>> - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB > >>>>> - * flush by batching, a thread has stable TLB entry can fail to flush > >>>>> - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB > >>>>> - * forcefully if we detect parallel PTE batching threads. > >>>>> + * Sensible comment goes here.. > >>>>> */ > >>>>> - if (mm_tlb_flush_nested(tlb->mm)) { > >>>>> - __tlb_reset_range(tlb); > >>>>> - __tlb_adjust_range(tlb, start, end - start); > >>>>> + if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) { > >>>>> + /* > >>>>> + * Since we're can't tell what we actually should have > >>>>> + * flushed flush everything in the given range. > >>>>> + */ > >>>>> + tlb->start = start; > >>>>> + tlb->end = end; > >>>>> + tlb->freed_tables = 1; > >>>>> + tlb->cleared_ptes = 1; > >>>>> + tlb->cleared_pmds = 1; > >>>>> + tlb->cleared_puds = 1; > >>>>> + tlb->cleared_p4ds = 1; > >>>>> } > >>>>> > >>>>> tlb_flush_mmu(tlb); > >>>> As a simple optimization, I think it is possible to hold multiple > >>>> nesting > >>>> counters in the mm, similar to tlb_flush_pending, for freed_tables, > >>>> cleared_ptes, etc. > >>>> > >>>> The first time you set tlb->freed_tables, you also atomically increase > >>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use > >>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables. > >>> That sounds fraught with races and expensive; I would much prefer to not > >>> go there for this arguably rare case. > >>> > >>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1 > >>> races and doesn't see that PTE. Therefore CPU-0 sets and counts > >>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather, > >>> it will see cleared_ptes count increased and flush that granularity, > >>> OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall > >>> miss an invalidate it should have had. > >>> > >>> This whole concurrent mmu_gather stuff is horrible. > >>> > >>> /me ponders more.... > >>> > >>> So I think the fundamental race here is this: > >>> > >>> CPU-0 CPU-1 > >>> > >>> tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2, > >>> .end=3); .end=4); > >>> > >>> ptep_get_and_clear_full(2) > >>> tlb_remove_tlb_entry(2); > >>> __tlb_remove_page(); > >>> if (pte_present(2)) // nope > >>> > >>> tlb_finish_mmu(); > >>> > >>> // continue without TLBI(2) > >>> // whoopsie > >>> > >>> tlb_finish_mmu(); > >>> tlb_flush() -> TLBI(2) > >> I'm not quite sure if this is the case Jan really met. But, according to > >> his test, once correct tlb->freed_tables and tlb->cleared_* are set, his > >> test works well. 
> > My theory was following sequence: > > > > t1: map_write_unmap() t2: dummy() > > > > map_address = mmap() > > map_address[i] = 'b' > > munmap(map_address) > > downgrade_write(&mm->mmap_sem); > > unmap_region() > > tlb_gather_mmu() > > inc_tlb_flush_pending(tlb->mm); > > free_pgtables() > > tlb->freed_tables = 1 > > tlb->cleared_pmds = 1 > > > > pthread_exit() > > madvise(thread_stack, 8M, > > MADV_DONTNEED) > > I'm not quite familiar with the implementation detail of pthread_exit(), > does pthread_exit() call MADV_DONTNEED all the time? I don't see your > test call it. It's called by glibc: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/allocatestack.c;h=fcbc46f0d796abce8d58970d4a1d3df685981e33;hb=refs/heads/master#l380 https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_create.c;h=18b7bbe7659c027dfd7b0ce3b0c83f54a6f15b18;hb=refs/heads/master#l569 (gdb) bt #0 madvise () at ../sysdeps/unix/syscall-template.S:78 #1 0x0000ffffbe7679f8 in advise_stack_range (guardsize=<optimized out>, pd=281474976706191, size=<optimized out>, mem=0xffffbddd0000) at allocatestack.c:392 #2 start_thread (arg=0xffffffffee8f) at pthread_create.c:576 #3 0x0000ffffbe6b157c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78 Dump of assembler code for function madvise: => 0x0000ffffbe6adaf0 <+0>: mov x8, #0xe9 // #233 0x0000ffffbe6adaf4 <+4>: svc #0x0 0x0000ffffbe6adaf8 <+8>: cmn x0, #0xfff 0x0000ffffbe6adafc <+12>: b.cs 0xffffbe6adb04 <madvise+20> // b.hs, b.nlast 0x0000ffffbe6adb00 <+16>: ret 0x0000ffffbe6adb04 <+20>: b 0xffffbe600e18 <__GI___syscall_error> > If so this pattern is definitely possible. > > > zap_page_range() > > tlb_gather_mmu() > > inc_tlb_flush_pending(tlb->mm); > > > > tlb_finish_mmu() > > if (mm_tlb_flush_nested(tlb->mm)) > > __tlb_reset_range() > > tlb->freed_tables = 0 > > tlb->cleared_pmds = 0 > > __flush_tlb_range(last_level = 0) > > ... > > map_address = mmap() > > map_address[i] = 'b' > > <page fault loop> > > # PTE appeared valid to me, > > # so I suspected stale TLB entry at higher level as result of > > "freed_tables = 0" > > > > > > I'm happy to apply/run any debug patches to get more data that would help. > > > >>> > >>> And we can fix that by having tlb_finish_mmu() sync up. Never let a > >>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers > >>> have completed. > >> Not sure if this will scale well. > >> > >>> This should not be too hard to make happen. > >> > >
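In other words, the exiting thread hands the bulk of its stack back to the kernel with MADV_DONTNEED before the stack is parked for reuse. A rough restatement of what that glibc path amounts to is below; guard and descriptor handling are simplified, and the function itself is hypothetical rather than glibc code (see nptl/allocatestack.c for the real logic).

#include <stddef.h>
#include <sys/mman.h>

static void drop_unused_stack(void *stack_base, size_t stack_size,
			      size_t guard_size, size_t keep_size)
{
	/* Skip the guard area and keep the top part still in use by the
	 * exiting thread and its descriptor. */
	char *start = (char *)stack_base + guard_size;
	size_t len = stack_size - guard_size - keep_size;

	/* This is the zap_page_range() side of the race: MADV_DONTNEED
	 * runs with mmap_sem held for read only. */
	madvise(start, len, MADV_DONTNEED);
}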
On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: > >>> And we can fix that by having tlb_finish_mmu() sync up. Never let a > >>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers > >>> have completed. > >>> > >>> This should not be too hard to make happen. > >> > >> This synchronization sounds much more expensive than what I proposed. But I > >> agree that cache-lines that move from one CPU to another might become an > >> issue. But I think that the scheme I suggested would minimize this overhead. > > > > Well, it would have a lot more unconditional atomic ops. My scheme only > > waits when there is actual concurrency. > > Well, something has to give. I didn’t think that if the same core does the > atomic op it would be too expensive. They're still at least 20 cycles a pop, uncontended. > > I _think_ something like the below ought to work, but its not even been > > near a compiler. The only problem is the unconditional wakeup; we can > > play games to avoid that if we want to continue with this. > > > > Ideally we'd only do this when there's been actual overlap, but I've not > > found a sensible way to detect that. > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 4ef4bbe78a1d..b70e35792d29 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) > > * > > * Therefore we must rely on tlb_flush_*() to guarantee order. > > */ > > - atomic_dec(&mm->tlb_flush_pending); > > + if (atomic_dec_and_test(&mm->tlb_flush_pending)) { > > + wake_up_var(&mm->tlb_flush_pending); > > + } else { > > + wait_event_var(&mm->tlb_flush_pending, > > + !atomic_read_acquire(&mm->tlb_flush_pending)); > > + } > > } > > It still seems very expensive to me, at least for certain workloads (e.g., > Apache with multithreaded MPM). Is that Apache-MPM workload triggering this lots? Having a known benchmark for this stuff is good for when someone has time to play with things. > It may be possible to avoid false-positive nesting indications (when the > flushes do not overlap) by creating a new struct mmu_gather_pending, with > something like: > > struct mmu_gather_pending { > u64 start; > u64 end; > struct mmu_gather_pending *next; > } > > tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending > (pointing to the linked list) and find whether there is any overlap. This > would still require synchronization (acquiring a lock when allocating and > deallocating or something fancier). We have an interval_tree for this, and yes, that's how far I got :/ The other thing I was thinking of is trying to detect overlap through the page-tables themselves, but we have a distinct lack of storage there. The things is, if this threaded monster runs on all CPUs (busy front end server) and does a ton of invalidation due to all the short lived request crud, then all the extra invalidations will add up too. Having to do process (machine in this case) wide invalidations is expensive, having to do more of them surely isn't cheap either. So there might be something to win here.
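For completeness, a standalone sketch of the overlap-tracking idea floated above: each gather publishes its range, and the finishing side only treats the flush as nested when another in-flight gather really overlaps it. A single global mutex is used here purely for illustration; the kernel would want the interval tree Peter mentions, or something cheaper, and all names below are made up.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct gather_range {
	uint64_t start, end;		/* [start, end) being invalidated */
	struct gather_range *next;
};

static struct gather_range *pending_head;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* tlb_gather_mmu() side: publish the range this gather will touch. */
static struct gather_range *range_publish(uint64_t start, uint64_t end)
{
	struct gather_range *r = malloc(sizeof(*r));

	if (!r)
		abort();
	r->start = start;
	r->end = end;
	pthread_mutex_lock(&pending_lock);
	r->next = pending_head;
	pending_head = r;
	pthread_mutex_unlock(&pending_lock);
	return r;
}

/* tlb_finish_mmu() side: only force the worst-case flush when another
 * in-flight gather actually overlaps ours. */
static bool range_overlaps_other(const struct gather_range *self)
{
	const struct gather_range *r;
	bool overlap = false;

	pthread_mutex_lock(&pending_lock);
	for (r = pending_head; r; r = r->next) {
		if (r != self && r->start < self->end && self->start < r->end) {
			overlap = true;
			break;
		}
	}
	pthread_mutex_unlock(&pending_lock);
	return overlap;
}

/* After the flush: retire the published range. */
static void range_retire(struct gather_range *self)
{
	struct gather_range **pp;

	pthread_mutex_lock(&pending_lock);
	for (pp = &pending_head; *pp; pp = &(*pp)->next) {
		if (*pp == self) {
			*pp = self->next;
			break;
		}
	}
	pthread_mutex_unlock(&pending_lock);
	free(self);
}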
> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: > >>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a >>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers >>>>> have completed. >>>>> >>>>> This should not be too hard to make happen. >>>> >>>> This synchronization sounds much more expensive than what I proposed. But I >>>> agree that cache-lines that move from one CPU to another might become an >>>> issue. But I think that the scheme I suggested would minimize this overhead. >>> >>> Well, it would have a lot more unconditional atomic ops. My scheme only >>> waits when there is actual concurrency. >> >> Well, something has to give. I didn’t think that if the same core does the >> atomic op it would be too expensive. > > They're still at least 20 cycles a pop, uncontended. > >>> I _think_ something like the below ought to work, but its not even been >>> near a compiler. The only problem is the unconditional wakeup; we can >>> play games to avoid that if we want to continue with this. >>> >>> Ideally we'd only do this when there's been actual overlap, but I've not >>> found a sensible way to detect that. >>> >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>> index 4ef4bbe78a1d..b70e35792d29 100644 >>> --- a/include/linux/mm_types.h >>> +++ b/include/linux/mm_types.h >>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) >>> * >>> * Therefore we must rely on tlb_flush_*() to guarantee order. >>> */ >>> - atomic_dec(&mm->tlb_flush_pending); >>> + if (atomic_dec_and_test(&mm->tlb_flush_pending)) { >>> + wake_up_var(&mm->tlb_flush_pending); >>> + } else { >>> + wait_event_var(&mm->tlb_flush_pending, >>> + !atomic_read_acquire(&mm->tlb_flush_pending)); >>> + } >>> } >> >> It still seems very expensive to me, at least for certain workloads (e.g., >> Apache with multithreaded MPM). > > Is that Apache-MPM workload triggering this lots? Having a known > benchmark for this stuff is good for when someone has time to play with > things. Setting Apache2 with mpm_worker causes every request to go through mmap-writev-munmap flow on every thread. I didn’t run this workload after the patches that downgrade the mmap_sem to read before the page-table zapping were introduced. I presume these patches would allow the page-table zapping to be done concurrently, and therefore would hit this flow. >> It may be possible to avoid false-positive nesting indications (when the >> flushes do not overlap) by creating a new struct mmu_gather_pending, with >> something like: >> >> struct mmu_gather_pending { >> u64 start; >> u64 end; >> struct mmu_gather_pending *next; >> } >> >> tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending >> (pointing to the linked list) and find whether there is any overlap. This >> would still require synchronization (acquiring a lock when allocating and >> deallocating or something fancier). > > We have an interval_tree for this, and yes, that's how far I got :/ > > The other thing I was thinking of is trying to detect overlap through > the page-tables themselves, but we have a distinct lack of storage > there. I tried to think about saving some generation info somewhere in the page-struct, but I could not come up with a reasonable solution that would not requite to traverse all the page tables again one the TLB flush is done. 
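Schematically, the per-request pattern attributed to mpm_worker above looks like the following; it is purely illustrative, with buffer sizing and error handling invented.

#include <stddef.h>
#include <sys/mman.h>
#include <sys/uio.h>

static void handle_request_like(int fd, size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec iov;

	if (buf == MAP_FAILED)
		return;

	/* ... build the response in buf ... */
	iov.iov_base = buf;
	iov.iov_len = len;
	writev(fd, &iov, 1);

	/* Every one of these munmap()s takes mmap_sem and goes through the
	 * TLB-flush batching path on the shared mm, once per request per
	 * worker thread. */
	munmap(buf, len);
}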
> The things is, if this threaded monster runs on all CPUs (busy front end
> server) and does a ton of invalidation due to all the short lived
> request crud, then all the extra invalidations will add up too. Having
> to do process (machine in this case) wide invalidations is expensive,
> having to do more of them surely isn't cheap either.
>
> So there might be something to win here.

Yes. I remember that these full TLB flushes leave their mark.

BTW: sometimes you don’t see the effect of these full TLB flushes as much in
VMs. I encountered a strange phenomenon at the time - INVLPG for an
arbitrary page caused my Haswell machine to flush the entire TLB, when the
INVLPG was issued inside a VM. It took me quite some time to analyze this
problem. Eventually Intel told me that’s part of what is called “page
fracturing” - if the host uses 4k pages in the EPT, they (usually) need to
flush the entire TLB for any INVLPG. That happens since they don’t know
the size of the flushed page.

I really need to finish my blog-post about it some day.
On Mon, May 13, 2019 at 10:36:06AM +0200, Peter Zijlstra wrote: > On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: > > It may be possible to avoid false-positive nesting indications (when the > > flushes do not overlap) by creating a new struct mmu_gather_pending, with > > something like: > > > > struct mmu_gather_pending { > > u64 start; > > u64 end; > > struct mmu_gather_pending *next; > > } > > > > tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending > > (pointing to the linked list) and find whether there is any overlap. This > > would still require synchronization (acquiring a lock when allocating and > > deallocating or something fancier). > > We have an interval_tree for this, and yes, that's how far I got :/ > > The other thing I was thinking of is trying to detect overlap through > the page-tables themselves, but we have a distinct lack of storage > there. We might just use some state in the pmd, there's still 2 _pt_pad_[12] in struct page to 'use'. So we could come up with some tlb generation scheme that would detect conflict.
> On May 13, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, May 13, 2019 at 10:36:06AM +0200, Peter Zijlstra wrote: >> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: >>> It may be possible to avoid false-positive nesting indications (when the >>> flushes do not overlap) by creating a new struct mmu_gather_pending, with >>> something like: >>> >>> struct mmu_gather_pending { >>> u64 start; >>> u64 end; >>> struct mmu_gather_pending *next; >>> } >>> >>> tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending >>> (pointing to the linked list) and find whether there is any overlap. This >>> would still require synchronization (acquiring a lock when allocating and >>> deallocating or something fancier). >> >> We have an interval_tree for this, and yes, that's how far I got :/ >> >> The other thing I was thinking of is trying to detect overlap through >> the page-tables themselves, but we have a distinct lack of storage >> there. > > We might just use some state in the pmd, there's still 2 _pt_pad_[12] in > struct page to 'use'. So we could come up with some tlb generation > scheme that would detect conflict. It is rather easy to come up with a scheme (and I did similar things) if you flush the table while you hold the page-tables lock. But if you batch across page-tables it becomes harder. Thinking about it while typing, perhaps it is simpler than I think - if you need to flush range that runs across more than a single table, you are very likely to flush a range of more than 33 entries, so anyhow you are likely to do a full TLB flush. So perhaps just avoiding the batching if only entries from a single table are flushed would be enough.
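The "single page-table" heuristic suggested above can be expressed as a one-line check on the range: if the whole range sits under one PTE table, skip the batching and flush under the page-table lock. A sketch, assuming 4K pages where one PTE table covers a 2M-aligned window; the shift value below is the common x86-64/arm64 one and is only illustrative.

#include <stdbool.h>
#include <stdint.h>

/* One PTE table covers 2^SKETCH_PMD_SHIFT bytes of address space
 * (2M with 4K pages, i.e. 512 entries). */
#define SKETCH_PMD_SHIFT 21

static bool fits_in_one_pte_table(uint64_t start, uint64_t end)
{
	return (start >> SKETCH_PMD_SHIFT) == ((end - 1) >> SKETCH_PMD_SHIFT);
}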
On Mon, May 13, 2019 at 09:21:01AM +0000, Nadav Amit wrote: > > On May 13, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> The other thing I was thinking of is trying to detect overlap through > >> the page-tables themselves, but we have a distinct lack of storage > >> there. > > > > We might just use some state in the pmd, there's still 2 _pt_pad_[12] in > > struct page to 'use'. So we could come up with some tlb generation > > scheme that would detect conflict. > > It is rather easy to come up with a scheme (and I did similar things) if you > flush the table while you hold the page-tables lock. But if you batch across > page-tables it becomes harder. Yeah; finding that out now. I keep finding holes :/ > Thinking about it while typing, perhaps it is simpler than I think - if you > need to flush range that runs across more than a single table, you are very > likely to flush a range of more than 33 entries, so anyhow you are likely to > do a full TLB flush. We can't rely on the 33, that x86 specific. Other architectures can have another (or no) limit on that. > So perhaps just avoiding the batching if only entries from a single table > are flushed would be enough. That's near to what Will suggested initially, just flush the entire thing when there's a conflict.
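For reference, the "33" is x86's tlb_single_page_flush_ceiling heuristic, roughly restated below from memory of arch/x86/mm/tlb.c; the exact condition and tunables vary by kernel version, and other architectures have different cutoffs or none at all.

#include <stdbool.h>
#include <stdint.h>

/* If a ranged flush would need more than this many single-page
 * invalidations, x86 prefers to flush the whole TLB instead. */
static unsigned long tlb_single_page_flush_ceiling = 33;

static bool prefer_full_flush(uint64_t start, uint64_t end,
			      unsigned int page_shift)
{
	return ((end - start) >> page_shift) > tlb_single_page_flush_ceiling;
}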
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote:
> BTW: sometimes you don’t see the effect of these full TLB flushes as much in
> VMs. I encountered a strange phenomenon at the time - INVLPG for an
> arbitrary page caused my Haswell machine to flush the entire TLB, when the
> INVLPG was issued inside a VM. It took me quite some time to analyze this
> problem. Eventually Intel told me that’s part of what is called “page
> fracturing” - if the host uses 4k pages in the EPT, they (usually) need to
> flush the entire TLB for any INVLPG. That happens since they don’t know
> the size of the flushed page.

Cute... if only they'd given us an interface to tell them... :-)
On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote: > > On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: > > > >>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a > >>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers > >>>>> have completed. > >>>>> > >>>>> This should not be too hard to make happen. > >>>> > >>>> This synchronization sounds much more expensive than what I proposed. But I > >>>> agree that cache-lines that move from one CPU to another might become an > >>>> issue. But I think that the scheme I suggested would minimize this overhead. > >>> > >>> Well, it would have a lot more unconditional atomic ops. My scheme only > >>> waits when there is actual concurrency. > >> > >> Well, something has to give. I didn’t think that if the same core does the > >> atomic op it would be too expensive. > > > > They're still at least 20 cycles a pop, uncontended. > > > >>> I _think_ something like the below ought to work, but its not even been > >>> near a compiler. The only problem is the unconditional wakeup; we can > >>> play games to avoid that if we want to continue with this. > >>> > >>> Ideally we'd only do this when there's been actual overlap, but I've not > >>> found a sensible way to detect that. > >>> > >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >>> index 4ef4bbe78a1d..b70e35792d29 100644 > >>> --- a/include/linux/mm_types.h > >>> +++ b/include/linux/mm_types.h > >>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) > >>> * > >>> * Therefore we must rely on tlb_flush_*() to guarantee order. > >>> */ > >>> - atomic_dec(&mm->tlb_flush_pending); > >>> + if (atomic_dec_and_test(&mm->tlb_flush_pending)) { > >>> + wake_up_var(&mm->tlb_flush_pending); > >>> + } else { > >>> + wait_event_var(&mm->tlb_flush_pending, > >>> + !atomic_read_acquire(&mm->tlb_flush_pending)); > >>> + } > >>> } > >> > >> It still seems very expensive to me, at least for certain workloads (e.g., > >> Apache with multithreaded MPM). > > > > Is that Apache-MPM workload triggering this lots? Having a known > > benchmark for this stuff is good for when someone has time to play with > > things. > > Setting Apache2 with mpm_worker causes every request to go through > mmap-writev-munmap flow on every thread. I didn’t run this workload after > the patches that downgrade the mmap_sem to read before the page-table > zapping were introduced. I presume these patches would allow the page-table > zapping to be done concurrently, and therefore would hit this flow. Hmm, I don't think so: munmap() still has to take the semaphore for write initially, so it will be serialised against other munmap() threads even after they've downgraded afaict. The initial bug report was about concurrent madvise() vs munmap(). Will
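A standalone model of the locking pattern Will describes, using a pthread rwlock to stand in for mmap_sem: a second munmap() still serialises on the write lock, while MADV_DONTNEED only ever takes the read lock and can therefore overlap with the zap. Pthreads has no atomic downgrade, so the unlock/rdlock pair below only approximates downgrade_write(), and all names are made up.

#include <pthread.h>

static pthread_rwlock_t mmap_sem_model = PTHREAD_RWLOCK_INITIALIZER;

static void model_munmap(void)
{
	pthread_rwlock_wrlock(&mmap_sem_model);	/* VMA tree update: exclusive */
	/* detach_vmas_to_be_unmapped() */
	pthread_rwlock_unlock(&mmap_sem_model);	/* "downgrade" (model only) */
	pthread_rwlock_rdlock(&mmap_sem_model);

	/* unmap_region(): zapping PTEs and free_pgtables() now run with only
	 * the read lock held, so they can overlap with the madvise() below on
	 * another thread, but not with another model_munmap(), which is still
	 * stuck at the wrlock above. */

	pthread_rwlock_unlock(&mmap_sem_model);
}

static void model_madvise_dontneed(void)
{
	pthread_rwlock_rdlock(&mmap_sem_model);
	/* zap_page_range() */
	pthread_rwlock_unlock(&mmap_sem_model);
}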
> On May 13, 2019, at 9:37 AM, Will Deacon <will.deacon@arm.com> wrote: > > On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote: >>> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: >>> >>>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a >>>>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers >>>>>>> have completed. >>>>>>> >>>>>>> This should not be too hard to make happen. >>>>>> >>>>>> This synchronization sounds much more expensive than what I proposed. But I >>>>>> agree that cache-lines that move from one CPU to another might become an >>>>>> issue. But I think that the scheme I suggested would minimize this overhead. >>>>> >>>>> Well, it would have a lot more unconditional atomic ops. My scheme only >>>>> waits when there is actual concurrency. >>>> >>>> Well, something has to give. I didn’t think that if the same core does the >>>> atomic op it would be too expensive. >>> >>> They're still at least 20 cycles a pop, uncontended. >>> >>>>> I _think_ something like the below ought to work, but its not even been >>>>> near a compiler. The only problem is the unconditional wakeup; we can >>>>> play games to avoid that if we want to continue with this. >>>>> >>>>> Ideally we'd only do this when there's been actual overlap, but I've not >>>>> found a sensible way to detect that. >>>>> >>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>>>> index 4ef4bbe78a1d..b70e35792d29 100644 >>>>> --- a/include/linux/mm_types.h >>>>> +++ b/include/linux/mm_types.h >>>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) >>>>> * >>>>> * Therefore we must rely on tlb_flush_*() to guarantee order. >>>>> */ >>>>> - atomic_dec(&mm->tlb_flush_pending); >>>>> + if (atomic_dec_and_test(&mm->tlb_flush_pending)) { >>>>> + wake_up_var(&mm->tlb_flush_pending); >>>>> + } else { >>>>> + wait_event_var(&mm->tlb_flush_pending, >>>>> + !atomic_read_acquire(&mm->tlb_flush_pending)); >>>>> + } >>>>> } >>>> >>>> It still seems very expensive to me, at least for certain workloads (e.g., >>>> Apache with multithreaded MPM). >>> >>> Is that Apache-MPM workload triggering this lots? Having a known >>> benchmark for this stuff is good for when someone has time to play with >>> things. >> >> Setting Apache2 with mpm_worker causes every request to go through >> mmap-writev-munmap flow on every thread. I didn’t run this workload after >> the patches that downgrade the mmap_sem to read before the page-table >> zapping were introduced. I presume these patches would allow the page-table >> zapping to be done concurrently, and therefore would hit this flow. > > Hmm, I don't think so: munmap() still has to take the semaphore for write > initially, so it will be serialised against other munmap() threads even > after they've downgraded afaict. > > The initial bug report was about concurrent madvise() vs munmap(). I guess you are right (and I’m wrong). Short search suggests that ebizzy might be affected (a thread by Mel Gorman): https://lkml.org/lkml/2015/2/2/493
> On May 13, 2019, at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, May 13, 2019 at 09:21:01AM +0000, Nadav Amit wrote: >>> On May 13, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >>>> The other thing I was thinking of is trying to detect overlap through >>>> the page-tables themselves, but we have a distinct lack of storage >>>> there. >>> >>> We might just use some state in the pmd, there's still 2 _pt_pad_[12] in >>> struct page to 'use'. So we could come up with some tlb generation >>> scheme that would detect conflict. >> >> It is rather easy to come up with a scheme (and I did similar things) if you >> flush the table while you hold the page-tables lock. But if you batch across >> page-tables it becomes harder. > > Yeah; finding that out now. I keep finding holes :/ You can use Uhlig’s dissertation for inspiration (Section 4.4). [1] https://www.researchgate.net/publication/36450482_Scalability_of_microkernel-based_systems/download > >> Thinking about it while typing, perhaps it is simpler than I think - if you >> need to flush range that runs across more than a single table, you are very >> likely to flush a range of more than 33 entries, so anyhow you are likely to >> do a full TLB flush. > > We can't rely on the 33, that x86 specific. Other architectures can have > another (or no) limit on that. I wonder whether there are architectures that do no invalidate the TLB entry by entry, experiencing these kind of overheads. >> So perhaps just avoiding the batching if only entries from a single table >> are flushed would be enough. > > That's near to what Will suggested initially, just flush the entire > thing when there's a conflict. One question is how you define a conflict. IIUC, Will suggests same mm marks a conflict. In addition, I suggest that if you only remove a single entry (or few ones), you would just not batch and do the flushing while holding the page-table lock.
On Mon, May 13, 2019 at 05:06:03PM +0000, Nadav Amit wrote: > > On May 13, 2019, at 9:37 AM, Will Deacon <will.deacon@arm.com> wrote: > > > > On Mon, May 13, 2019 at 09:11:38AM +0000, Nadav Amit wrote: > >>> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >>> > >>> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote: > >>> > >>>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a > >>>>>>> concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers > >>>>>>> have completed. > >>>>>>> > >>>>>>> This should not be too hard to make happen. > >>>>>> > >>>>>> This synchronization sounds much more expensive than what I proposed. But I > >>>>>> agree that cache-lines that move from one CPU to another might become an > >>>>>> issue. But I think that the scheme I suggested would minimize this overhead. > >>>>> > >>>>> Well, it would have a lot more unconditional atomic ops. My scheme only > >>>>> waits when there is actual concurrency. > >>>> > >>>> Well, something has to give. I didn???t think that if the same core does the > >>>> atomic op it would be too expensive. > >>> > >>> They're still at least 20 cycles a pop, uncontended. > >>> > >>>>> I _think_ something like the below ought to work, but its not even been > >>>>> near a compiler. The only problem is the unconditional wakeup; we can > >>>>> play games to avoid that if we want to continue with this. > >>>>> > >>>>> Ideally we'd only do this when there's been actual overlap, but I've not > >>>>> found a sensible way to detect that. > >>>>> > >>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >>>>> index 4ef4bbe78a1d..b70e35792d29 100644 > >>>>> --- a/include/linux/mm_types.h > >>>>> +++ b/include/linux/mm_types.h > >>>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) > >>>>> * > >>>>> * Therefore we must rely on tlb_flush_*() to guarantee order. > >>>>> */ > >>>>> - atomic_dec(&mm->tlb_flush_pending); > >>>>> + if (atomic_dec_and_test(&mm->tlb_flush_pending)) { > >>>>> + wake_up_var(&mm->tlb_flush_pending); > >>>>> + } else { > >>>>> + wait_event_var(&mm->tlb_flush_pending, > >>>>> + !atomic_read_acquire(&mm->tlb_flush_pending)); > >>>>> + } > >>>>> } > >>>> > >>>> It still seems very expensive to me, at least for certain workloads (e.g., > >>>> Apache with multithreaded MPM). > >>> > >>> Is that Apache-MPM workload triggering this lots? Having a known > >>> benchmark for this stuff is good for when someone has time to play with > >>> things. > >> > >> Setting Apache2 with mpm_worker causes every request to go through > >> mmap-writev-munmap flow on every thread. I didn???t run this workload after > >> the patches that downgrade the mmap_sem to read before the page-table > >> zapping were introduced. I presume these patches would allow the page-table > >> zapping to be done concurrently, and therefore would hit this flow. > > > > Hmm, I don't think so: munmap() still has to take the semaphore for write > > initially, so it will be serialised against other munmap() threads even > > after they've downgraded afaict. > > > > The initial bug report was about concurrent madvise() vs munmap(). > > I guess you are right (and I???m wrong). 
>
> Short search suggests that ebizzy might be affected (a thread by Mel
> Gorman): https://lkml.org/lkml/2015/2/2/493
>

Glibc has since been fixed to be less munmap/mmap intensive, and the
system CPU usage of ebizzy is generally negligible unless it is
specifically configured to use mmap/munmap instead of malloc/free,
which is unrealistic for well-behaved applications.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..9fd5272 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,11 +249,12 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 	 * flush by batching, a thread has stable TLB entry can fail to flush
 	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
 	 * forcefully if we detect parallel PTE batching threads.
+	 *
+	 * munmap() may change mapping under non-exclusive lock and also free
+	 * page tables. Do not call __tlb_reset_range() for it.
 	 */
-	if (mm_tlb_flush_nested(tlb->mm)) {
-		__tlb_reset_range(tlb);
+	if (mm_tlb_flush_nested(tlb->mm))
 		__tlb_adjust_range(tlb, start, end - start);
-	}
 
 	tlb_flush_mmu(tlb);