Message ID | 1557444414-12090-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
---|---
State | New, archived
Series | [v2] mm: mmu_gather: remove __tlb_reset_range() for force flush
On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1..469492d 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>  {
>          /*
>           * If there are parallel threads are doing PTE changes on same range
> -         * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -         * flush by batching, a thread has stable TLB entry can fail to flush
> -         * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -         * forcefully if we detect parallel PTE batching threads.
> +         * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> +         * flush by batching, one thread may end up seeing inconsistent PTEs
> +         * and result in having stale TLB entries. So flush TLB forcefully
> +         * if we detect parallel PTE batching threads.
> +         *
> +         * However, some syscalls, e.g. munmap(), may free page tables, this
> +         * needs force flush everything in the given range. Otherwise this
> +         * may result in having stale TLB entries for some architectures,
> +         * e.g. aarch64, that could specify flush what level TLB.
>           */
> -        if (mm_tlb_flush_nested(tlb->mm)) {
> -                __tlb_reset_range(tlb);
> -                __tlb_adjust_range(tlb, start, end - start);
> +        if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> +                /*
> +                 * Since we can't tell what we actually should have
> +                 * flushed, flush everything in the given range.
> +                 */
> +                tlb->freed_tables = 1;
> +                tlb->cleared_ptes = 1;
> +                tlb->cleared_pmds = 1;
> +                tlb->cleared_puds = 1;
> +                tlb->cleared_p4ds = 1;
> +
> +                /*
> +                 * Some architectures, e.g. ARM, that have range invalidation
> +                 * and care about VM_EXEC for I-Cache invalidation, need force
> +                 * vma_exec set.
> +                 */
> +                tlb->vma_exec = 1;
> +
> +                /* Force vma_huge clear to guarantee safer flush */
> +                tlb->vma_huge = 0;
> +
> +                tlb->start = start;
> +                tlb->end = end;
>          }

Whilst I think this is correct, it would be interesting to see whether
or not it's actually faster than just nuking the whole mm, as I mentioned
before.

At least in terms of getting a short-term fix, I'd prefer the diff below
if it's not measurably worse.

Will

--->8

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1dd273..cc251422d307 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
          * forcefully if we detect parallel PTE batching threads.
          */
         if (mm_tlb_flush_nested(tlb->mm)) {
+                tlb->fullmm = 1;
                 __tlb_reset_range(tlb);
-                __tlb_adjust_range(tlb, start, end - start);
+                tlb->freed_tables = 1;
         }

         tlb_flush_mmu(tlb);
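(For readers wondering why "what level TLB" matters here: the sketch below is a
simplified, hypothetical illustration, not the actual arm64 tlb_flush(), of how
a range-invalidation architecture might consume the mmu_gather state.
example_arch_flush_tlb_range() is a made-up helper. The point is that
freed_tables and the cleared_* bits decide whether a leaf-only invalidation at
a given stride is enough, and that is exactly the information the old
__tlb_reset_range() call threw away.)

/*
 * Hypothetical sketch: how a range-invalidation architecture might use the
 * mmu_gather state when flushing.  example_arch_flush_tlb_range() is a
 * made-up helper, not a real kernel API.
 */
static inline void example_arch_tlb_flush(struct mmu_gather *tlb)
{
        /* A leaf-only invalidation is safe only if no page tables were freed. */
        bool last_level = !tlb->freed_tables;

        /* Invalidate by the smallest granule that had entries cleared. */
        unsigned long stride = PAGE_SIZE;

        if (!tlb->cleared_ptes) {
                if (tlb->cleared_pmds)
                        stride = PMD_SIZE;
                else if (tlb->cleared_puds)
                        stride = PUD_SIZE;
        }

        example_arch_flush_tlb_range(tlb->mm, tlb->start, tlb->end,
                                     stride, last_level);
}

With all of these bits reset by a racing thread, such an architecture would
emit a leaf-only, page-stride invalidation even though page tables had been
freed, which is the stale-entry problem the patch is addressing.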
On 5/13/19 9:38 AM, Will Deacon wrote:
> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 99740e1..469492d 100644
>> [...]
> Whilst I think this is correct, it would be interesting to see whether
> or not it's actually faster than just nuking the whole mm, as I mentioned
> before.
>
> At least in terms of getting a short-term fix, I'd prefer the diff below
> if it's not measurably worse.

I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM;
it shows a slight slowdown in records/s but much more sys time spent with the
fullmm flush. The data is below:

                     nofullmm    fullmm
    ops (records/s)    225606    225119
    sys (s)              0.69      1.14

It looks like the slight reduction in records/s is caused by the increase in
sys time.

> Will
>
> --->8
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..cc251422d307 100644
> [...]
On Mon, May 13, 2019 at 05:38:04PM +0100, Will Deacon wrote:
> On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1..469492d 100644
> > [...]
> Whilst I think this is correct, it would be interesting to see whether
> or not it's actually faster than just nuking the whole mm, as I mentioned
> before.
>
> At least in terms of getting a short-term fix, I'd prefer the diff below
> if it's not measurably worse.

So what's the point? General paranoia? Either change should allow PPC to get
rid of its magic mushrooms; the diff below would be a little bit easier for
them because they already do the full invalidate correctly.

> --->8
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99740e1dd273..cc251422d307 100644
> [...]
On Tue, May 14, 2019 at 01:52:23PM +0200, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 05:38:04PM +0100, Will Deacon wrote:
> > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > > index 99740e1..469492d 100644
> > > [...]
> > Whilst I think this is correct, it would be interesting to see whether
> > or not it's actually faster than just nuking the whole mm, as I mentioned
> > before.
> >
> > At least in terms of getting a short-term fix, I'd prefer the diff below
> > if it's not measurably worse.
>
> So what's the point? General paranoia? Either change should allow PPC to get
> rid of its magic mushrooms; the diff below would be a little bit easier for
> them because they already do the full invalidate correctly.

Right; a combination of paranoia (we'd need to remember to update this code to
"flush everything" if we add new fields to the gather structure), but I also
expected the performance to be better on arm64, where having two CPUs spamming
TLBI messages at the same time is likely to suck.

I'm super confused about the system time being reported as higher with this
change. That's really not what I expected.

Will
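(For context on "new fields to the gather structure": below is an abridged
sketch of the flags in struct mmu_gather around the v5.1 timeframe, limited to
the fields named in this thread; the real structure also carries the
page-batching machinery, which is omitted here.)

/* Abridged sketch of struct mmu_gather (circa v5.1), flags only. */
struct mmu_gather {
        struct mm_struct *mm;

        unsigned long   start;
        unsigned long   end;

        /* Flush the entire address space? */
        unsigned int    fullmm : 1;

        /* Page-table pages themselves were freed (walk caches must go too). */
        unsigned int    freed_tables : 1;

        /* Which levels had entries cleared (lets archs pick the stride). */
        unsigned int    cleared_ptes : 1;
        unsigned int    cleared_pmds : 1;
        unsigned int    cleared_puds : 1;
        unsigned int    cleared_p4ds : 1;

        /* VMA properties some archs key their invalidation on. */
        unsigned int    vma_exec : 1;
        unsigned int    vma_huge : 1;

        /* ... page batching state omitted ... */
};

The v2 patch sets each of these flags by hand in the nested-flush path, which
is why adding a new field later would mean remembering to update that path;
the fullmm variant sidesteps the problem by degrading to a whole-mm flush.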
On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
> On 5/13/19 9:38 AM, Will Deacon wrote:
> > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> > > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > > index 99740e1..469492d 100644
> > > [...]
> > Whilst I think this is correct, it would be interesting to see whether
> > or not it's actually faster than just nuking the whole mm, as I mentioned
> > before.
> >
> > At least in terms of getting a short-term fix, I'd prefer the diff below
> > if it's not measurably worse.
>
> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM;
> it shows a slight slowdown in records/s but much more sys time spent with
> the fullmm flush. The data is below:
>
>                      nofullmm    fullmm
>     ops (records/s)    225606    225119
>     sys (s)              0.69      1.14
>
> It looks like the slight reduction in records/s is caused by the increase
> in sys time.

That's not what I expected, and I'm unable to explain why moving to fullmm
would /increase/ the system time. I would've thought the time spent doing
the invalidation would decrease, with the downside that the TLB is cold
when returning back to userspace.

FWIW, I ran 10 iterations of ebizzy on my arm64 box using a vanilla 5.1
kernel and the numbers are all over the place (see below). I think
deducing anything meaningful from this benchmark will be a challenge.

Will

--->8

records/s    real (s)    user (s)    sys (s)
   306090      10.00     1227.55       0.54
   323547      10.00     1262.95       0.82
   409148      10.00     1266.54       0.94
   341507      10.00     1263.49       0.66
   375910      10.00     1259.87       0.82
   376152      10.00     1265.76       0.96
   358862      10.00     1251.13       0.72
   358164      10.00     1243.48       0.85
   332148      10.00     1260.93       0.70
   367021      10.00     1264.06       1.43
On 5/14/19 7:54 AM, Will Deacon wrote:
> On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
>> On 5/13/19 9:38 AM, Will Deacon wrote:
>>> [...]
>> I did a quick test with ebizzy (96 threads with 5 iterations) on my x86
>> VM; it shows a slight slowdown in records/s but much more sys time spent
>> with the fullmm flush.
>> [...]
> That's not what I expected, and I'm unable to explain why moving to fullmm
> would /increase/ the system time. I would've thought the time spent doing
> the invalidation would decrease, with the downside that the TLB is cold
> when returning back to userspace.
>
> FWIW, I ran 10 iterations of ebizzy on my arm64 box using a vanilla 5.1
> kernel and the numbers are all over the place (see below). I think
> deducing anything meaningful from this benchmark will be a challenge.

Yes, it looks so. What other benchmark do you suggest?

> Will
>
> --->8
>
> [...]
----- Original Message -----
> On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
> > On 5/13/19 9:38 AM, Will Deacon wrote:
> > > [...]
> > I did a quick test with ebizzy (96 threads with 5 iterations) on my x86
> > VM; it shows a slight slowdown in records/s but much more sys time spent
> > with the fullmm flush.
> > [...]
> That's not what I expected, and I'm unable to explain why moving to fullmm
> would /increase/ the system time. I would've thought the time spent doing
> the invalidation would decrease, with the downside that the TLB is cold
> when returning back to userspace.

I tried ebizzy with various parameters (malloc vs mmap, ran it for an hour),
but performance was very similar for both patches.

So, I was looking for a workload that would demonstrate the largest
difference.

Inspired by python xml-rpc, which can handle each request in a new thread,
I tried the following [1]:

16 threads, each looping 100k times over:
  mmap(16M)
  touch 1 page
  madvise(DONTNEED)
  munmap(16M)

This yields quite a significant difference between the 2 patches when running
on my 46 CPU arm host. I checked it twice - applied patch, recompiled,
rebooted - but the numbers stayed the same, +- a couple of seconds.

Does it somewhat match your expectation?

v2 patch
---------
real    2m33.460s    user    0m3.359s    sys    15m32.307s
real    2m33.895s    user    0m2.749s    sys    16m34.500s
real    2m35.666s    user    0m3.528s    sys    15m23.377s
real    2m32.898s    user    0m2.789s    sys    16m18.801s
real    2m33.087s    user    0m3.565s    sys    16m23.815s

fullmm version
---------------
real    0m46.811s    user    0m1.596s    sys    1m47.500s
real    0m47.322s    user    0m1.803s    sys    1m48.449s
real    0m46.668s    user    0m1.508s    sys    1m47.352s
real    0m46.742s    user    0m2.007s    sys    1m47.217s
real    0m46.948s    user    0m1.785s    sys    1m47.906s

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap8.c
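(For reference, a minimal sketch of the workload shape described above; this
is only an illustration of the loop, not the linked mmap8.c reproducer that
produced the numbers.)

/*
 * Minimal sketch of the described workload: 16 threads, each looping over
 * mmap(16M) -> touch one page -> madvise(MADV_DONTNEED) -> munmap(16M).
 * Build with: cc -O2 -pthread mmap_stress.c
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS   16
#define ITERATIONS 100000
#define MAP_LEN    (16UL * 1024 * 1024)

static void *worker(void *arg)
{
        for (int i = 0; i < ITERATIONS; i++) {
                char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        exit(1);

                p[0] = 1;                       /* touch 1 page */
                madvise(p, MAP_LEN, MADV_DONTNEED);
                munmap(p, MAP_LEN);
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(threads[i], NULL);
        return 0;
}

The madvise(DONTNEED) and munmap() calls from different threads are what make
tlb_finish_mmu() take the nested-flush path under discussion.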
On 5/16/19 11:29 PM, Jan Stancek wrote:
>
> ----- Original Message -----
>> On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote:
>>> [...]
>
> I tried ebizzy with various parameters (malloc vs mmap, ran it for an hour),
> but performance was very similar for both patches.
>
> So, I was looking for a workload that would demonstrate the largest
> difference.
>
> Inspired by python xml-rpc, which can handle each request in a new thread,
> I tried the following [1]:
>
> 16 threads, each looping 100k times over:
>   mmap(16M)
>   touch 1 page
>   madvise(DONTNEED)
>   munmap(16M)
>
> This yields quite a significant difference between the 2 patches when
> running on my 46 CPU arm host. I checked it twice - applied patch,
> recompiled, rebooted - but the numbers stayed the same, +- a couple of
> seconds.

Thanks for the testing. I'm a little bit surprised by the significant
difference. I did the same test on my x86 VM (24 cores); it yields almost
the same numbers for both versions.

Given the significant improvement on arm64 with the fullmm version, I'm going
to respin the patch.

> Does it somewhat match your expectation?
>
> v2 patch
> ---------
> [...]
>
> fullmm version
> ---------------
> [...]
>
> [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap8.c
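(The respun patch itself is not part of this thread; based on Will's
short-term diff earlier in the thread, the resulting check in
tlb_finish_mmu() would presumably look something like the sketch below.)

        /*
         * Sketch of the fullmm-based variant suggested above: on detecting a
         * concurrent PTE-batching thread, degrade to a full address-space
         * flush and mark page tables as freed, so every architecture
         * invalidates at all levels.  fullmm is set before
         * __tlb_reset_range() so the reset takes the full-mm path rather
         * than producing an empty range.
         */
        if (mm_tlb_flush_nested(tlb->mm)) {
                tlb->fullmm = 1;
                __tlb_reset_range(tlb);
                tlb->freed_tables = 1;
        }

        tlb_flush_mmu(tlb);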
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..469492d 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 {
         /*
          * If there are parallel threads are doing PTE changes on same range
-         * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-         * flush by batching, a thread has stable TLB entry can fail to flush
-         * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-         * forcefully if we detect parallel PTE batching threads.
+         * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+         * flush by batching, one thread may end up seeing inconsistent PTEs
+         * and result in having stale TLB entries. So flush TLB forcefully
+         * if we detect parallel PTE batching threads.
+         *
+         * However, some syscalls, e.g. munmap(), may free page tables, this
+         * needs force flush everything in the given range. Otherwise this
+         * may result in having stale TLB entries for some architectures,
+         * e.g. aarch64, that could specify flush what level TLB.
          */
-        if (mm_tlb_flush_nested(tlb->mm)) {
-                __tlb_reset_range(tlb);
-                __tlb_adjust_range(tlb, start, end - start);
+        if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
+                /*
+                 * Since we can't tell what we actually should have
+                 * flushed, flush everything in the given range.
+                 */
+                tlb->freed_tables = 1;
+                tlb->cleared_ptes = 1;
+                tlb->cleared_pmds = 1;
+                tlb->cleared_puds = 1;
+                tlb->cleared_p4ds = 1;
+
+                /*
+                 * Some architectures, e.g. ARM, that have range invalidation
+                 * and care about VM_EXEC for I-Cache invalidation, need force
+                 * vma_exec set.
+                 */
+                tlb->vma_exec = 1;
+
+                /* Force vma_huge clear to guarantee safer flush */
+                tlb->vma_huge = 0;
+
+                tlb->start = start;
+                tlb->end = end;
         }

         tlb_flush_mmu(tlb);