diff mbox

mremap: Increase LATENCY_LIMIT of mremap to reduce the number of TLB shootdowns

Message ID 20180606140255.br5ztpeqdmwfto47@techsingularity.net (mailing list archive)
State New, archived
Headers show

Commit Message

Mel Gorman June 6, 2018, 2:02 p.m. UTC
Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning")
fixed races between mremap and other operations for both file-backed and
anonymous mappings. The file-backed was the most critical as it allowed the
possibility that data could be changed on a physical page after page_mkclean
returned which could trigger data loss or data integrity issues. A customer
reported that the cost of the TLBs for anonymous regressions was excessive
and resulting in a 30-50% drop in performance overall since this commit
on a microbenchmark. Unfortunately I neither have access to the test-case
nor can I describe what it does other than saying that mremap operations
dominate heavily.

This patch increases the LATENCY_LIMIT to handle TLB flushes on a
PMD boundary instead of every 64 pages. This reduces the number of TLB
shootdowns by a factor of 8 which is not reported to completely restore
performance but gets it within an acceptable percentage. The given metric
here is simply described as "higher is better".

Baseline that was known good
002:  Metric:       91.05
004:  Metric:      109.45
008:  Metric:       73.08
016:  Metric:       58.14
032:  Metric:       61.09
064:  Metric:       57.76
128:  Metric:       55.43

Current
001:  Metric:       54.98
002:  Metric:       56.56
004:  Metric:       41.22
008:  Metric:       35.96
016:  Metric:       36.45
032:  Metric:       35.71
064:  Metric:       35.73
128:  Metric:       34.96

With patch
001:  Metric:       61.43
002:  Metric:       81.64
004:  Metric:       67.92
008:  Metric:       51.67
016:  Metric:       50.47
032:  Metric:       52.29
064:  Metric:       50.01
128:  Metric:       49.04

So for low threads, it's not restored but for larger number of threads,
it's closer to the "known good" baseline. The downside is that PTL lock
hold times will be slightly higher but it's unlikely that an mremap and
another operation will contend on the same PMD. This is the first time I
encountered a realistic workload that was mremap intensive (thousands of
calls per second with small ranges dominating).

Using a different mremap-intensive workload that is not representative of
the real workload there is little difference observed outside of noise in
the headline metrics However, the TLB shootdowns are reduced by 11% on
average and at the peak, TLB shootdowns were reduced by 21%. Interrupts
were sampled every second while the workload ran to get those figures.
It's known that the figures will vary as the non-representative load is
non-deterministic.

An alternative patch was posted that should have significantly reduced the
TLB flushes but unfortunately it does not perform as well as this version
on the customer test case. If revisited, the two patches can stack on top
of each other.

Signed-off-by: Mel Gorman <mgorman@suse.com>
---
 mm/mremap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Nadav Amit June 6, 2018, 3:55 p.m. UTC | #1
Mel Gorman <mgorman@techsingularity.net> wrote:

> Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning")
> fixed races between mremap and other operations for both file-backed and
> anonymous mappings. The file-backed was the most critical as it allowed the
> possibility that data could be changed on a physical page after page_mkclean
> returned which could trigger data loss or data integrity issues. A customer
> reported that the cost of the TLBs for anonymous regressions was excessive
> and resulting in a 30-50% drop in performance overall since this commit
> on a microbenchmark. Unfortunately I neither have access to the test-case
> nor can I describe what it does other than saying that mremap operations
> dominate heavily.
> 
> This patch increases the LATENCY_LIMIT to handle TLB flushes on a
> PMD boundary instead of every 64 pages. This reduces the number of TLB
> shootdowns by a factor of 8 which is not reported to completely restore
> performance but gets it within an acceptable percentage. The given metric
> here is simply described as "higher is better".
> 
> Baseline that was known good
> 002:  Metric:       91.05
> 004:  Metric:      109.45
> 008:  Metric:       73.08
> 016:  Metric:       58.14
> 032:  Metric:       61.09
> 064:  Metric:       57.76
> 128:  Metric:       55.43
> 
> Current
> 001:  Metric:       54.98
> 002:  Metric:       56.56
> 004:  Metric:       41.22
> 008:  Metric:       35.96
> 016:  Metric:       36.45
> 032:  Metric:       35.71
> 064:  Metric:       35.73
> 128:  Metric:       34.96
> 
> With patch
> 001:  Metric:       61.43
> 002:  Metric:       81.64
> 004:  Metric:       67.92
> 008:  Metric:       51.67
> 016:  Metric:       50.47
> 032:  Metric:       52.29
> 064:  Metric:       50.01
> 128:  Metric:       49.04
> 
> So for low threads, it's not restored but for larger number of threads,
> it's closer to the "known good" baseline. The downside is that PTL lock
> hold times will be slightly higher but it's unlikely that an mremap and
> another operation will contend on the same PMD. This is the first time I
> encountered a realistic workload that was mremap intensive (thousands of
> calls per second with small ranges dominating).
> 
> Using a different mremap-intensive workload that is not representative of
> the real workload there is little difference observed outside of noise in
> the headline metrics However, the TLB shootdowns are reduced by 11% on
> average and at the peak, TLB shootdowns were reduced by 21%. Interrupts
> were sampled every second while the workload ran to get those figures.
> It's known that the figures will vary as the non-representative load is
> non-deterministic.
> 
> An alternative patch was posted that should have significantly reduced the
> TLB flushes but unfortunately it does not perform as well as this version
> on the customer test case. If revisited, the two patches can stack on top
> of each other.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.com>
> ---
> mm/mremap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 049470aa1e3e..b5017cb2e1e9 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,7 +191,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> 		drop_rmap_locks(vma);
> }
> 
> -#define LATENCY_LIMIT	(64 * PAGE_SIZE)
> +#define LATENCY_LIMIT	(PMD_SIZE)
> 
> unsigned long move_page_tables(struct vm_area_struct *vma,
> 		unsigned long old_addr, struct vm_area_struct *new_vma,

This LATENCY_LIMIT is only used in move_page_tables() in the following
manner:

  next = (new_addr + PMD_SIZE) & PMD_MASK;
  if (extent > next - new_addr)
      extent = next - new_addr;
  if (extent > LATENCY_LIMIT)
      extent = LATENCY_LIMIT;
   
If LATENCY_LIMIT is to be changed to PMD_SIZE, then IIUC the last condition
is not required, and LATENCY_LIMIT can just be removed (assuming there is no
underflow case that hides somewhere).

No?
Mel Gorman June 6, 2018, 5:47 p.m. UTC | #2
On Wed, Jun 06, 2018 at 08:55:15AM -0700, Nadav Amit wrote:
> > -#define LATENCY_LIMIT	(64 * PAGE_SIZE)
> > +#define LATENCY_LIMIT	(PMD_SIZE)
> > 
> > unsigned long move_page_tables(struct vm_area_struct *vma,
> > 		unsigned long old_addr, struct vm_area_struct *new_vma,
> 
> This LATENCY_LIMIT is only used in move_page_tables() in the following
> manner:
> 
>   next = (new_addr + PMD_SIZE) & PMD_MASK;
>   if (extent > next - new_addr)
>       extent = next - new_addr;
>   if (extent > LATENCY_LIMIT)
>       extent = LATENCY_LIMIT;
>    
> If LATENCY_LIMIT is to be changed to PMD_SIZE, then IIUC the last condition
> is not required, and LATENCY_LIMIT can just be removed (assuming there is no
> underflow case that hides somewhere).
> 

I see no problem removing it other than we may forget that we ever limited
PTE lock hold times for any reason. I'm skeptical it will matter unless
mremap-intensive workloads are a lot more common than I believe.
Nadav Amit June 6, 2018, 6:20 p.m. UTC | #3
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Jun 06, 2018 at 08:55:15AM -0700, Nadav Amit wrote:
>>> -#define LATENCY_LIMIT	(64 * PAGE_SIZE)
>>> +#define LATENCY_LIMIT	(PMD_SIZE)
>>> 
>>> unsigned long move_page_tables(struct vm_area_struct *vma,
>>> 		unsigned long old_addr, struct vm_area_struct *new_vma,
>> 
>> This LATENCY_LIMIT is only used in move_page_tables() in the following
>> manner:
>> 
>>  next = (new_addr + PMD_SIZE) & PMD_MASK;
>>  if (extent > next - new_addr)
>>      extent = next - new_addr;
>>  if (extent > LATENCY_LIMIT)
>>      extent = LATENCY_LIMIT;
>> 
>> If LATENCY_LIMIT is to be changed to PMD_SIZE, then IIUC the last condition
>> is not required, and LATENCY_LIMIT can just be removed (assuming there is no
>> underflow case that hides somewhere).
> 
> I see no problem removing it other than we may forget that we ever limited
> PTE lock hold times for any reason. I'm skeptical it will matter unless
> mremap-intensive workloads are a lot more common than I believe.

I have no opinion regarding the behavior change. It is just that code with
no effect is oftentimes confusing. A comment (if needed) can replace the
code, and git history would provide how it was once supported.
diff mbox

Patch

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..b5017cb2e1e9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,7 +191,7 @@  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		drop_rmap_locks(vma);
 }
 
-#define LATENCY_LIMIT	(64 * PAGE_SIZE)
+#define LATENCY_LIMIT	(PMD_SIZE)
 
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,