
[0/5] Speed up mremap on large regions

Message ID: 20200930222130.4175584-1-kaleshsingh@google.com

Message

Kalesh Singh Sept. 30, 2020, 10:21 p.m. UTC
mremap time can be optimized by moving entries at the PMD/PUD level if
the source and destination addresses are PMD/PUD-aligned and
PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
x86. Other architectures where this type of move is supported and known to
be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
and HAVE_MOVE_PUD.

Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
region on x86 and arm64:

    - HAVE_MOVE_PMD is already enabled on x86 : N/A
    - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up

    - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
    - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up

          Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
          give a total of ~150x speed up on arm64.
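
As a rough illustration of how such a measurement can be reproduced from
userspace (this is a hedged sketch, not the selftest added in patch 1; the
over-allocation trick below is just one way to obtain a PUD-aligned region
on a 4K-page kernel):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE_1G	(1UL << 30)

/* Map an anonymous region whose start address is 'align'-aligned. */
static void *mmap_aligned(size_t size, size_t align)
{
	char *p = mmap(NULL, size + align, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t start;

	if (p == MAP_FAILED)
		return MAP_FAILED;
	start = ((uintptr_t)p + align - 1) & ~(align - 1);
	/* Trim the unaligned head and tail (EINVAL on zero length is ignored). */
	munmap(p, start - (uintptr_t)p);
	munmap((char *)start + size, (uintptr_t)p + align - start);
	return (void *)start;
}

int main(void)
{
	size_t len = SIZE_1G;
	void *src = mmap_aligned(len, SIZE_1G);
	void *dst = mmap_aligned(len, SIZE_1G);	/* reserve an aligned target */
	struct timespec t0, t1;
	void *moved;

	if (src == MAP_FAILED || dst == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(src, 0x5a, len);		/* fault in pages, build page tables */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* MREMAP_FIXED replaces whatever is currently mapped at 'dst'. */
	moved = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (moved == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("1GB mremap took %ld us\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L);
	return 0;
}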


Kalesh Singh (5):
  kselftests: vm: Add mremap tests
  arm64: mremap speedup - Enable HAVE_MOVE_PMD
  mm: Speedup mremap on 1GB or larger regions
  arm64: mremap speedup - Enable HAVE_MOVE_PUD
  x86: mremap speedup - Enable HAVE_MOVE_PUD

 arch/Kconfig                             |   7 +
 arch/arm64/Kconfig                       |   2 +
 arch/arm64/include/asm/pgtable.h         |   1 +
 arch/x86/Kconfig                         |   1 +
 mm/mremap.c                              | 211 +++++++++++++++++---
 tools/testing/selftests/vm/.gitignore    |   1 +
 tools/testing/selftests/vm/Makefile      |   1 +
 tools/testing/selftests/vm/mremap_test.c | 243 +++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests   |  11 +
 9 files changed, 448 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/vm/mremap_test.c

Comments

Kirill A . Shutemov Sept. 30, 2020, 10:32 p.m. UTC | #1
On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> mremap time can be optimized by moving entries at the PMD/PUD level if
> the source and destination addresses are PMD/PUD-aligned and
> PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> x86. Other architectures where this type of move is supported and known to
> be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> and HAVE_MOVE_PUD.
> 
> Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> region on x86 and arm64:
> 
>     - HAVE_MOVE_PMD is already enabled on x86 : N/A
>     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> 
>     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
>     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> 
>           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
>           give a total of ~150x speed up on arm64.

Is there a *real* workload that benefit from HAVE_MOVE_PUD?
Lokesh Gidra Sept. 30, 2020, 10:42 p.m. UTC | #2
On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > mremap time can be optimized by moving entries at the PMD/PUD level if
> > the source and destination addresses are PMD/PUD-aligned and
> > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > x86. Other architectures where this type of move is supported and known to
> > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > and HAVE_MOVE_PUD.
> >
> > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > region on x86 and arm64:
> >
> >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> >
> >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> >
> >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> >           give a total of ~150x speed up on arm64.
>
> Is there a *real* workload that benefit from HAVE_MOVE_PUD?
>
We have a Java garbage collector under development which requires
moving physical pages of a multi-gigabyte heap using mremap. During this
move, the application threads have to be paused for correctness. It is
critical to keep this pause as short as possible to avoid jitters
during user interaction. This is where HAVE_MOVE_PUD will greatly
help.
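
A minimal sketch of the remap step such a collector might issue while the
world is stopped (purely illustrative; the names, flags and surrounding
pause logic are assumptions, not the collector's actual code):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Move the heap's pages to a pre-reserved, suitably aligned to-space. */
static int flip_heap(void *from_space, void *to_space, size_t heap_size)
{
	void *res = mremap(from_space, heap_size, heap_size,
			   MREMAP_MAYMOVE | MREMAP_FIXED, to_space);

	return res == MAP_FAILED ? -1 : 0;
}

With both addresses PMD/PUD-aligned, the kernel can move whole page-table
entries instead of copying PTEs one by one, which is what keeps the pause
short.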
> --
>  Kirill A. Shutemov
Joel Fernandes Sept. 30, 2020, 10:46 p.m. UTC | #3
On Wed, Sep 30, 2020 at 6:42 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > the source and destination addresses are PMD/PUD-aligned and
> > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > x86. Other architectures where this type of move is supported and known to
> > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > and HAVE_MOVE_PUD.
> > >
> > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > region on x86 and arm64:
> > >
> > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > >
> > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > >
> > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > >           give a total of ~150x speed up on arm64.
> >
> > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> >
> We have a Java garbage collector under development which requires
> moving physical pages of multi-gigabyte heap using mremap. During this
> move, the application threads have to be paused for correctness. It is
> critical to keep this pause as short as possible to avoid jitters
> during user interaction. This is where HAVE_MOVE_PUD will greatly
> help.

And that detail should totally have gone into the commit message :-/

Thanks,

 - Joel
Kalesh Singh Sept. 30, 2020, 11:03 p.m. UTC | #4
On Wed, Sep 30, 2020 at 6:47 PM Joel Fernandes <joelaf@google.com> wrote:
>
> On Wed, Sep 30, 2020 at 6:42 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > the source and destination addresses are PMD/PUD-aligned and
> > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > x86. Other architectures where this type of move is supported and known to
> > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > and HAVE_MOVE_PUD.
> > > >
> > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > region on x86 and arm64:
> > > >
> > > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > >
> > > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > >
> > > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > >           give a total of ~150x speed up on arm64.
> > >
> > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > >
> > We have a Java garbage collector under development which requires
> > moving physical pages of multi-gigabyte heap using mremap. During this
> > move, the application threads have to be paused for correctness. It is
> > critical to keep this pause as short as possible to avoid jitters
> > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > help.
>
> And that detail should totally have gone into the commit message :-/
Hi Joel,
The patch that introduces HAVE_MOVE_PUD in the series mentions the
Android garbage collection use case. I can add these details there in
the next version.
Thanks,
Kalesh
>
> Thanks,
>
>  - Joel
Kirill A . Shutemov Oct. 1, 2020, 12:27 p.m. UTC | #5
On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > the source and destination addresses are PMD/PUD-aligned and
> > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > x86. Other architectures where this type of move is supported and known to
> > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > and HAVE_MOVE_PUD.
> > >
> > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > region on x86 and arm64:
> > >
> > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > >
> > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > >
> > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > >           give a total of ~150x speed up on arm64.
> >
> > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> >
> We have a Java garbage collector under development which requires
> moving physical pages of multi-gigabyte heap using mremap. During this
> move, the application threads have to be paused for correctness. It is
> critical to keep this pause as short as possible to avoid jitters
> during user interaction. This is where HAVE_MOVE_PUD will greatly
> help.

Any chance to quantify the effect of mremap() with and without
HAVE_MOVE_PUD?

I doubt it's a major contributor to the GC pause. I expect you need to
move tens of gigs to get a sizable effect. And if your GC routinely moves
tens of gigs, maybe the problem is somewhere else?

I'm asking for numbers, because an increase in complexity comes with a cost.
If it doesn't provide a substantial benefit to a real workload,
maintaining the code forever doesn't make sense.
Kalesh Singh Oct. 1, 2020, 3:59 p.m. UTC | #6
On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > the source and destination addresses are PMD/PUD-aligned and
> > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > x86. Other architectures where this type of move is supported and known to
> > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > and HAVE_MOVE_PUD.
> > > >
> > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > region on x86 and arm64:
> > > >
> > > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > >
> > > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > >
> > > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > >           give a total of ~150x speed up on arm64.
> > >
> > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > >
> > We have a Java garbage collector under development which requires
> > moving physical pages of multi-gigabyte heap using mremap. During this
> > move, the application threads have to be paused for correctness. It is
> > critical to keep this pause as short as possible to avoid jitters
> > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > help.
>
> Any chance to quantify the effect of mremap() with and without
> HAVE_MOVE_PUD?
>
> I doubt it's a major contributor to the GC pause. I expect you need to
> move tens of gigs to get sizable effect. And if your GC routinely moves
> tens of gigs, maybe problem somewhere else?
>
> I'm asking for numbers, because increase in complexity comes with cost.
> If it doesn't provide an substantial benefit to a real workload
> maintaining the code forever doesn't make sense.

Lokesh on this thread would be better able to answer this. I'll let
him weigh in here.
Thanks, Kalesh
>
> --
>  Kirill A. Shutemov
>
Lokesh Gidra Oct. 2, 2020, 12:09 a.m. UTC | #7
On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > x86. Other architectures where this type of move is supported and known to
> > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > and HAVE_MOVE_PUD.
> > > > >
> > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > region on x86 and arm64:
> > > > >
> > > > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > > >
> > > > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > >
> > > > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > >           give a total of ~150x speed up on arm64.
> > > >
> > > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > > >
> > > We have a Java garbage collector under development which requires
> > > moving physical pages of multi-gigabyte heap using mremap. During this
> > > move, the application threads have to be paused for correctness. It is
> > > critical to keep this pause as short as possible to avoid jitters
> > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > help.
> >
> > Any chance to quantify the effect of mremap() with and without
> > HAVE_MOVE_PUD?
> >
> > I doubt it's a major contributor to the GC pause. I expect you need to
> > move tens of gigs to get sizable effect. And if your GC routinely moves
> > tens of gigs, maybe problem somewhere else?
> >
> > I'm asking for numbers, because increase in complexity comes with cost.
> > If it doesn't provide an substantial benefit to a real workload
> > maintaining the code forever doesn't make sense.
>
mremap is indeed the biggest contributor to the GC pause. It has to
take place in what is typically known as a 'stop-the-world' pause,
wherein all application threads are paused. During this pause the GC
thread flips the GC roots (threads' stacks, globals, etc.), and then
resumes the threads along with concurrent compaction of the heap. This
GC-root flip differs depending on which compaction algorithm is being
used.

In our case it involves updating object references in threads' stacks
and remapping the Java heap to a different location. The threads' stacks
can be handled in parallel with the mremap. Therefore, the dominant
factor is indeed the cost of mremap. From patches 2 and 4, it is clear
that remapping 1GB without this optimization will take ~9ms on arm64.

Although this mremap has to happen only once every GC cycle, and the
typical size is also not going to be more than a GB or two, pausing
application threads for ~9ms is guaranteed to cause jitters. OTOH,
with this optimization, mremap is reduced to ~60us, which is a totally
acceptable pause time.
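
For scale, assuming a 60fps UI as the reference point (a ~16.7ms frame
budget), a ~9ms remap by itself consumes more than half a frame, whereas
~60us is well under 1% of it.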

Unfortunately, the implementation of the new GC algorithm hasn't yet
reached the point where I can quantify the effect of this
optimization. But I can confirm that without this optimization the new
GC will not be approved.


> Lokesh on this thread would be better able to answer this. I'll let
> him weigh in here.
> Thanks, Kalesh
> >
> > --
> >  Kirill A. Shutemov
> >
Kirill A . Shutemov Oct. 2, 2020, 5:35 a.m. UTC | #8
On Thu, Oct 01, 2020 at 05:09:02PM -0700, Lokesh Gidra wrote:
> On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > > <kirill.shutemov@linux.intel.com> wrote:
> > > > >
> > > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > > x86. Other architectures where this type of move is supported and known to
> > > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > > and HAVE_MOVE_PUD.
> > > > > >
> > > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > > region on x86 and arm64:
> > > > > >
> > > > > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > > > >
> > > > > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > > >
> > > > > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > > >           give a total of ~150x speed up on arm64.
> > > > >
> > > > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > > > >
> > > > We have a Java garbage collector under development which requires
> > > > moving physical pages of multi-gigabyte heap using mremap. During this
> > > > move, the application threads have to be paused for correctness. It is
> > > > critical to keep this pause as short as possible to avoid jitters
> > > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > > help.
> > >
> > > Any chance to quantify the effect of mremap() with and without
> > > HAVE_MOVE_PUD?
> > >
> > > I doubt it's a major contributor to the GC pause. I expect you need to
> > > move tens of gigs to get sizable effect. And if your GC routinely moves
> > > tens of gigs, maybe problem somewhere else?
> > >
> > > I'm asking for numbers, because increase in complexity comes with cost.
> > > If it doesn't provide an substantial benefit to a real workload
> > > maintaining the code forever doesn't make sense.
> >
> mremap is indeed the biggest contributor to the GC pause. It has to
> take place in what is typically known as a 'stop-the-world' pause,
> wherein all application threads are paused. During this pause the GC
> thread flips the GC roots (threads' stacks, globals etc.), and then
> resumes threads along with concurrent compaction of the heap.This
> GC-root flip differs depending on which compaction algorithm is being
> used.
> 
> In our case it involves updating object references in threads' stacks
> and remapping java heap to a different location. The threads' stacks
> can be handled in parallel with the mremap. Therefore, the dominant
> factor is indeed the cost of mremap. From patches 2 and 4, it is clear
> that remapping 1GB without this optimization will take ~9ms on arm64.
> 
> Although this mremap has to happen only once every GC cycle, and the
> typical size is also not going to be more than a GB or 2, pausing
> application threads for ~9ms is guaranteed to cause jitters. OTOH,
> with this optimization, mremap is reduced to ~60us, which is a totally
> acceptable pause time.
> 
> Unfortunately, implementation of the new GC algorithm hasn't yet
> reached the point where I can quantify the effect of this
> optimization. But I can confirm that without this optimization the new
> GC will not be approved.

IIUC, the 9ms -> 90us improvement is attributed to the combination of
HAVE_MOVE_PMD and HAVE_MOVE_PUD, right? I expect HAVE_MOVE_PMD to be
reasonable for some workloads, but the marginal benefit of HAVE_MOVE_PUD
is in doubt. Do you see it being useful for your workload?
Lokesh Gidra Oct. 2, 2020, 6:39 a.m. UTC | #9
On Thu, Oct 1, 2020 at 10:36 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Thu, Oct 01, 2020 at 05:09:02PM -0700, Lokesh Gidra wrote:
> > On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> > >
> > > On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > > > <kirill.shutemov@linux.intel.com> wrote:
> > > > > >
> > > > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > > > x86. Other architectures where this type of move is supported and known to
> > > > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > > > and HAVE_MOVE_PUD.
> > > > > > >
> > > > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > > > region on x86 and arm64:
> > > > > > >
> > > > > > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > > > > >
> > > > > > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > > > >
> > > > > > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > > > >           give a total of ~150x speed up on arm64.
> > > > > >
> > > > > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > > > > >
> > > > > We have a Java garbage collector under development which requires
> > > > > moving physical pages of multi-gigabyte heap using mremap. During this
> > > > > move, the application threads have to be paused for correctness. It is
> > > > > critical to keep this pause as short as possible to avoid jitters
> > > > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > > > help.
> > > >
> > > > Any chance to quantify the effect of mremap() with and without
> > > > HAVE_MOVE_PUD?
> > > >
> > > > I doubt it's a major contributor to the GC pause. I expect you need to
> > > > move tens of gigs to get sizable effect. And if your GC routinely moves
> > > > tens of gigs, maybe problem somewhere else?
> > > >
> > > > I'm asking for numbers, because increase in complexity comes with cost.
> > > > If it doesn't provide an substantial benefit to a real workload
> > > > maintaining the code forever doesn't make sense.
> > >
> > mremap is indeed the biggest contributor to the GC pause. It has to
> > take place in what is typically known as a 'stop-the-world' pause,
> > wherein all application threads are paused. During this pause the GC
> > thread flips the GC roots (threads' stacks, globals etc.), and then
> > resumes threads along with concurrent compaction of the heap.This
> > GC-root flip differs depending on which compaction algorithm is being
> > used.
> >
> > In our case it involves updating object references in threads' stacks
> > and remapping java heap to a different location. The threads' stacks
> > can be handled in parallel with the mremap. Therefore, the dominant
> > factor is indeed the cost of mremap. From patches 2 and 4, it is clear
> > that remapping 1GB without this optimization will take ~9ms on arm64.
> >
> > Although this mremap has to happen only once every GC cycle, and the
> > typical size is also not going to be more than a GB or 2, pausing
> > application threads for ~9ms is guaranteed to cause jitters. OTOH,
> > with this optimization, mremap is reduced to ~60us, which is a totally
> > acceptable pause time.
> >
> > Unfortunately, implementation of the new GC algorithm hasn't yet
> > reached the point where I can quantify the effect of this
> > optimization. But I can confirm that without this optimization the new
> > GC will not be approved.
>
> IIUC, the 9ms -> 90us improvement attributed to combination HAVE_MOVE_PMD
> and HAVE_MOVE_PUD, right? I expect HAVE_MOVE_PMD to be reasonable for some
> workloads, but marginal benefit of HAVE_MOVE_PUD is in doubt. Do you see
> it's useful for your workload?
>
Yes, 9ms -> 90us is when both are combined. Past experience has
been that even a ~1ms-long stop-the-world pause is prone to cause
jitters. HAVE_MOVE_PMD only takes us so far, so HAVE_MOVE_PUD is
required to bring the mremap cost down to an acceptable level.

Ideally, I was hoping that the functionality of HAVE_MOVE_PMD could be
extended to all levels of the hierarchical page table, simplifying the
implementation in the process. But, judging from patch 3, that
unfortunately doesn't seem to be possible.

> --
>  Kirill A. Shutemov