Message ID: 20220110210441.2074798-1-jingzhangos@google.com
Series: ARM64: Guest performance improvement during dirty logging
On Mon, 10 Jan 2022 21:04:38 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
>
> This patch is to reduce the performance degradation of guest workload during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging. The MMU lock is replaced with an rwlock, by which all
> permission relaxations on leaf PTEs can be performed under the read lock. This
> greatly reduces the MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be reduced
> by more than 60%.
>
> Problem:
> * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradation is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> * VM spec: #vCPU: 48, #Mem/vCPU: 4GB

What are the host and guest page sizes?

>
> Analysis:
> * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
>   get the number of contentions on the MMU lock and the "dirty memory time"
>   for various VM specs, by using the test command:
>   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]

How is this test representative of the internal live migration test
you mention above? '-m 2' indicates a mode that varies depending on
the HW and revision of the test (I just added a bunch of supported
modes). Which one is it?

>   Below are the results:
> +-------+------------------------+-----------------------+
> | #vCPU | dirty memory time (ms) | number of contentions |
> +-------+------------------------+-----------------------+
> | 1     | 926                    | 0                     |
> +-------+------------------------+-----------------------+
> | 2     | 1189                   | 4732558               |
> +-------+------------------------+-----------------------+
> | 4     | 2503                   | 11527185              |
> +-------+------------------------+-----------------------+
> | 8     | 5069                   | 24881677              |
> +-------+------------------------+-----------------------+
> | 16    | 10340                  | 50347956              |
> +-------+------------------------+-----------------------+
> | 32    | 20351                  | 100605720             |
> +-------+------------------------+-----------------------+
> | 64    | 40994                  | 201442478             |
> +-------+------------------------+-----------------------+
>
> * From the test results above, the "dirty memory time" and the number of
>   MMU lock contentions scale with the number of vCPUs. That means all the
>   dirty memory operations from all vCPU threads have been serialized by
>   the MMU lock. Further analysis also shows that the permission relaxation
>   during dirty logging is where vCPU threads get serialized.
>
> Solution:
> * On ARM64, there is no mechanism such as PML (Page Modification Logging),
>   and the dirty-bit solution for dirty logging is much more complicated
>   compared to the write-protection solution. The straightforward way to
>   reduce the guest performance degradation is to enhance the concurrency
>   of the permission fault path during dirty logging.
> * In this patch, we only put leaf PTE permission relaxation for dirty
>   logging under the read lock; all others still go under the write lock.
>
> Below are the results based on the solution:
> +-------+------------------------+
> | #vCPU | dirty memory time (ms) |
> +-------+------------------------+
> | 1     | 803                    |
> +-------+------------------------+
> | 2     | 843                    |
> +-------+------------------------+
> | 4     | 942                    |
> +-------+------------------------+
> | 8     | 1458                   |
> +-------+------------------------+
> | 16    | 2853                   |
> +-------+------------------------+
> | 32    | 5886                   |
> +-------+------------------------+
> | 64    | 12190                  |
> +-------+------------------------+
> All "dirty memory time" values have been reduced by more than 60% as the
> number of vCPUs grows.

How does that translate to the original problem statement with your
live migration test?

Thanks,

        M.
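[Archive note] The cover letter above describes the fast path only in prose. A minimal sketch of the idea follows, assuming the MMU lock has already been converted to an rwlock (patch 1 of the series) and reusing the existing kvm_pgtable_stage2_relax_perms() helper from the arm64 stage-2 page-table code; the actual patch may structure this differently.

#include <linux/kvm_host.h>
#include <asm/kvm_pgtable.h>

/*
 * Sketch only, not the actual patch: a write-protection fault taken
 * while dirty logging is enabled only needs to relax the permissions
 * of an existing leaf PTE, so it can run under the read side of the
 * MMU lock. Faults that change the page-table structure (mapping a
 * new page, splitting a block, and so on) still take the write lock.
 */
static int fast_relax_perms(struct kvm *kvm, phys_addr_t fault_ipa,
                            enum kvm_pgtable_prot prot)
{
        int ret;

        read_lock(&kvm->mmu_lock);
        ret = kvm_pgtable_stage2_relax_perms(kvm->arch.mmu.pgt,
                                             fault_ipa, prot);
        read_unlock(&kvm->mmu_lock);

        return ret;
}

A fault handler built on this would try the read-locked path first for write faults on pages that are already mapped and only write-protected for dirty logging, and fall back to the existing write-locked path for everything else.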
On Tue, Jan 11, 2022 at 3:55 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 10 Jan 2022 21:04:38 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > This patch is to reduce the performance degradation of guest workload during
> > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > during dirty logging. The MMU lock is replaced with an rwlock, by which all
> > permission relaxations on leaf PTEs can be performed under the read lock. This
> > greatly reduces the MMU lock contention during dirty logging. With this
> > solution, the source guest workload performance degradation can be reduced
> > by more than 60%.
> >
> > Problem:
> > * A Google internal live migration test shows that the source guest workload
> >   performance has >99% degradation for about 105 seconds, >50% degradation
> >   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
> >   This shows that most of the time, the guest workload degradation is above
> >   99%, which obviously needs some improvement compared to the test result
> >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
>
> What are the host and guest page sizes?

Both are 4K, and guest memory is backed by 2M hugepages. Will add the
info in future posts.

> >
> > Analysis:
> > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
> >   get the number of contentions on the MMU lock and the "dirty memory time"
> >   for various VM specs, by using the test command:
> >   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>
> How is this test representative of the internal live migration test
> you mention above? '-m 2' indicates a mode that varies depending on
> the HW and revision of the test (I just added a bunch of supported
> modes). Which one is it?

The "dirty memory time" is the time vCPU threads spend in KVM after a
fault. A higher "dirty memory time" means higher degradation of the
guest workload. '-m 2' indicates the mode "PA-bits:48, VA-bits:48,
4K pages". Will add this for future posts.

> >   Below are the results:
> > +-------+------------------------+-----------------------+
> > | #vCPU | dirty memory time (ms) | number of contentions |
> > +-------+------------------------+-----------------------+
> > | 1     | 926                    | 0                     |
> > +-------+------------------------+-----------------------+
> > | 2     | 1189                   | 4732558               |
> > +-------+------------------------+-----------------------+
> > | 4     | 2503                   | 11527185              |
> > +-------+------------------------+-----------------------+
> > | 8     | 5069                   | 24881677              |
> > +-------+------------------------+-----------------------+
> > | 16    | 10340                  | 50347956              |
> > +-------+------------------------+-----------------------+
> > | 32    | 20351                  | 100605720             |
> > +-------+------------------------+-----------------------+
> > | 64    | 40994                  | 201442478             |
> > +-------+------------------------+-----------------------+
> >
> > * From the test results above, the "dirty memory time" and the number of
> >   MMU lock contentions scale with the number of vCPUs. That means all the
> >   dirty memory operations from all vCPU threads have been serialized by
> >   the MMU lock. Further analysis also shows that the permission relaxation
> >   during dirty logging is where vCPU threads get serialized.
> >
> > Solution:
> > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> >   and the dirty-bit solution for dirty logging is much more complicated
> >   compared to the write-protection solution. The straightforward way to
> >   reduce the guest performance degradation is to enhance the concurrency
> >   of the permission fault path during dirty logging.
> > * In this patch, we only put leaf PTE permission relaxation for dirty
> >   logging under the read lock; all others still go under the write lock.
> >
> > Below are the results based on the solution:
> > +-------+------------------------+
> > | #vCPU | dirty memory time (ms) |
> > +-------+------------------------+
> > | 1     | 803                    |
> > +-------+------------------------+
> > | 2     | 843                    |
> > +-------+------------------------+
> > | 4     | 942                    |
> > +-------+------------------------+
> > | 8     | 1458                   |
> > +-------+------------------------+
> > | 16    | 2853                   |
> > +-------+------------------------+
> > | 32    | 5886                   |
> > +-------+------------------------+
> > | 64    | 12190                  |
> > +-------+------------------------+
> > All "dirty memory time" values have been reduced by more than 60% as the
> > number of vCPUs grows.
>
> How does that translate to the original problem statement with your
> live migration test?

Based on the solution, the test results from the Google internal live
migration test also show more than 60% improvement, with >99% degradation
for 30s, >50% for 58s, and >10% for 76s. Will add this info in future
posts.

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

Thanks,
Jing
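[Archive note] For readers unfamiliar with the selftest, here is a minimal sketch of how a per-vCPU "dirty memory time" can be measured. It is only an approximation of what dirty_log_perf_test reports; run_vcpu_dirty_pass() is a hypothetical stand-in for entering the guest until it has written every page of its memory slice once, and timespec_sub() is a local helper.

#include <time.h>

/* Hypothetical stand-in: enter the guest until it has dirtied its whole
 * memory slice once (every write takes a permission fault into KVM
 * while dirty logging is enabled). */
extern void run_vcpu_dirty_pass(void *vcpu_args);

static struct timespec timespec_sub(struct timespec a, struct timespec b)
{
        struct timespec r = {
                .tv_sec  = a.tv_sec - b.tv_sec,
                .tv_nsec = a.tv_nsec - b.tv_nsec,
        };

        if (r.tv_nsec < 0) {
                r.tv_sec -= 1;
                r.tv_nsec += 1000000000L;
        }
        return r;
}

/* Called from each vCPU worker thread for every dirty-logging iteration. */
static struct timespec time_dirty_pass(void *vcpu_args)
{
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        run_vcpu_dirty_pass(vcpu_args);         /* guest writes -> faults into KVM */
        clock_gettime(CLOCK_MONOTONIC, &end);

        return timespec_sub(end, start);        /* this vCPU's "dirty memory time" */
}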
Hi Jing,

On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> This patch is to reduce the performance degradation of guest workload during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging. The MMU lock is replaced with an rwlock, by which all
> permission relaxations on leaf PTEs can be performed under the read lock. This
> greatly reduces the MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be reduced
> by more than 60%.
>
> Problem:
> * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradation is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
>
> Analysis:
> * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
>   get the number of contentions on the MMU lock and the "dirty memory time"
>   for various VM specs, by using the test command:
>   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>   Below are the results:
> +-------+------------------------+-----------------------+
> | #vCPU | dirty memory time (ms) | number of contentions |
> +-------+------------------------+-----------------------+
> | 1     | 926                    | 0                     |
> +-------+------------------------+-----------------------+
> | 2     | 1189                   | 4732558               |
> +-------+------------------------+-----------------------+
> | 4     | 2503                   | 11527185              |
> +-------+------------------------+-----------------------+
> | 8     | 5069                   | 24881677              |
> +-------+------------------------+-----------------------+
> | 16    | 10340                  | 50347956              |
> +-------+------------------------+-----------------------+
> | 32    | 20351                  | 100605720             |
> +-------+------------------------+-----------------------+
> | 64    | 40994                  | 201442478             |
> +-------+------------------------+-----------------------+
>
> * From the test results above, the "dirty memory time" and the number of
>   MMU lock contentions scale with the number of vCPUs. That means all the
>   dirty memory operations from all vCPU threads have been serialized by
>   the MMU lock. Further analysis also shows that the permission relaxation
>   during dirty logging is where vCPU threads get serialized.
>
> Solution:
> * On ARM64, there is no mechanism such as PML (Page Modification Logging),
>   and the dirty-bit solution for dirty logging is much more complicated
>   compared to the write-protection solution. The straightforward way to
>   reduce the guest performance degradation is to enhance the concurrency
>   of the permission fault path during dirty logging.
> * In this patch, we only put leaf PTE permission relaxation for dirty
>   logging under the read lock; all others still go under the write lock.
>
> Below are the results based on the solution:
> +-------+------------------------+
> | #vCPU | dirty memory time (ms) |
> +-------+------------------------+
> | 1     | 803                    |
> +-------+------------------------+
> | 2     | 843                    |
> +-------+------------------------+
> | 4     | 942                    |
> +-------+------------------------+
> | 8     | 1458                   |
> +-------+------------------------+
> | 16    | 2853                   |
> +-------+------------------------+
> | 32    | 5886                   |
> +-------+------------------------+
> | 64    | 12190                  |
> +-------+------------------------+

Just curious, do you know why the time still roughly doubles with the
number of vCPUs? Maybe you performed another experiment or have some
guess(es).

Thanks,
Ricardo

> All "dirty memory time" values have been reduced by more than 60% as the
> number of vCPUs grows.
>
> ---
>
> Jing Zhang (3):
>   KVM: arm64: Use read/write spin lock for MMU protection
>   KVM: arm64: Add fast path to handle permission relaxation during dirty
>     logging
>   KVM: selftests: Add vgic initialization for dirty log perf test for
>     ARM
>
>  arch/arm64/include/asm/kvm_host.h             |  2 +
>  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
>  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
>  3 files changed, 80 insertions(+), 18 deletions(-)
>
>
> base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> --
> 2.34.1.575.g55b058a8bb-goog
On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Hi Jing,
>
> On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > This patch is to reduce the performance degradation of guest workload during
> > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > during dirty logging. The MMU lock is replaced with an rwlock, by which all
> > permission relaxations on leaf PTEs can be performed under the read lock. This
> > greatly reduces the MMU lock contention during dirty logging. With this
> > solution, the source guest workload performance degradation can be reduced
> > by more than 60%.
> >
> > Problem:
> > * A Google internal live migration test shows that the source guest workload
> >   performance has >99% degradation for about 105 seconds, >50% degradation
> >   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
> >   This shows that most of the time, the guest workload degradation is above
> >   99%, which obviously needs some improvement compared to the test result
> >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> >
> > Analysis:
> > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
> >   get the number of contentions on the MMU lock and the "dirty memory time"
> >   for various VM specs, by using the test command:
> >   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> >   Below are the results:
> > +-------+------------------------+-----------------------+
> > | #vCPU | dirty memory time (ms) | number of contentions |
> > +-------+------------------------+-----------------------+
> > | 1     | 926                    | 0                     |
> > +-------+------------------------+-----------------------+
> > | 2     | 1189                   | 4732558               |
> > +-------+------------------------+-----------------------+
> > | 4     | 2503                   | 11527185              |
> > +-------+------------------------+-----------------------+
> > | 8     | 5069                   | 24881677              |
> > +-------+------------------------+-----------------------+
> > | 16    | 10340                  | 50347956              |
> > +-------+------------------------+-----------------------+
> > | 32    | 20351                  | 100605720             |
> > +-------+------------------------+-----------------------+
> > | 64    | 40994                  | 201442478             |
> > +-------+------------------------+-----------------------+
> >
> > * From the test results above, the "dirty memory time" and the number of
> >   MMU lock contentions scale with the number of vCPUs. That means all the
> >   dirty memory operations from all vCPU threads have been serialized by
> >   the MMU lock. Further analysis also shows that the permission relaxation
> >   during dirty logging is where vCPU threads get serialized.
> >
> > Solution:
> > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> >   and the dirty-bit solution for dirty logging is much more complicated
> >   compared to the write-protection solution. The straightforward way to
> >   reduce the guest performance degradation is to enhance the concurrency
> >   of the permission fault path during dirty logging.
> > * In this patch, we only put leaf PTE permission relaxation for dirty
> >   logging under the read lock; all others still go under the write lock.
> >
> > Below are the results based on the solution:
> > +-------+------------------------+
> > | #vCPU | dirty memory time (ms) |
> > +-------+------------------------+
> > | 1     | 803                    |
> > +-------+------------------------+
> > | 2     | 843                    |
> > +-------+------------------------+
> > | 4     | 942                    |
> > +-------+------------------------+
> > | 8     | 1458                   |
> > +-------+------------------------+
> > | 16    | 2853                   |
> > +-------+------------------------+
> > | 32    | 5886                   |
> > +-------+------------------------+
> > | 64    | 12190                  |
> > +-------+------------------------+
>
> Just curious, do you know why the time still roughly doubles with the
> number of vCPUs? Maybe you performed another experiment or have some
> guess(es).

Yes. It comes from the serialization caused by the TLB flush whenever a
permission is relaxed. I tried a test with the TLB flushes removed (of
course they shouldn't be removed), and the time was close to a constant
no matter the number of vCPUs.

>
> Thanks,
> Ricardo
>
> > All "dirty memory time" values have been reduced by more than 60% as the
> > number of vCPUs grows.
> >
> > ---
> >
> > Jing Zhang (3):
> >   KVM: arm64: Use read/write spin lock for MMU protection
> >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> >     logging
> >   KVM: selftests: Add vgic initialization for dirty log perf test for
> >     ARM
> >
> >  arch/arm64/include/asm/kvm_host.h             |  2 +
> >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> >  3 files changed, 80 insertions(+), 18 deletions(-)
> >
> >
> > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > --
> > 2.34.1.575.g55b058a8bb-goog

Thanks,
Jing
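[Archive note] To make that remaining serialization point concrete, here is a simplified sketch of the per-fault work during dirty logging. It is not the actual arch/arm64/kvm/hyp/pgtable.c code: stage2_set_leaf_attr() is a hypothetical stand-in for the real leaf-update logic, while kvm_call_hyp() and __kvm_tlb_flush_vmid_ipa() are existing arm64 KVM symbols. The broadcast TLB invalidation at the end is the cost the reply above identifies as still scaling with the number of faulting vCPUs.

#include <linux/kvm_host.h>
#include <asm/kvm_pgtable.h>
#include <asm/kvm_asm.h>

/*
 * Hypothetical helper: atomically ORs the given attribute bits into the
 * existing leaf PTE for @ipa and reports the level it was found at.
 */
extern int stage2_set_leaf_attr(struct kvm_pgtable *pgt, u64 ipa,
                                kvm_pte_t attr_set, u32 *level);

/* Simplified per-fault work while dirty logging is enabled (sketch only). */
static int stage2_relax_leaf(struct kvm_pgtable *pgt, u64 ipa,
                             kvm_pte_t attr_set)
{
        u32 level;
        int ret;

        /*
         * 1. Relax the permissions on the existing leaf PTE. With the
         *    rwlock conversion, this part can run concurrently on all
         *    faulting vCPUs under the read lock.
         */
        ret = stage2_set_leaf_attr(pgt, ipa, attr_set, &level);
        if (ret)
                return ret;

        /*
         * 2. Invalidate the stale TLB entry for this IPA. The TLBI is
         *    broadcast and the DSB completing it waits for all CPUs, so
         *    this step gets more expensive as more vCPUs fault at the
         *    same time, which is why "dirty memory time" keeps growing
         *    with the vCPU count even with the read-lock fast path.
         */
        kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, ipa, level);

        return 0;
}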
On Wed, Jan 12, 2022 at 07:50:48PM -0800, Jing Zhang wrote:
> On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Hi Jing,
> >
> > On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > > This patch is to reduce the performance degradation of guest workload during
> > > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > > during dirty logging. The MMU lock is replaced with an rwlock, by which all
> > > permission relaxations on leaf PTEs can be performed under the read lock. This
> > > greatly reduces the MMU lock contention during dirty logging. With this
> > > solution, the source guest workload performance degradation can be reduced
> > > by more than 60%.
> > >
> > > Problem:
> > > * A Google internal live migration test shows that the source guest workload
> > >   performance has >99% degradation for about 105 seconds, >50% degradation
> > >   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
> > >   This shows that most of the time, the guest workload degradation is above
> > >   99%, which obviously needs some improvement compared to the test result
> > >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> > >
> > > Analysis:
> > > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
> > >   get the number of contentions on the MMU lock and the "dirty memory time"
> > >   for various VM specs, by using the test command:
> > >   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> > >   Below are the results:
> > > +-------+------------------------+-----------------------+
> > > | #vCPU | dirty memory time (ms) | number of contentions |
> > > +-------+------------------------+-----------------------+
> > > | 1     | 926                    | 0                     |
> > > +-------+------------------------+-----------------------+
> > > | 2     | 1189                   | 4732558               |
> > > +-------+------------------------+-----------------------+
> > > | 4     | 2503                   | 11527185              |
> > > +-------+------------------------+-----------------------+
> > > | 8     | 5069                   | 24881677              |
> > > +-------+------------------------+-----------------------+
> > > | 16    | 10340                  | 50347956              |
> > > +-------+------------------------+-----------------------+
> > > | 32    | 20351                  | 100605720             |
> > > +-------+------------------------+-----------------------+
> > > | 64    | 40994                  | 201442478             |
> > > +-------+------------------------+-----------------------+
> > >
> > > * From the test results above, the "dirty memory time" and the number of
> > >   MMU lock contentions scale with the number of vCPUs. That means all the
> > >   dirty memory operations from all vCPU threads have been serialized by
> > >   the MMU lock. Further analysis also shows that the permission relaxation
> > >   during dirty logging is where vCPU threads get serialized.
> > >
> > > Solution:
> > > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> > >   and the dirty-bit solution for dirty logging is much more complicated
> > >   compared to the write-protection solution. The straightforward way to
> > >   reduce the guest performance degradation is to enhance the concurrency
> > >   of the permission fault path during dirty logging.
> > > * In this patch, we only put leaf PTE permission relaxation for dirty
> > >   logging under the read lock; all others still go under the write lock.
> > >
> > > Below are the results based on the solution:
> > > +-------+------------------------+
> > > | #vCPU | dirty memory time (ms) |
> > > +-------+------------------------+
> > > | 1     | 803                    |
> > > +-------+------------------------+
> > > | 2     | 843                    |
> > > +-------+------------------------+
> > > | 4     | 942                    |
> > > +-------+------------------------+
> > > | 8     | 1458                   |
> > > +-------+------------------------+
> > > | 16    | 2853                   |
> > > +-------+------------------------+
> > > | 32    | 5886                   |
> > > +-------+------------------------+
> > > | 64    | 12190                  |
> > > +-------+------------------------+
> >
> > Just curious, do you know why the time still roughly doubles with the
> > number of vCPUs? Maybe you performed another experiment or have some
> > guess(es).
> Yes. It comes from the serialization caused by the TLB flush whenever a
> permission is relaxed. I tried a test with the TLB flushes removed (of
> course they shouldn't be removed), and the time was close to a constant
> no matter the number of vCPUs.

Got it, thanks for the info.

Ricardo

> >
> > Thanks,
> > Ricardo
> >
> > > All "dirty memory time" values have been reduced by more than 60% as the
> > > number of vCPUs grows.
> > >
> > > ---
> > >
> > > Jing Zhang (3):
> > >   KVM: arm64: Use read/write spin lock for MMU protection
> > >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> > >     logging
> > >   KVM: selftests: Add vgic initialization for dirty log perf test for
> > >     ARM
> > >
> > >  arch/arm64/include/asm/kvm_host.h             |  2 +
> > >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> > >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> > >  3 files changed, 80 insertions(+), 18 deletions(-)
> > >
> > >
> > > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > > --
> > > 2.34.1.575.g55b058a8bb-goog
> Thanks,
> Jing