Message ID: 20220110210441.2074798-1-jingzhangos@google.com
Series: ARM64: Guest performance improvement during dirty logging
On Mon, 10 Jan 2022 21:04:38 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
>
> This patch is to reduce the performance degradation of guest workload during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging. The MMU lock is replaced with an rwlock, by which all
> permission relaxations on leaf PTEs can be performed under the read lock. This
> greatly reduces the MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be reduced
> by more than 60%.
>
> Problem:
> * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradation is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> * VM spec: #vCPU: 48, #Mem/vCPU: 4GB

What are the host and guest page sizes?

>
> Analysis:
> * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
>   get the number of contentions on the MMU lock and the "dirty memory time"
>   for various VM specs, by using the test command:
>   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]

How is this test representative of the internal live migration test
you mention above? '-m 2' indicates a mode that varies depending on
the HW and revision of the test (I just added a bunch of supported
modes). Which one is it?

>   Below are the results:
> +-------+------------------------+-----------------------+
> | #vCPU | dirty memory time (ms) | number of contentions |
> +-------+------------------------+-----------------------+
> | 1     | 926                    | 0                     |
> +-------+------------------------+-----------------------+
> | 2     | 1189                   | 4732558               |
> +-------+------------------------+-----------------------+
> | 4     | 2503                   | 11527185              |
> +-------+------------------------+-----------------------+
> | 8     | 5069                   | 24881677              |
> +-------+------------------------+-----------------------+
> | 16    | 10340                  | 50347956              |
> +-------+------------------------+-----------------------+
> | 32    | 20351                  | 100605720             |
> +-------+------------------------+-----------------------+
> | 64    | 40994                  | 201442478             |
> +-------+------------------------+-----------------------+
>
> * From the test results above, the "dirty memory time" and the number of
>   MMU lock contentions scale with the number of vCPUs. That means all the
>   dirty memory operations from all vCPU threads have been serialized by
>   the MMU lock. Further analysis also shows that the permission relaxation
>   during dirty logging is where vCPU threads get serialized.
>
> Solution:
> * On ARM64, there is no mechanism such as PML (Page Modification Logging),
>   and the dirty-bit solution for dirty logging is much more complicated
>   compared to the write-protection solution. The straightforward way to
>   reduce the guest performance degradation is to enhance the concurrency
>   of the permission fault path during dirty logging.
> * In this patch, we only put leaf PTE permission relaxation for dirty
>   logging under the read lock; all others still go under the write lock.
>
> Below are the results based on the solution:
> +-------+------------------------+
> | #vCPU | dirty memory time (ms) |
> +-------+------------------------+
> | 1     | 803                    |
> +-------+------------------------+
> | 2     | 843                    |
> +-------+------------------------+
> | 4     | 942                    |
> +-------+------------------------+
> | 8     | 1458                   |
> +-------+------------------------+
> | 16    | 2853                   |
> +-------+------------------------+
> | 32    | 5886                   |
> +-------+------------------------+
> | 64    | 12190                  |
> +-------+------------------------+
> All "dirty memory time" values have been reduced by more than 60% as the
> number of vCPUs grows.

How does that translate to the original problem statement with your
live migration test?

Thanks,

        M.
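[Archive note] The cover letter above describes the fast path only in prose. A minimal sketch of the idea follows, assuming the MMU lock has already been converted to an rwlock (patch 1 of the series) and reusing the existing kvm_pgtable_stage2_relax_perms() helper from the arm64 stage-2 page-table code; the actual patch may structure this differently.

#include <linux/kvm_host.h>
#include <asm/kvm_pgtable.h>

/*
 * Sketch only, not the actual patch: a write-protection fault taken
 * while dirty logging is enabled only needs to relax the permissions
 * of an existing leaf PTE, so it can run under the read side of the
 * MMU lock. Faults that change the page-table structure (mapping a
 * new page, splitting a block, and so on) still take the write lock.
 */
static int fast_relax_perms(struct kvm *kvm, phys_addr_t fault_ipa,
                            enum kvm_pgtable_prot prot)
{
        int ret;

        read_lock(&kvm->mmu_lock);
        ret = kvm_pgtable_stage2_relax_perms(kvm->arch.mmu.pgt,
                                             fault_ipa, prot);
        read_unlock(&kvm->mmu_lock);

        return ret;
}

A fault handler built on this would try the read-locked path first for write faults on pages that are already mapped and only write-protected for dirty logging, and fall back to the existing write-locked path for everything else.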
On Tue, Jan 11, 2022 at 3:55 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 10 Jan 2022 21:04:38 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > This patch is to reduce the performance degradation of guest workload during
> > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > during dirty logging. The MMU lock is replaced with an rwlock, by which all
> > permission relaxations on leaf PTEs can be performed under the read lock. This
> > greatly reduces the MMU lock contention during dirty logging. With this
> > solution, the source guest workload performance degradation can be reduced
> > by more than 60%.
> >
> > Problem:
> > * A Google internal live migration test shows that the source guest workload
> >   performance has >99% degradation for about 105 seconds, >50% degradation
> >   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
> >   This shows that most of the time, the guest workload degradation is above
> >   99%, which obviously needs some improvement compared to the test result
> >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
>
> What are the host and guest page sizes?

Both are 4K, and guest memory is backed by 2M hugepages. Will add the
info in future posts.

> >
> > Analysis:
> > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
> >   get the number of contentions on the MMU lock and the "dirty memory time"
> >   for various VM specs, by using the test command:
> >   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>
> How is this test representative of the internal live migration test
> you mention above? '-m 2' indicates a mode that varies depending on
> the HW and revision of the test (I just added a bunch of supported
> modes). Which one is it?

The "dirty memory time" is the time vCPU threads spend in KVM after a
fault. A higher "dirty memory time" means higher degradation of the
guest workload. '-m 2' indicates the mode "PA-bits:48, VA-bits:48,
4K pages". Will add this for future posts.

> >   Below are the results:
> > +-------+------------------------+-----------------------+
> > | #vCPU | dirty memory time (ms) | number of contentions |
> > +-------+------------------------+-----------------------+
> > | 1     | 926                    | 0                     |
> > +-------+------------------------+-----------------------+
> > | 2     | 1189                   | 4732558               |
> > +-------+------------------------+-----------------------+
> > | 4     | 2503                   | 11527185              |
> > +-------+------------------------+-----------------------+
> > | 8     | 5069                   | 24881677              |
> > +-------+------------------------+-----------------------+
> > | 16    | 10340                  | 50347956              |
> > +-------+------------------------+-----------------------+
> > | 32    | 20351                  | 100605720             |
> > +-------+------------------------+-----------------------+
> > | 64    | 40994                  | 201442478             |
> > +-------+------------------------+-----------------------+
> >
> > * From the test results above, the "dirty memory time" and the number of
> >   MMU lock contentions scale with the number of vCPUs. That means all the
> >   dirty memory operations from all vCPU threads have been serialized by
> >   the MMU lock. Further analysis also shows that the permission relaxation
> >   during dirty logging is where vCPU threads get serialized.
> >
> > Solution:
> > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> >   and the dirty-bit solution for dirty logging is much more complicated
> >   compared to the write-protection solution. The straightforward way to
> >   reduce the guest performance degradation is to enhance the concurrency
> >   of the permission fault path during dirty logging.
> > * In this patch, we only put leaf PTE permission relaxation for dirty
> >   logging under the read lock; all others still go under the write lock.
> >
> > Below are the results based on the solution:
> > +-------+------------------------+
> > | #vCPU | dirty memory time (ms) |
> > +-------+------------------------+
> > | 1     | 803                    |
> > +-------+------------------------+
> > | 2     | 843                    |
> > +-------+------------------------+
> > | 4     | 942                    |
> > +-------+------------------------+
> > | 8     | 1458                   |
> > +-------+------------------------+
> > | 16    | 2853                   |
> > +-------+------------------------+
> > | 32    | 5886                   |
> > +-------+------------------------+
> > | 64    | 12190                  |
> > +-------+------------------------+
> > All "dirty memory time" values have been reduced by more than 60% as the
> > number of vCPUs grows.
>
> How does that translate to the original problem statement with your
> live migration test?

Based on the solution, the test results from the Google internal live
migration test also show more than 60% improvement, with >99% degradation
for 30s, >50% for 58s, and >10% for 76s. Will add this info in future
posts.

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

Thanks,
Jing
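[Archive note] For readers unfamiliar with the selftest, here is a minimal sketch of how a per-vCPU "dirty memory time" can be measured. It is only an approximation of what dirty_log_perf_test reports; run_vcpu_dirty_pass() is a hypothetical stand-in for entering the guest until it has written every page of its memory slice once, and timespec_sub() is a local helper.

#include <time.h>

/* Hypothetical stand-in: enter the guest until it has dirtied its whole
 * memory slice once (every write takes a permission fault into KVM
 * while dirty logging is enabled). */
extern void run_vcpu_dirty_pass(void *vcpu_args);

static struct timespec timespec_sub(struct timespec a, struct timespec b)
{
        struct timespec r = {
                .tv_sec  = a.tv_sec - b.tv_sec,
                .tv_nsec = a.tv_nsec - b.tv_nsec,
        };

        if (r.tv_nsec < 0) {
                r.tv_sec -= 1;
                r.tv_nsec += 1000000000L;
        }
        return r;
}

/* Called from each vCPU worker thread for every dirty-logging iteration. */
static struct timespec time_dirty_pass(void *vcpu_args)
{
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        run_vcpu_dirty_pass(vcpu_args);         /* guest writes -> faults into KVM */
        clock_gettime(CLOCK_MONOTONIC, &end);

        return timespec_sub(end, start);        /* this vCPU's "dirty memory time" */
}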
Hi Jing,

On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> This patch is to reduce the performance degradation of guest workload during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging. The MMU lock is replaced with an rwlock, by which all
> permission relaxations on leaf PTEs can be performed under the read lock. This
> greatly reduces the MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be reduced
> by more than 60%.
>
> Problem:
> * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradation is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
>
> Analysis:
> * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
>   get the number of contentions on the MMU lock and the "dirty memory time"
>   for various VM specs, by using the test command:
>   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>   Below are the results:
> +-------+------------------------+-----------------------+
> | #vCPU | dirty memory time (ms) | number of contentions |
> +-------+------------------------+-----------------------+
> | 1     | 926                    | 0                     |
> +-------+------------------------+-----------------------+
> | 2     | 1189                   | 4732558               |
> +-------+------------------------+-----------------------+
> | 4     | 2503                   | 11527185              |
> +-------+------------------------+-----------------------+
> | 8     | 5069                   | 24881677              |
> +-------+------------------------+-----------------------+
> | 16    | 10340                  | 50347956              |
> +-------+------------------------+-----------------------+
> | 32    | 20351                  | 100605720             |
> +-------+------------------------+-----------------------+
> | 64    | 40994                  | 201442478             |
> +-------+------------------------+-----------------------+
>
> * From the test results above, the "dirty memory time" and the number of
>   MMU lock contentions scale with the number of vCPUs. That means all the
>   dirty memory operations from all vCPU threads have been serialized by
>   the MMU lock. Further analysis also shows that the permission relaxation
>   during dirty logging is where vCPU threads get serialized.
>
> Solution:
> * On ARM64, there is no mechanism such as PML (Page Modification Logging),
>   and the dirty-bit solution for dirty logging is much more complicated
>   compared to the write-protection solution. The straightforward way to
>   reduce the guest performance degradation is to enhance the concurrency
>   of the permission fault path during dirty logging.
> * In this patch, we only put leaf PTE permission relaxation for dirty
>   logging under the read lock; all others still go under the write lock.
>
> Below are the results based on the solution:
> +-------+------------------------+
> | #vCPU | dirty memory time (ms) |
> +-------+------------------------+
> | 1     | 803                    |
> +-------+------------------------+
> | 2     | 843                    |
> +-------+------------------------+
> | 4     | 942                    |
> +-------+------------------------+
> | 8     | 1458                   |
> +-------+------------------------+
> | 16    | 2853                   |
> +-------+------------------------+
> | 32    | 5886                   |
> +-------+------------------------+
> | 64    | 12190                  |
> +-------+------------------------+

Just curious, do you know why the time still roughly doubles with the
number of vCPUs? Maybe you performed another experiment or have some
guess(es).

Thanks,
Ricardo

> All "dirty memory time" values have been reduced by more than 60% as the
> number of vCPUs grows.
>
> ---
>
> Jing Zhang (3):
>   KVM: arm64: Use read/write spin lock for MMU protection
>   KVM: arm64: Add fast path to handle permission relaxation during dirty
>     logging
>   KVM: selftests: Add vgic initialization for dirty log perf test for
>     ARM
>
>  arch/arm64/include/asm/kvm_host.h             |  2 +
>  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
>  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
>  3 files changed, 80 insertions(+), 18 deletions(-)
>
>
> base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> --
> 2.34.1.575.g55b058a8bb-goog
On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Hi Jing,
>
> On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > This patch is to reduce the performance degradation of guest workload during
> > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > during dirty logging. The MMU lock is replaced with an rwlock, by which all
> > permission relaxations on leaf PTEs can be performed under the read lock. This
> > greatly reduces the MMU lock contention during dirty logging. With this
> > solution, the source guest workload performance degradation can be reduced
> > by more than 60%.
> >
> > Problem:
> > * A Google internal live migration test shows that the source guest workload
> >   performance has >99% degradation for about 105 seconds, >50% degradation
> >   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
> >   This shows that most of the time, the guest workload degradation is above
> >   99%, which obviously needs some improvement compared to the test result
> >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> >
> > Analysis:
> > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
> >   get the number of contentions on the MMU lock and the "dirty memory time"
> >   for various VM specs, by using the test command:
> >   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> >   Below are the results:
> > +-------+------------------------+-----------------------+
> > | #vCPU | dirty memory time (ms) | number of contentions |
> > +-------+------------------------+-----------------------+
> > | 1     | 926                    | 0                     |
> > +-------+------------------------+-----------------------+
> > | 2     | 1189                   | 4732558               |
> > +-------+------------------------+-----------------------+
> > | 4     | 2503                   | 11527185              |
> > +-------+------------------------+-----------------------+
> > | 8     | 5069                   | 24881677              |
> > +-------+------------------------+-----------------------+
> > | 16    | 10340                  | 50347956              |
> > +-------+------------------------+-----------------------+
> > | 32    | 20351                  | 100605720             |
> > +-------+------------------------+-----------------------+
> > | 64    | 40994                  | 201442478             |
> > +-------+------------------------+-----------------------+
> >
> > * From the test results above, the "dirty memory time" and the number of
> >   MMU lock contentions scale with the number of vCPUs. That means all the
> >   dirty memory operations from all vCPU threads have been serialized by
> >   the MMU lock. Further analysis also shows that the permission relaxation
> >   during dirty logging is where vCPU threads get serialized.
> >
> > Solution:
> > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> >   and the dirty-bit solution for dirty logging is much more complicated
> >   compared to the write-protection solution. The straightforward way to
> >   reduce the guest performance degradation is to enhance the concurrency
> >   of the permission fault path during dirty logging.
> > * In this patch, we only put leaf PTE permission relaxation for dirty
> >   logging under the read lock; all others still go under the write lock.
> >
> > Below are the results based on the solution:
> > +-------+------------------------+
> > | #vCPU | dirty memory time (ms) |
> > +-------+------------------------+
> > | 1     | 803                    |
> > +-------+------------------------+
> > | 2     | 843                    |
> > +-------+------------------------+
> > | 4     | 942                    |
> > +-------+------------------------+
> > | 8     | 1458                   |
> > +-------+------------------------+
> > | 16    | 2853                   |
> > +-------+------------------------+
> > | 32    | 5886                   |
> > +-------+------------------------+
> > | 64    | 12190                  |
> > +-------+------------------------+
>
> Just curious, do you know why the time still roughly doubles with the
> number of vCPUs? Maybe you performed another experiment or have some
> guess(es).

Yes. It comes from the serialization caused by the TLB flush whenever a
permission is relaxed. I tried a test with the TLB flushes removed (of
course they shouldn't be removed), and the time was close to a constant
no matter the number of vCPUs.

>
> Thanks,
> Ricardo
>
> > All "dirty memory time" values have been reduced by more than 60% as the
> > number of vCPUs grows.
> >
> > ---
> >
> > Jing Zhang (3):
> >   KVM: arm64: Use read/write spin lock for MMU protection
> >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> >     logging
> >   KVM: selftests: Add vgic initialization for dirty log perf test for
> >     ARM
> >
> >  arch/arm64/include/asm/kvm_host.h             |  2 +
> >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> >  3 files changed, 80 insertions(+), 18 deletions(-)
> >
> >
> > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > --
> > 2.34.1.575.g55b058a8bb-goog

Thanks,
Jing
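[Archive note] To make that remaining serialization point concrete, here is a simplified sketch of the per-fault work during dirty logging. It is not the actual arch/arm64/kvm/hyp/pgtable.c code: stage2_set_leaf_attr() is a hypothetical stand-in for the real leaf-update logic, while kvm_call_hyp() and __kvm_tlb_flush_vmid_ipa() are existing arm64 KVM symbols. The broadcast TLB invalidation at the end is the cost the reply above identifies as still scaling with the number of faulting vCPUs.

#include <linux/kvm_host.h>
#include <asm/kvm_pgtable.h>
#include <asm/kvm_asm.h>

/*
 * Hypothetical helper: atomically ORs the given attribute bits into the
 * existing leaf PTE for @ipa and reports the level it was found at.
 */
extern int stage2_set_leaf_attr(struct kvm_pgtable *pgt, u64 ipa,
                                kvm_pte_t attr_set, u32 *level);

/* Simplified per-fault work while dirty logging is enabled (sketch only). */
static int stage2_relax_leaf(struct kvm_pgtable *pgt, u64 ipa,
                             kvm_pte_t attr_set)
{
        u32 level;
        int ret;

        /*
         * 1. Relax the permissions on the existing leaf PTE. With the
         *    rwlock conversion, this part can run concurrently on all
         *    faulting vCPUs under the read lock.
         */
        ret = stage2_set_leaf_attr(pgt, ipa, attr_set, &level);
        if (ret)
                return ret;

        /*
         * 2. Invalidate the stale TLB entry for this IPA. The TLBI is
         *    broadcast and the DSB completing it waits for all CPUs, so
         *    this step gets more expensive as more vCPUs fault at the
         *    same time, which is why "dirty memory time" keeps growing
         *    with the vCPU count even with the read-lock fast path.
         */
        kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, ipa, level);

        return 0;
}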
On Wed, Jan 12, 2022 at 07:50:48PM -0800, Jing Zhang wrote:
> On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Hi Jing,
> >
> > On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > > This patch is to reduce the performance degradation of guest workload during
> > > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > > during dirty logging. The MMU lock is replaced with an rwlock, by which all
> > > permission relaxations on leaf PTEs can be performed under the read lock. This
> > > greatly reduces the MMU lock contention during dirty logging. With this
> > > solution, the source guest workload performance degradation can be reduced
> > > by more than 60%.
> > >
> > > Problem:
> > > * A Google internal live migration test shows that the source guest workload
> > >   performance has >99% degradation for about 105 seconds, >50% degradation
> > >   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
> > >   This shows that most of the time, the guest workload degradation is above
> > >   99%, which obviously needs some improvement compared to the test result
> > >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > > * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > > * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> > >
> > > Analysis:
> > > * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
> > >   get the number of contentions on the MMU lock and the "dirty memory time"
> > >   for various VM specs, by using the test command:
> > >   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> > >   Below are the results:
> > > +-------+------------------------+-----------------------+
> > > | #vCPU | dirty memory time (ms) | number of contentions |
> > > +-------+------------------------+-----------------------+
> > > | 1     | 926                    | 0                     |
> > > +-------+------------------------+-----------------------+
> > > | 2     | 1189                   | 4732558               |
> > > +-------+------------------------+-----------------------+
> > > | 4     | 2503                   | 11527185              |
> > > +-------+------------------------+-----------------------+
> > > | 8     | 5069                   | 24881677              |
> > > +-------+------------------------+-----------------------+
> > > | 16    | 10340                  | 50347956              |
> > > +-------+------------------------+-----------------------+
> > > | 32    | 20351                  | 100605720             |
> > > +-------+------------------------+-----------------------+
> > > | 64    | 40994                  | 201442478             |
> > > +-------+------------------------+-----------------------+
> > >
> > > * From the test results above, the "dirty memory time" and the number of
> > >   MMU lock contentions scale with the number of vCPUs. That means all the
> > >   dirty memory operations from all vCPU threads have been serialized by
> > >   the MMU lock. Further analysis also shows that the permission relaxation
> > >   during dirty logging is where vCPU threads get serialized.
> > >
> > > Solution:
> > > * On ARM64, there is no mechanism such as PML (Page Modification Logging),
> > >   and the dirty-bit solution for dirty logging is much more complicated
> > >   compared to the write-protection solution. The straightforward way to
> > >   reduce the guest performance degradation is to enhance the concurrency
> > >   of the permission fault path during dirty logging.
> > > * In this patch, we only put leaf PTE permission relaxation for dirty
> > >   logging under the read lock; all others still go under the write lock.
> > >
> > > Below are the results based on the solution:
> > > +-------+------------------------+
> > > | #vCPU | dirty memory time (ms) |
> > > +-------+------------------------+
> > > | 1     | 803                    |
> > > +-------+------------------------+
> > > | 2     | 843                    |
> > > +-------+------------------------+
> > > | 4     | 942                    |
> > > +-------+------------------------+
> > > | 8     | 1458                   |
> > > +-------+------------------------+
> > > | 16    | 2853                   |
> > > +-------+------------------------+
> > > | 32    | 5886                   |
> > > +-------+------------------------+
> > > | 64    | 12190                  |
> > > +-------+------------------------+
> >
> > Just curious, do you know why the time still roughly doubles with the
> > number of vCPUs? Maybe you performed another experiment or have some
> > guess(es).
> Yes. It comes from the serialization caused by the TLB flush whenever a
> permission is relaxed. I tried a test with the TLB flushes removed (of
> course they shouldn't be removed), and the time was close to a constant
> no matter the number of vCPUs.

Got it, thanks for the info.

Ricardo

> >
> > Thanks,
> > Ricardo
> >
> > > All "dirty memory time" values have been reduced by more than 60% as the
> > > number of vCPUs grows.
> > >
> > > ---
> > >
> > > Jing Zhang (3):
> > >   KVM: arm64: Use read/write spin lock for MMU protection
> > >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> > >     logging
> > >   KVM: selftests: Add vgic initialization for dirty log perf test for
> > >     ARM
> > >
> > >  arch/arm64/include/asm/kvm_host.h             |  2 +
> > >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> > >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> > >  3 files changed, 80 insertions(+), 18 deletions(-)
> > >
> > >
> > > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > > --
> > > 2.34.1.575.g55b058a8bb-goog
> Thanks,
> Jing