[RFC,0/7] kvm: arm64: Implement SW/HW combined dirty log

Message ID 20210126124444.27136-1-zhukeqian1@huawei.com (mailing list archive)

Message

zhukeqian Jan. 26, 2021, 12:44 p.m. UTC
The intention:

On arm64, we track the dirty log of vCPUs through guest memory aborts.
KVM takes vCPU time away from the guest to change the stage2 mapping and mark
the page dirty. This has a heavy side effect on the VM, especially when multiple
vCPUs race and some of them block on the kvm mmu_lock.

DBM is a hardware-assisted approach to dirty logging. The MMU changes a PTE to be
writable if its DBM bit is set, so KVM doesn't spend vCPU time to log dirty pages.
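
As a rough illustration of the DBM behaviour described above (this is not code
from the series), the sketch below models a stage-2 leaf descriptor in user
space. The bit positions (DBM at bit 51, S2AP[1] write permission at bit 7)
follow the Arm architecture; the macro and helper names are hypothetical and
exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stage-2 leaf descriptor bits, per the Arm architecture. */
#define S2_PTE_WRITE	(UINT64_C(1) << 7)	/* S2AP[1]: write permission */
#define S2_PTE_DBM	(UINT64_C(1) << 51)	/* Dirty Bit Modifier */

/*
 * Hypothetical helper: models what happens on a guest write to this PTE.
 * Returns true if the write proceeds without a permission fault.
 */
static bool s2_model_hw_write(uint64_t *pte)
{
	assert(pte != NULL);

	if (*pte & S2_PTE_WRITE)
		return true;		/* already writable */

	if (*pte & S2_PTE_DBM) {
		*pte |= S2_PTE_WRITE;	/* hardware makes it writable (dirty) */
		return true;
	}

	return false;			/* permission fault, KVM must handle it */
}

/* Hypothetical helper: a DBM-managed PTE is dirty once it is writable. */
static bool s2_model_hw_dirty(uint64_t pte)
{
	return (pte & S2_PTE_DBM) && (pte & S2_PTE_WRITE);
}
```

With DBM set, the first write flips the PTE to writable with no exit to KVM;
software later has to read the PTE to learn that the page is dirty, which is
where the scanning cost discussed below comes from.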

About this patch series:

The biggest problem with applying DBM to stage2 is that software must scan the PTs
to collect dirty state, which may take a long time and affect migration downtime.

This series implements a SW/HW combined dirty log that can effectively solve this
problem (the SMMU side can also use this approach for DMA dirty log tracking).

The core idea is that we do not enable hardware dirty tracking at the start (we do
not add the DBM bit). When a fault occurs on an arbitrary PT, we perform software
tracking for this PT and enable hardware tracking for its *nearby* PTs (e.g. add the
DBM bit for the nearby 16 PTs). Then, when syncing the dirty log, we already know all
PTs with hardware dirty tracking enabled, so we do not need to scan all PTs.

        mem abort point             mem abort point
              ↓                            ↓
---------------------------------------------------------------
        |********|        |        |********|        |        |
---------------------------------------------------------------
             ↑                            ↑
        set DBM bit of               set DBM bit of
     this PT section (64PTEs)      this PT section (64PTEs)
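
The scheme in the diagram can be sketched as a small user-space model. This is
only an illustration under stated assumptions, not the implementation in this
series: the structure, the function names and the toy memslot size are all
hypothetical, the section size of 64 PTEs is taken from the diagram, and
re-write-protecting pages after a sync is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PTES		256	/* pages in the toy memslot (assumption) */
#define SECTION_PTES	64	/* PTEs per section, as in the diagram */
#define NR_SECTIONS	(NR_PTES / SECTION_PTES)

struct toy_memslot {
	bool writable[NR_PTES];		/* models the stage-2 write permission */
	bool dbm[NR_PTES];		/* models the DBM bit */
	bool dirty[NR_PTES];		/* dirty bitmap reported to userspace */
	bool section_hw[NR_SECTIONS];	/* sections with DBM enabled */
};

/* Guest write. Returns true if it trapped into the modelled hypervisor. */
static bool toy_guest_write(struct toy_memslot *s, size_t idx)
{
	size_t sec, i;

	assert(idx < NR_PTES);

	if (s->writable[idx])
		return false;		/* no fault at all */

	if (s->dbm[idx]) {
		s->writable[idx] = true;	/* hardware tracking, no trap */
		return false;
	}

	/* Mem abort: software tracking for the faulting PTE. */
	s->writable[idx] = true;
	s->dirty[idx] = true;

	/* Enable hardware tracking for the other PTEs of this section. */
	sec = idx / SECTION_PTES;
	for (i = sec * SECTION_PTES; i < (sec + 1) * SECTION_PTES; i++)
		if (i != idx)
			s->dbm[i] = true;
	s->section_hw[sec] = true;

	return true;
}

/* Sync the dirty log. Returns the number of PTEs that had to be scanned. */
static size_t toy_sync_dirty_log(struct toy_memslot *s)
{
	size_t sec, i, scanned = 0;

	for (sec = 0; sec < NR_SECTIONS; sec++) {
		if (!s->section_hw[sec])
			continue;	/* never faulted: nothing to scan */

		for (i = sec * SECTION_PTES; i < (sec + 1) * SECTION_PTES; i++) {
			scanned++;
			if (s->dbm[i] && s->writable[i])
				s->dirty[i] = true;	/* dirtied by hardware */
		}
	}

	return scanned;
}
```

In this model only the first write to a section traps; later writes to
neighbouring pages are absorbed by DBM, and the sync path visits only sections
that have taken at least one fault.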

One may worry that when the dirty rate is very high we still need to scan too many
PTs. We are mainly concerned with the VM stop time. With QEMU dirty rate throttling,
the amount of dirty memory approaches the VM stop threshold, so there are only a few
PTs to scan after the VM stops.

This approach has the advantage of hardware tracking, which minimizes the side effect
on vCPUs, and also the advantage of software tracking, which controls the vCPU dirty
rate. Moreover, software tracking lets us scan PTs at some fixed points, which
greatly reduces scanning time. The biggest benefit is that we can apply this
solution to DMA dirty tracking.

Test:

Host: Kunpeng 920 with 128 CPUs and 512G RAM. Transparent Hugepage is disabled (to
      ensure the test result is not affected by the dissolving of block page tables
      at the early stage of migration).
VM:   16 CPUs, 16GB RAM. Runs 4 pairs of (redis_benchmark+redis_server).

Each configuration (software dirty log and SW/HW combined dirty log) is run 5 times.

Test result:

Redis QPS improves by 5%~7% during VM migration.
VM downtime is essentially unaffected.
About 56.7% of DBM is effectively used.

Keqian Zhu (7):
  arm64: cpufeature: Add API to report system support of HWDBM
  kvm: arm64: Use atomic operation when update PTE
  kvm: arm64: Add level_apply parameter for stage2_attr_walker
  kvm: arm64: Add some HW_DBM related pgtable interfaces
  kvm: arm64: Add some HW_DBM related mmu interfaces
  kvm: arm64: Only write protect selected PTE
  kvm: arm64: Start up SW/HW combined dirty log

 arch/arm64/include/asm/cpufeature.h  |  12 +++
 arch/arm64/include/asm/kvm_host.h    |   6 ++
 arch/arm64/include/asm/kvm_mmu.h     |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h |  45 ++++++++++
 arch/arm64/kvm/arm.c                 | 125 ++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 130 ++++++++++++++++++++++-----
 arch/arm64/kvm/mmu.c                 |  47 +++++++++-
 arch/arm64/kvm/reset.c               |   8 +-
 8 files changed, 351 insertions(+), 29 deletions(-)

Comments

zhukeqian Feb. 1, 2021, 1:12 p.m. UTC | #1
Hi Marc,

Do you have time to have a look at this? Thanks ;-)

Keqian.

Marc Zyngier Feb. 1, 2021, 1:17 p.m. UTC | #2
On 2021-02-01 13:12, Keqian Zhu wrote:
> Hi Marc,
> 
> Do you have time to have a look at this? Thanks ;-)

Not immediately. I'm busy with stuff that is planned to go
in 5.12, which isn't the case for this series. I'll get to
it eventually.

Thanks,

         M.
zhukeqian Feb. 1, 2021, 1:25 p.m. UTC | #3
On 2021/2/1 21:17, Marc Zyngier wrote:
> On 2021-02-01 13:12, Keqian Zhu wrote:
>> Hi Marc,
>>
>> Do you have time to have a look at this? Thanks ;-)
> 
> Not immediately. I'm busy with stuff that is planned to go
> in 5.12, which isn't the case for this series. I'll get to
> it eventually.
> 
> Thanks,
> 
>         M.
Sure, I am in no hurry. Please concentrate on your urgent work first. ;-) Thanks.

Keqian.
zhukeqian March 2, 2021, 11:23 a.m. UTC | #4
Hi everyone,

Any comments are welcome :).

Thanks,
Keqian
