[RFC,v2,0/8] KVM: arm64: Implement SW/HW combined dirty log

Message ID: 20230825093528.1637-1-shameerali.kolothum.thodi@huawei.com

Message

Shameer Kolothum Aug. 25, 2023, 9:35 a.m. UTC
Hi,

This revives the RFC series[1] sent out by Zhu Keqian some time back,
which makes use of the hardware dirty bit modifier (DBM) feature
(FEAT_HAFDBS) for dirty page tracking.

One of the main drawbacks of using the hardware DBM feature for dirty
page tracking is the additional overhead of scanning the PTEs for dirty
pages[2]. Also, there are no vCPU page faults once we set the DBM bit,
which may result in a higher convergence time during guest migration.

This series tries to reduce these overheads by not setting the DBM bit
for all writeable pages during migration, and instead uses a combined
software (the current page fault mechanism) and hardware (set DBM)
approach for dirty page tracking.

As noted in RFC v1[1],
"The core idea is that we do not enable hardware dirty at start (do not
add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
PTs with hardware dirty enabled, so we do not need to scan all PTs."
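
To make the combined scheme concrete, the fault-path logic is roughly
as below. This is only an illustrative sketch: mark_page_dirty() is the
existing KVM helper, while stage2_set_dbm_range() and the group size of
64 PTEs stand in for the new pgtable interfaces added by this series.

/* Illustrative sketch only, not the actual code in the patches. */
#define DBM_GROUP_PTES	64

static void combined_dirty_track_fault(struct kvm *kvm, u64 ipa)
{
	u64 start = ALIGN_DOWN(ipa, DBM_GROUP_PTES * PAGE_SIZE);

	/* Software tracking: the write fault marks this page dirty. */
	mark_page_dirty(kvm, ipa >> PAGE_SHIFT);

	/*
	 * Hardware tracking: set the DBM bit on the nearby PTEs, so
	 * later writes to them are recorded by hardware (write
	 * permission granted on access) without further vCPU faults.
	 */
	stage2_set_dbm_range(kvm, start, DBM_GROUP_PTES * PAGE_SIZE);
}

When syncing the dirty log, only the ranges where DBM was set need to
be scanned for hardware-dirtied entries, instead of the whole stage-2
table.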

Major changes from the RFC v1 are:

1. Rebased to 6.5-rc5 + the FEAT_TLBIRANGE series[3].
   The original RFC v1 was based on 5.11, and there have been multiple
   changes in KVM/arm64 since then that fundamentally changed the way
   the page tables are updated. I am not 100% sure that I got all the
   locking right during the page table traversals here, but I haven't
   seen any regressions or memory corruption so far in my test setup.

2. Use of ctx->flags for handling DBM updates (patch #2).

3. During migration, we can only set DBM for pages that are already
   writeable. But the CLEAR_LOG path will write-protect all the pages,
   and there isn't any easy way to distinguish previously read-only
   pages from these write-protected pages. Hence, the "Reserved for
   Software use" bits in the page descriptor are used to mark
   "writeable-clean" pages (see the sketch after this list). See
   patch #4.

4. Introduced KVM_CAP_ARM_HW_DBM for enabling this feature from
   userspace (a usage sketch follows below).
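
As a rough illustration of items 3 and 4 above (the bit choice and all
helper names here are mine for clarity, not necessarily what the
patches use): stage-2 descriptors reserve bits [58:55] for software
use, the architected DBM bit is bit 51, and S2AP[1] (bit 7) is the
stage-2 write permission, so a "writeable-clean" page can be tagged and
tested along these lines:

/* Illustrative sketch of the writeable-clean marking in patch #4. */
#define PTE_S2AP_W	BIT(7)	/* stage-2 write permission */
#define PTE_DBM		BIT(51)	/* hardware dirty bit modifier */
#define PTE_SW_WC	BIT(55)	/* software bit: writeable-clean */

/* CLEAR_LOG path: write-protect, but remember the page was writeable. */
static kvm_pte_t pte_mk_writeable_clean(kvm_pte_t pte)
{
	return (pte & ~PTE_S2AP_W) | PTE_SW_WC;
}

/* Only previously writeable pages are eligible for having DBM set. */
static bool pte_is_writeable_clean(kvm_pte_t pte)
{
	return !!(pte & PTE_SW_WC);
}

/*
 * With DBM set, hardware grants write permission on the first write,
 * so DBM plus write permission means the page was dirtied by hardware.
 */
static bool pte_is_hw_dirty(kvm_pte_t pte)
{
	return (pte & PTE_DBM) && (pte & PTE_S2AP_W);
}

For item 4, the userspace side would switch the capability on per-VM
through the usual KVM_ENABLE_CAP ioctl. A minimal sketch, with the cap
number made up here since the real value is assigned by this series
(see patch #8):

#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_ARM_HW_DBM
#define KVM_CAP_ARM_HW_DBM	230	/* placeholder value */
#endif

static int enable_hw_dbm(int vm_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_HW_DBM };

	/* Returns 0 on success, -1 with errno set otherwise. */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}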

Testing
-------
Hardware: HiSilicon ARM64 platform (without FEAT_TLBIRANGE)
Kernel: 6.5-rc5 based, with eager page split explicitly
        enabled (chunksize=2MB)

Tests with dirty_log_perf_test with anonymous THP pages show a
significant improvement in "dirty memory time", as expected, but with a
hit on "get dirty log time".

./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp

+---------------------------+----------------+------------------+
|                           |   6.5-rc5      | 6.5-rc5 + series |
|                           |     (s)        |       (s)        |
+---------------------------+----------------+------------------+
|    dirty memory time      |    4.22        |          0.41    |
|    get dirty log time     |    0.00047     |          3.25    |
|    clear dirty log time   |    0.48        |          0.98    |
+---------------------------+----------------+------------------+
       
To get some idea of actual live migration performance, I created a VM
(96 vCPUs, 1GB), ran a redis-benchmark test and, while the test was in
progress, initiated a live migration (local).

redis-benchmark -t set -c 900 -n 5000000 --threads 96

The average of 5 runs shows that the benchmark finishes ~10% faster,
with a ~8% increase in "total time" for the migration.

+---------------------------+----------------+------------------+
|                           |    6.5-rc5     | 6.5-rc5 + series |
+---------------------------+----------------+------------------+
| [redis] 5000000 requests  |    79.428 s    |     71.49 s      |
| [info migrate] total time |    8438 ms     |     9097 ms      |
+---------------------------+----------------+------------------+
       
I also ran extensive VM migrations with a QEMU that computes md5
checksums of guest RAM. No regressions or memory corruption observed
so far.

It looks like this series will benefit VMs with write-intensive
workloads, improving guest uptime during migration.

Please take a look and let me know your feedback. Any help with further
tests and verification is really appreciated.

Thanks,
Shameer

1. https://lore.kernel.org/linux-arm-kernel/20210126124444.27136-1-zhukeqian1@huawei.com/
2. https://lore.kernel.org/linux-arm-kernel/20200525112406.28224-1-zhukeqian1@huawei.com/
3. https://lore.kernel.org/kvm/20230811045127.3308641-1-rananta@google.com/


Keqian Zhu (5):
  arm64: cpufeature: Add API to report system support of HWDBM
  KVM: arm64: Add some HW_DBM related pgtable interfaces
  KVM: arm64: Add some HW_DBM related mmu interfaces
  KVM: arm64: Only write protect selected PTE
  KVM: arm64: Start up SW/HW combined dirty log

Shameer Kolothum (3):
  KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support
  KVM: arm64: Set DBM for writeable-clean pages
  KVM: arm64: Add KVM_CAP_ARM_HW_DBM

 arch/arm64/include/asm/cpufeature.h  |  15 +++
 arch/arm64/include/asm/kvm_host.h    |   8 ++
 arch/arm64/include/asm/kvm_mmu.h     |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h |  53 ++++++++++
 arch/arm64/kernel/image-vars.h       |   2 +
 arch/arm64/kvm/arm.c                 | 138 ++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 139 +++++++++++++++++++++++++--
 arch/arm64/kvm/mmu.c                 |  50 +++++++++-
 include/uapi/linux/kvm.h             |   1 +
 9 files changed, 403 insertions(+), 10 deletions(-)

Comments

Oliver Upton Sept. 13, 2023, 5:30 p.m. UTC | #1
Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This revives the RFC series[1] sent out by Zhu Keqian some time back,
> which makes use of the hardware dirty bit modifier (DBM) feature
> (FEAT_HAFDBS) for dirty page tracking.
> 
> One of the main drawbacks of using the hardware DBM feature for dirty
> page tracking is the additional overhead of scanning the PTEs for
> dirty pages[2]. Also, there are no vCPU page faults once we set the
> DBM bit, which may result in a higher convergence time during guest
> migration.
> 
> This series tries to reduce these overheads by not setting the DBM
> bit for all writeable pages during migration, and instead uses a
> combined software (the current page fault mechanism) and hardware
> (set DBM) approach for dirty page tracking.
> 
> As noted in RFC v1[1],
> "The core idea is that we do not enable hardware dirty at start (do not
> add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> PTs with hardware dirty enabled, so we do not need to scan all PTs."

I'm unconvinced of the value of such a change.

What you're proposing here is complicated and I fear not easily
maintainable. Keeping the *two* sources of dirty state seems likely to
fail (eventually) with some very unfortunate consequences.

The optimization of enabling DBM on neighboring PTEs is presumptive of
the guest access pattern and could incur unnecessary scans of the
stage-2 page table w/ a sufficiently sparse guest access pattern.

> Tests with dirty_log_perf_test with anonymous THP pages show a
> significant improvement in "dirty memory time", as expected, but with
> a hit on "get dirty log time".
> 
> ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |       (s)        |
> +---------------------------+----------------+------------------+
> |    dirty memory time      |    4.22        |          0.41    |
> |    get dirty log time     |    0.00047     |          3.25    |
> |    clear dirty log time   |    0.48        |          0.98    |
> +---------------------------+----------------+------------------+

The vCPU:memory ratio you're testing doesn't seem representative of what
a typical cloud provider would be configuring, and the dirty log
collection is going to scale linearly with the size of guest memory.

Slow dirty log collection is going to matter a lot for VM blackout,
which from experience tends to be the most sensitive period of live
migration for guest workloads.

At least in our testing, the split GET/CLEAR dirty log ioctls
dramatically improved the performance of a write-protection based dirty
tracking scheme, as the false positive rate for dirtied pages is
significantly reduced. FWIW, this is what we use for doing LM on arm64 as
opposed to the D-bit implementation that we use on x86.
       
> To get some idea of actual live migration performance, I created a VM
> (96 vCPUs, 1GB), ran a redis-benchmark test and, while the test was
> in progress, initiated a live migration (local).
> 
> redis-benchmark -t set -c 900 -n 5000000 --threads 96
> 
> The average of 5 runs shows that the benchmark finishes ~10% faster,
> with a ~8% increase in "total time" for the migration.
> 
> +---------------------------+----------------+------------------+
> |                           |    6.5-rc5     | 6.5-rc5 + series |
> +---------------------------+----------------+------------------+
> | [redis] 5000000 requests  |    79.428 s    |     71.49 s      |
> | [info migrate] total time |    8438 ms     |     9097 ms      |
> +---------------------------+----------------+------------------+

Faster pre-copy performance would help the benchmark complete faster,
but the goal for a live migration should be to minimize the lost
computation for the entire operation. You'd need to test with a
continuous workload rather than one with a finite amount of work.

Also, do you know what live migration scheme you're using here?
Shameer Kolothum Sept. 14, 2023, 9:47 a.m. UTC | #2
Hi Oliver,

> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 13 September 2023 18:30
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
> 
> Hi Shameer,
> 
> On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> > Hi,
> >
> > This revives the RFC series[1] sent out by Zhu Keqian some time
> > back, which makes use of the hardware dirty bit modifier (DBM)
> > feature (FEAT_HAFDBS) for dirty page tracking.
> >
> > One of the main drawbacks of using the hardware DBM feature for
> > dirty page tracking is the additional overhead of scanning the PTEs
> > for dirty pages[2]. Also, there are no vCPU page faults once we set
> > the DBM bit, which may result in a higher convergence time during
> > guest migration.
> >
> > This series tries to reduce these overheads by not setting the DBM
> > bit for all writeable pages during migration, and instead uses a
> > combined software (the current page fault mechanism) and hardware
> > (set DBM) approach for dirty page tracking.
> >
> > As noted in RFC v1[1],
> > "The core idea is that we do not enable hardware dirty at start (do not
> > add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> > for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> > DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> > PTs with hardware dirty enabled, so we do not need to scan all PTs."
> 
> I'm unconvinced of the value of such a change.
> 
> What you're proposing here is complicated and I fear not easily
> maintainable. Keeping the *two* sources of dirty state seems likely to
> fail (eventually) with some very unfortunate consequences.

It does add complexity to the dirty state management code. I have tried
to separate the code paths using appropriate FLAGS etc. to make it more
manageable. But this is probably one area we can work on if the overall
approach does show some benefits.

> The optimization of enabling DBM on neighboring PTEs is presumptive of
> the guest access pattern and could incur unnecessary scans of the
> stage-2 page table w/ a sufficiently sparse guest access pattern.

Agree. This may not work as intended for all workloads, especially if
the access pattern is sparse. But I am still hopeful that it will be
beneficial for workloads that have continuous write patterns. And we do
have a knob to turn it on or off.

> > Tests with dirty_log_perf_test with anonymous THP pages show a
> > significant improvement in "dirty memory time", as expected, but
> > with a hit on "get dirty log time".
> >
> > ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> >
> > +---------------------------+----------------+------------------+
> > |                           |   6.5-rc5      | 6.5-rc5 + series |
> > |                           |     (s)        |       (s)        |
> > +---------------------------+----------------+------------------+
> > |    dirty memory time      |    4.22        |          0.41    |
> > |    get dirty log time     |    0.00047     |          3.25    |
> > |    clear dirty log time   |    0.48        |          0.98    |
> > +---------------------------+----------------+------------------+
> 
> The vCPU:memory ratio you're testing doesn't seem representative of what
> a typical cloud provider would be configuring, and the dirty log
> collection is going to scale linearly with the size of guest memory.

I was limited by the test setup I had. I will give it a go on a system
with more memory.
 
> Slow dirty log collection is going to matter a lot for VM blackout,
> which from experience tends to be the most sensitive period of live
> migration for guest workloads.
> 
> At least in our testing, the split GET/CLEAR dirty log ioctls
> dramatically improved the performance of a write-protection based dirty
> tracking scheme, as the false positive rate for dirtied pages is
> significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> opposed to the D-bit implementation that we use on x86.

I guess by D-bit on x86 you mean the PML feature. Unfortunately, that
is something we still lack on ARM.
 
> > To get some idea of actual live migration performance, I created a
> > VM (96 vCPUs, 1GB), ran a redis-benchmark test and, while the test
> > was in progress, initiated a live migration (local).
> >
> > redis-benchmark -t set -c 900 -n 5000000 --threads 96
> >
> > The average of 5 runs shows that the benchmark finishes ~10%
> > faster, with a ~8% increase in "total time" for the migration.
> >
> > +---------------------------+----------------+------------------+
> > |                           |    6.5-rc5     | 6.5-rc5 + series |
> > +---------------------------+----------------+------------------+
> > | [redis] 5000000 requests  |    79.428 s    |     71.49 s      |
> > | [info migrate] total time |    8438 ms     |     9097 ms      |
> > +---------------------------+----------------+------------------+
> 
> Faster pre-copy performance would help the benchmark complete faster,
> but the goal for a live migration should be to minimize the lost
> computation for the entire operation. You'd need to test with a
> continuous workload rather than one with a finite amount of work.

Ok. Though the above is not representative of a real workload, I
thought it gives some idea of how the guest uptime improvement benefits
the overall availability of the workload during migration. I will check
within our wider team to see if I can set up a more suitable
test/workload to show some improvement with this approach.

Please let me know if there is a specific workload you have in mind.

> Also, do you know what live migration scheme you're using here?

The above is the default one (pre-copy).

Thanks for getting back on this. I would appreciate it if you could
take a quick glance through the rest of the patches as well for any
gross errors, especially with respect to the page table walk locking
and the usage of the DBM flags.

Thanks,
Shameer
Oliver Upton Sept. 15, 2023, 12:36 a.m. UTC | #3
On Thu, Sep 14, 2023 at 09:47:48AM +0000, Shameerali Kolothum Thodi wrote:

[...]

> > What you're proposing here is complicated and I fear not easily
> > maintainable. Keeping the *two* sources of dirty state seems likely to
> > fail (eventually) with some very unfortunate consequences.
> 
> It does add complexity to the dirty state management code. I have
> tried to separate the code paths using appropriate FLAGS etc. to make
> it more manageable. But this is probably one area we can work on if
> the overall approach does show some benefits.

I'd be a bit more amenable to a solution that would select either
write-protection or dirty state management, but not both.

> > The vCPU:memory ratio you're testing doesn't seem representative of what
> > a typical cloud provider would be configuring, and the dirty log
> > collection is going to scale linearly with the size of guest memory.
> 
> I was limited by the test setup I had. I will give it a go with a higher mem
> system. 

Thanks. Dirty log collection needn't be single threaded, but the
fundamental concern of dirty log collection time scaling linearly with
the size of memory remains. Write-protection helps spread the cost of
collecting dirty state out across all the vCPU threads.

There could be some value in giving userspace the ability to parallelize
calls to dirty log ioctls to work on non-intersecting intervals.
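
As a strawman, the KVM_CLEAR_DIRTY_LOG uapi already describes a range
per call, so userspace could fan disjoint intervals out across worker
threads; today these calls serialize on slots_lock on the kernel side,
so this is only a sketch of the intent (bitmap_for(), slot, and
pages_per_worker are assumed helpers/values, not an existing API):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Per-worker clear of a disjoint interval of a memslot's dirty log. */
static void clear_interval(int vm_fd, __u32 slot, __u32 worker_id,
			   __u32 pages_per_worker)
{
	struct kvm_clear_dirty_log log = {
		.slot = slot,
		.first_page = (__u64)worker_id * pages_per_worker,
		.num_pages = pages_per_worker,	/* multiple of 64 per the API */
		.dirty_bitmap = bitmap_for(worker_id),
	};

	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &log);
}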

> > Slow dirty log collection is going to matter a lot for VM blackout,
> > which from experience tends to be the most sensitive period of live
> > migration for guest workloads.
> > 
> > At least in our testing, the split GET/CLEAR dirty log ioctls
> > dramatically improved the performance of a write-protection based dirty
> > tracking scheme, as the false positive rate for dirtied pages is
> > significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> > opposed to the D-bit implementation that we use on x86.
> 
> > I guess by D-bit on x86 you mean the PML feature. Unfortunately,
> > that is something we still lack on ARM.

Sorry, this was rather nonspecific. I was describing the pre-copy
strategies we're using at Google (out of tree). We're carrying patches
to use EPT D-bit for exitless dirty tracking.

> > Faster pre-copy performance would help the benchmark complete faster,
> > but the goal for a live migration should be to minimize the lost
> > computation for the entire operation. You'd need to test with a
> > continuous workload rather than one with a finite amount of work.
> 
> Ok. Though the above is not representative of a real workload, I
> thought it gives some idea of how the guest uptime improvement
> benefits the overall availability of the workload during migration. I
> will check within our wider team to see if I can set up a more
> suitable test/workload to show some improvement with this approach.
> 
> Please let me know if there is a specific workload you have in mind.

No objection to the workload you've chosen; I'm more concerned about
the benchmark finishing before live migration completes.

What I'm looking for is something like this:

 - Calculate the ops/sec your benchmark completes in steady state

 - Do a live migration and sample the rate throughout the benchmark,
   accounting for VM blackout time

 - Calculate the area under the curve of:

     y = steady_state_rate - live_migration_rate(t)

 - Compare the area under the curve for write-protection and your DBM
   approach.
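
(As a concrete sketch of the last two steps: with throughput sampled at
a fixed interval across the migration, the lost work is just a
trapezoidal sum; illustrative code, not tied to any particular tool:)

#include <stddef.h>

/*
 * Integrate y = steady_state_rate - live_migration_rate(t) over the
 * migration window. rate[] holds n ops/sec samples taken every dt
 * seconds; the result is the total lost work in operations.
 */
static double lost_ops(const double *rate, size_t n, double dt,
		       double steady)
{
	double area = 0.0;
	size_t i;

	for (i = 1; i < n; i++)
		area += 0.5 * ((steady - rate[i - 1]) +
			       (steady - rate[i])) * dt;

	return area;
}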

> Thanks for getting back on this. I would appreciate it if you could
> take a quick glance through the rest of the patches as well for any
> gross errors, especially with respect to the page table walk locking
> and the usage of the DBM flags.

I'll give it a read when I have some spare cycles. To be entirely clear,
I don't have any fundamental objections to using DBM for dirty tracking.
I just want to make sure that all alternatives have been considered
in the current scheme before we seriously consider a new approach with
its own set of tradeoffs.
Shameer Kolothum Sept. 18, 2023, 9:55 a.m. UTC | #4
> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 15 September 2023 01:36
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
> 
> On Thu, Sep 14, 2023 at 09:47:48AM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> [...]
> 
> > > What you're proposing here is complicated and I fear not easily
> > > maintainable. Keeping the *two* sources of dirty state seems likely to
> > > fail (eventually) with some very unfortunate consequences.
> >
> > It does add complexity to the dirty state management code. I have
> > tried to separate the code paths using appropriate FLAGS etc. to
> > make it more manageable. But this is probably one area we can work
> > on if the overall approach does show some benefits.
> 
> I'd be a bit more amenable to a solution that would select either
> write-protection or dirty state management, but not both.
> 
> > > The vCPU:memory ratio you're testing doesn't seem representative of
> what
> > > a typical cloud provider would be configuring, and the dirty log
> > > collection is going to scale linearly with the size of guest memory.
> >
> > I was limited by the test setup I had. I will give it a go on a
> > system with more memory.
> 
> Thanks. Dirty log collection needn't be single threaded, but the
> fundamental concern of dirty log collection time scaling linearly
> with the size of memory remains. Write-protection helps spread the
> cost of collecting dirty state out across all the vCPU threads.
> 
> There could be some value in giving userspace the ability to parallelize
> calls to dirty log ioctls to work on non-intersecting intervals.
> 
> > > Slow dirty log collection is going to matter a lot for VM blackout,
> > > which from experience tends to be the most sensitive period of live
> > > migration for guest workloads.
> > >
> > > At least in our testing, the split GET/CLEAR dirty log ioctls
> > > dramatically improved the performance of a write-protection based
> > > dirty tracking scheme, as the false positive rate for dirtied
> > > pages is significantly reduced. FWIW, this is what we use for
> > > doing LM on arm64 as opposed to the D-bit implementation that we
> > > use on x86.
> >
> > I guess by D-bit on x86 you mean the PML feature. Unfortunately,
> > that is something we still lack on ARM.
> 
> Sorry, this was rather nonspecific. I was describing the pre-copy
> strategies we're using at Google (out of tree). We're carrying patches
> to use EPT D-bit for exitless dirty tracking.

Just curious: how does it handle the overheads associated with scanning
for dirty pages, and the convergence w.r.t. a high rate of dirtying, in
exitless mode?
 
> > > Faster pre-copy performance would help the benchmark complete faster,
> > > but the goal for a live migration should be to minimize the lost
> > > computation for the entire operation. You'd need to test with a
> > > continuous workload rather than one with a finite amount of work.
> >
> > Ok. Though the above is not representative of a real workload, I
> > thought it gives some idea of how the guest uptime improvement
> > benefits the overall availability of the workload during migration.
> > I will check within our wider team to see if I can set up a more
> > suitable test/workload to show some improvement with this approach.
> >
> > Please let me know if there is a specific workload you have in mind.
> 
> No objection to the workload you've chosen; I'm more concerned about
> the benchmark finishing before live migration completes.
> 
> What I'm looking for is something like this:
> 
>  - Calculate the ops/sec your benchmark completes in steady state
> 
>  - Do a live migration and sample the rate throughout the benchmark,
>    accounting for VM blackout time
> 
>  - Calculate the area under the curve of:
> 
>      y = steady_state_rate - live_migration_rate(t)
> 
>  - Compare the area under the curve for write-protection and your DBM
>    approach.

Ok. Got it.

> > Thanks for getting back on this. I would appreciate it if you could
> > take a quick glance through the rest of the patches as well for any
> > gross errors, especially with respect to the page table walk
> > locking and the usage of the DBM flags.
> 
> I'll give it a read when I have some spare cycles. To be entirely clear,
> I don't have any fundamental objections to using DBM for dirty tracking.
> I just want to make sure that all alternatives have been considered
> in the current scheme before we seriously consider a new approach with
> its own set of tradeoffs.

Thanks for taking a look.

Shameer
Oliver Upton Sept. 20, 2023, 9:12 p.m. UTC | #5
On Mon, Sep 18, 2023 at 09:55:22AM +0000, Shameerali Kolothum Thodi wrote:

[...]

> > Sorry, this was rather nonspecific. I was describing the pre-copy
> > strategies we're using at Google (out of tree). We're carrying patches
> > to use EPT D-bit for exitless dirty tracking.
> 
> Just curious: how does it handle the overheads associated with
> scanning for dirty pages, and the convergence w.r.t. a high rate of
> dirtying, in exitless mode?

A pool of kthreads, which really isn't a good solution at all. The
'better' way to do it would be to add some back pressure to the guest
such that your pre-copy transfer can converge with the guest and use the
freed up CPU time to manage the dirty state.

But hopefully we can make that a userspace issue.
Shameer Kolothum Oct. 12, 2023, 7:51 a.m. UTC | #6
Hi,

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-bounces@lists.infradead.org] On Behalf Of
> Shameerali Kolothum Thodi
> Sent: 18 September 2023 10:55
> To: Oliver Upton <oliver.upton@linux.dev>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: RE: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
 
[...]

> > > Please let me know if there is a specific workload you have in mind.
> >
> > No objection to the workload you've chosen, I'm more concerned about
> the
> > benchmark finishing before live migration completes.
> >
> > What I'm looking for is something like this:
> >
> >  - Calculate the ops/sec your benchmark completes in steady state
> >
> >  - Do a live migration and sample the rate throughout the benchmark,
> >    accounting for VM blackout time
> >
> >  - Calculate the area under the curve of:
> >
> >      y = steady_state_rate - live_migration_rate(t)
> >
> >  - Compare the area under the curve for write-protection and your DBM
> >    approach.
> 
> Ok. Got it.

I attempted to benchmark the performance of this series better, as suggested above.

I used memcached/memaslap instead of redis-benchmark, as this tool
seems to dirty memory at a faster rate than redis-benchmark in my
setup.

./memaslap -s 127.0.0.1:11211 -S 1s  -F ./memslap.cnf -T 96 -c 96 -t 20m

Please find the Google Sheets link below for the charts that compare
the average throughput rates during the migration time window for the
6.5-org and 6.5-kvm-dbm branches.

https://docs.google.com/spreadsheets/d/1T2F94Lsjpx080hW8OSxwbTJXihbXDNlTE1HjWCC0J_4/edit?usp=sharing

Sheet #1: autoconverge=on with the default settings (initial-throttle 20 & increment 10).

As you can see from the charts, the kvm-dbm branch throughput during
the migration window is considerably higher than that of the original
branch. But the time to converge and finish the migration increases at
almost the same rate for KVM-DBM. This in effect results in a decreased
overall average throughput when compared over the same time window as
the original branch.

Sheet #2: autoconverge=on with throttle-increment set to 15 for the kvm-dbm branch run.

However, if we increase the migration throttling rate for the kvm-dbm
branch, it looks like we can still have better throughput during the
migration time window, and also an overall higher throughput rate, with
the KVM-DBM solution.
 
Sheet #3: captures the dirty_log_perf_test times vs. memory per vCPU.

This is also in line with the above results. KVM-DBM has a better,
roughly constant dirty memory time, compared to the linear increase
noted for the original, but it is just the opposite for the get dirty
log time.

From the above, it looks to me like there is value in using HW DBM for
write-intensive workloads if we adjust the CPU throttling in userspace.

Please take a look and let me know your feedback/thoughts.

Thanks,
Shameer