Message ID | 3a383e563cc57c77320af805c8b8ece4e68eebea.1727630000.git.yong.huang@smartx.com
---|---
State | New, archived
Series | migration: auto-converge refinements for huge VM
On Mon, Sep 30, 2024 at 01:14:28AM +0800, yong.huang@smartx.com wrote:
> From: Hyman Huang <yong.huang@smartx.com>
>
> Currently, the convergence algorithm determines that the migration
> cannot converge according to the following principle:
> The dirty pages generated in the current iteration exceed a specific
> percentage (throttle-trigger-threshold, 50 by default) of the number
> of transmissions. Let's refer to this criterion as the "dirty rate".
> If this criterion is met twice or more (dirty_rate_high_cnt >= 2),
> the throttle percentage is increased.
>
> In most cases, the above implementation is appropriate. However, for
> a VM under high memory load, each iteration is time-consuming.
> The VM's computing performance may be throttled at a high percentage
> for a long time due to the repeated confirmation behavior, which may
> be intolerable for some computationally sensitive software in the VM.
>
> As the comment in the migration_trigger_throttle function mentions,
> the original algorithm confirms the criterion repeatedly in order to
> avoid erroneous detection. Put differently, the criterion does not
> need to be validated again once the detection is more reliable.
>
> To make the detection more accurate, this refinement introduces
> another criterion, the "dirty ratio", to determine migration
> convergence. The "dirty ratio" is the ratio of bytes_dirty_period to
> bytes_xfer_period. When the algorithm repeatedly detects that the
> "dirty ratio" of the current sync is no lower than the previous one,
> it determines that the migration cannot converge. If either of the
> two criteria ("dirty rate" or "dirty ratio") is met, the penalty
> percentage is increased. This makes the CPU throttle respond faster,
> shortening each iteration and therefore reducing the time during
> which the VM's performance is degraded.
>
> In conclusion, this refinement significantly reduces the time
> required for the throttle percentage to step to its maximum while
> the VM is under high memory load.

I'm a bit lost on why patches 2-3 are still needed if patch 1 works.
Wouldn't that greatly increase the chance of the throttle code being
invoked already? Why do we still need this?

Thanks,
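As a side note for readers, the two criteria in the commit message above can be sketched in a few lines of standalone C. This is illustrative only, not QEMU code: the helper names and the sample byte counts are hypothetical; only the arithmetic mirrors the patch further down the thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* "dirty rate": dirty bytes exceed threshold% of the bytes transferred
 * in the same period (throttle-trigger-threshold, 50 by default). */
static bool dirty_rate_high(uint64_t bytes_dirty_period,
                            uint64_t bytes_xfer_period, uint64_t threshold)
{
    return bytes_dirty_period > bytes_xfer_period * threshold / 100;
}

/* "dirty ratio": dirty bytes as a percentage of transferred bytes.
 * A ratio above the threshold that is no lower than the previous
 * sync's suggests the migration is not trending toward convergence. */
static bool dirty_ratio_high(uint64_t curr_pct, uint64_t prev_pct,
                             uint64_t threshold)
{
    return prev_pct != 0 && curr_pct > threshold && curr_pct >= prev_pct;
}

int main(void)
{
    /* Hypothetical sync: 2000 bytes sent, 1500 dirtied -> ratio 75%.
     * Pretend the previous sync's ratio was 80%: the ratio improved,
     * so only the "dirty rate" criterion fires here. */
    uint64_t xfer = 2000, dirty = 1500, threshold = 50;
    uint64_t curr_pct = 100 * dirty / xfer;

    printf("dirty rate high:  %d\n", dirty_rate_high(dirty, xfer, threshold));
    printf("dirty ratio high: %d\n", dirty_ratio_high(curr_pct, 80, threshold));
    return 0;
}
```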
On Tue, Oct 1, 2024 at 4:47 AM Peter Xu <peterx@redhat.com> wrote:

> On Mon, Sep 30, 2024 at 01:14:28AM +0800, yong.huang@smartx.com wrote:
> > From: Hyman Huang <yong.huang@smartx.com>
> >
> > [...]
>
> I'm a bit lost on why patches 2-3 are still needed if patch 1 works.
> Wouldn't that greatly increase the chance of the throttle code being
> invoked already? Why do we still need this?

Indeed, if we are considering how to increase the chance of throttling,
patch 1 is sufficient, and I'm not insisting.

If we are talking about how to detect migration convergence, this
patch, IMHO, is still helpful. Anyway, it depends on your judgment. :)

> Thanks,
>
> --
> Peter Xu

Yong
On Tue, Oct 01, 2024 at 10:18:54AM +0800, Yong Huang wrote:
> On Tue, Oct 1, 2024 at 4:47 AM Peter Xu <peterx@redhat.com> wrote:
> > [...]
> >
> > I'm a bit lost on why patches 2-3 are still needed if patch 1 works.
> > Wouldn't that greatly increase the chance of the throttle code being
> > invoked already? Why do we still need this?
>
> Indeed, if we are considering how to increase the chance of throttling,
> patch 1 is sufficient, and I'm not insisting.
>
> If we are talking about how to detect migration convergence, this
> patch, IMHO, is still helpful. Anyway, it depends on your judgment. :)

Thanks. I really hope we can stick with patch 1 only for now, and leave
patches like 2-3 for the future, or probably never.

I want to avoid more magical tunables, and I want to avoid making the
code harder to read. Unlike most other migration features, auto-converge
is so far already pretty heavy on the "engineering" aspect of things.
More people care about downtimes of 100ms or even less, so it makes zero
sense that a throttle feature can easily stop a group of vCPUs for
longer than that.

I hope we can unite more dev/qe resources on postcopy across the QEMU
community for enterprise users. PoCs are always good stuff for QEMU as
it's a community project and people experiment with things on it, but I
hope at least at the design level, not with small tunables like this
one. We could have introduced 10 more tunables all over, fed them to an
AI, and trained some numbers so that migration improves by 10%, but IMHO
that doesn't hugely help.

If you really care about convergence issues, I want to know whether you
agree that postcopy is the better way to go. There are still plenty of
things we can do better in that area, either in postcopy in general or
in the downtime optimizations that lots of people are working on (e.g.
VFIO's), so again IMHO it'll be good if we stay focused there.

Thanks,
On Tue, Oct 1, 2024 at 11:37 PM Peter Xu <peterx@redhat.com> wrote:

> On Tue, Oct 01, 2024 at 10:18:54AM +0800, Yong Huang wrote:
> > [...]
>
> Thanks. I really hope we can stick with patch 1 only for now, and leave
> patches like 2-3 for the future, or probably never.
>
> [...]
>
> If you really care about convergence issues, I want to know whether you
> agree that postcopy is the better way to go.

Agreed, postcopy deserves more attention with respect to refining huge
VM migration.

> There are still plenty of things we can do better in that area, either
> in postcopy in general or in the downtime optimizations that lots of
> people are working on (e.g. VFIO's), so again IMHO it'll be good if we
> stay focused there.
>
> Thanks,
>
> --
> Peter Xu

Thanks for sharing your idea; I'll drop these 2 patches in the next
version.

Yong
diff --git a/migration/ram.c b/migration/ram.c
index 995bae1ac9..c36fed5135 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -420,6 +420,12 @@ struct RAMState {
      * RAM migration.
      */
     unsigned int postcopy_bmap_sync_requested;
+
+    /*
+     * Ratio of bytes_dirty_period and bytes_xfer_period in the
+     * previous sync.
+     */
+    uint64_t dirty_ratio_pct;
 };
 
 typedef struct RAMState RAMState;
@@ -1019,6 +1025,43 @@ static void migration_dirty_limit_guest(void)
     trace_migration_dirty_limit_guest(quota_dirtyrate);
 }
 
+static bool migration_dirty_ratio_high(RAMState *rs)
+{
+    static int dirty_ratio_high_cnt;
+    uint64_t threshold = migrate_throttle_trigger_threshold();
+    uint64_t bytes_xfer_period =
+        migration_transferred_bytes() - rs->bytes_xfer_prev;
+    uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
+    bool dirty_ratio_high = false;
+    uint64_t prev, curr;
+
+    /* Calculate the dirty ratio percentage */
+    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
+
+    prev = rs->dirty_ratio_pct;
+    rs->dirty_ratio_pct = curr;
+
+    if (prev == 0) {
+        return false;
+    }
+
+    /*
+     * If the current dirty ratio is not lower than the previous one,
+     * determine that the migration does not converge.
+     */
+    if (curr > threshold && curr >= prev) {
+        trace_migration_dirty_ratio_high(curr, prev);
+        dirty_ratio_high_cnt++;
+    }
+
+    if (dirty_ratio_high_cnt >= 2) {
+        dirty_ratio_high = true;
+        dirty_ratio_high_cnt = 0;
+    }
+
+    return dirty_ratio_high;
+}
+
 static void migration_trigger_throttle(RAMState *rs)
 {
     uint64_t threshold = migrate_throttle_trigger_threshold();
@@ -1026,6 +1069,11 @@ static void migration_trigger_throttle(RAMState *rs)
         migration_transferred_bytes() - rs->bytes_xfer_prev;
     uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
     uint64_t bytes_dirty_threshold = bytes_xfer_period * threshold / 100;
+    bool dirty_ratio_high = false;
+
+    if (migrate_cpu_throttle_responsive() && (bytes_xfer_period != 0)) {
+        dirty_ratio_high = migration_dirty_ratio_high(rs);
+    }
 
     /*
      * The following detection logic can be refined later. For now:
@@ -1035,8 +1083,11 @@ static void migration_trigger_throttle(RAMState *rs)
      * twice, start or increase throttling.
      */
     if ((bytes_dirty_period > bytes_dirty_threshold) &&
-        (++rs->dirty_rate_high_cnt >= 2)) {
-        rs->dirty_rate_high_cnt = 0;
+        ((++rs->dirty_rate_high_cnt >= 2) || dirty_ratio_high)) {
+
+        rs->dirty_rate_high_cnt =
+            rs->dirty_rate_high_cnt >= 2 ? 0 : rs->dirty_rate_high_cnt;
+
         if (migrate_auto_converge()) {
             trace_migration_throttle();
             mig_throttle_guest_down(bytes_dirty_period,
diff --git a/migration/trace-events b/migration/trace-events
index 3f09e7f383..19a1ff7973 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -96,6 +96,7 @@ get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
 migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
+migration_dirty_ratio_high(uint64_t cur, uint64_t prev) "current ratio: %" PRIu64 " previous ratio: %" PRIu64
 migration_throttle(void) ""
 migration_dirty_limit_guest(int64_t dirtyrate) "guest dirty page rate limit %" PRIi64 " MB/s"
 ram_discard_range(const char *rbname, uint64_t start, size_t len) "%s: start: %" PRIx64 " %zx"
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3296f5244d..acdc1d6358 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2807,6 +2807,7 @@ static void test_migrate_auto_converge(void)
     migrate_set_parameter_int(from, "cpu-throttle-initial", init_pct);
     migrate_set_parameter_int(from, "cpu-throttle-increment", inc_pct);
     migrate_set_parameter_int(from, "max-cpu-throttle", max_pct);
+    migrate_set_parameter_bool(from, "cpu-throttle-responsive", true);
 
     /*
      * Set the initial parameters so that the migration could not converge
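For reference, the strike-counting behavior of migration_dirty_ratio_high above can be traced with a small standalone program. This is a minimal sketch, not QEMU code: the per-sync byte counts are hypothetical, and it deliberately omits details such as resetting the static counter once it reports true.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical per-sync totals (bytes): the guest dirties memory
     * faster each round, so the ratio climbs instead of dropping. */
    uint64_t xfer[]  = { 2000, 2000, 2000 };   /* bytes_xfer_period   */
    uint64_t dirty[] = { 1500, 1600, 1700 };   /* bytes_dirty_period  */
    uint64_t threshold = 50, prev = 0;         /* default trigger pct */
    int high_cnt = 0;                          /* dirty_ratio_high_cnt */

    for (int i = 0; i < 3; i++) {
        /* Same percentage computation as the patch. */
        uint64_t curr = 100 * (dirty[i] * 1.0 / xfer[i]);

        if (prev != 0 && curr > threshold && curr >= prev) {
            high_cnt++;   /* ratio did not improve: one more strike */
        }
        prev = curr;
        printf("sync %d: ratio %" PRIu64 "%%, strikes %d\n",
               i, curr, high_cnt);
    }
    /* After the third sync, strikes == 2: the patch would report
     * dirty_ratio_high and the throttle percentage would step up. */
    return 0;
}
```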