Message ID: 20241027100751.219214-1-yishaih@nvidia.com (mailing list archive)
Series: Enhances the vfio-virtio driver to support live migration
On Sun, 27 Oct 2024 12:07:44 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > > - According to the Virtio specification, a device has only two states: > RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g., > RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When > transitioning to RUNNING_P2P, the device state is set to STOP and > remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at > which point it resumes its RUNNING state. Does this assume the virtio device is not a DMA target for another device? If so, how can we make such an assumption? Otherwise, what happens on a DMA write to the stopped virtio device? > - Furthermore, the Virtio specification does not support reading partial > or incremental device contexts. This means that during the PRE_COPY > state, the vfio-virtio driver reads the full device state. This step is > beneficial because it allows the device to send some "initial data" > before moving to the STOP_COPY state, thus reducing downtime by > preparing early. To avoid an infinite number of device calls during > PRE_COPY, the vfio-virtio driver limits this flow to a maximum of 128 > calls. After reaching this limit, the driver will report zero bytes > remaining in PRE_COPY, signaling to QEMU to transition to STOP_COPY. If the virtio spec doesn't support partial contexts, what makes it beneficial here? Can you qualify to what extent this initial data improves the overall migration performance? If it is beneficial, why is it beneficial to send initial data more than once? In particular, what heuristic supports capping iterations at 128? The code also only indicates this is to prevent infinite iterations. Would it be better to rate-limit calls, by reporting no data available for some time interval after the previous call? Thanks, Alex
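As a rough illustration of the state mapping under discussion, the sketch below shows how the VFIO migration states could collapse onto the two states the Virtio specification defines, with RUNNING_P2P already stopping the device so that RUNNING_P2P<->STOP becomes a no-op. This is a minimal sketch, not the driver source; the enum and function names are invented for illustration, and it deliberately says nothing about how incoming DMA to a stopped device is handled.

```c
/*
 * Minimal sketch, not the vfio-virtio driver source: how the VFIO
 * migration states discussed above could collapse onto the two states
 * the Virtio spec defines. All identifiers here are illustrative; the
 * real driver works with the VFIO_DEVICE_STATE_* uAPI values.
 */
#include <stdio.h>

enum vfio_mig_state { VFIO_RUNNING, VFIO_RUNNING_P2P, VFIO_PRE_COPY, VFIO_STOP, VFIO_STOP_COPY };
enum virtio_dev_state { VIRTIO_DEV_RUNNING, VIRTIO_DEV_STOPPED };

static enum virtio_dev_state virtio_state_for(enum vfio_mig_state s)
{
	switch (s) {
	case VFIO_RUNNING:
	case VFIO_PRE_COPY:	/* device keeps running while its context is read */
		return VIRTIO_DEV_RUNNING;
	case VFIO_RUNNING_P2P:	/* device is already stopped here...         */
	case VFIO_STOP:		/* ...so RUNNING_P2P->STOP is a no-op        */
	case VFIO_STOP_COPY:
	default:
		return VIRTIO_DEV_STOPPED;
	}
}

int main(void)
{
	printf("RUNNING_P2P maps to %s\n",
	       virtio_state_for(VFIO_RUNNING_P2P) == VIRTIO_DEV_STOPPED ?
	       "STOPPED" : "RUNNING");
	return 0;
}
```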
On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > On Sun, 27 Oct 2024 12:07:44 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > > > > - According to the Virtio specification, a device has only two states: > RUNNING and STOPPED. Consequently, certain VFIO transitions (e.g., > RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When > transitioning to RUNNING_P2P, the device state is set to STOP and > remains STOPPED until it transitions back from RUNNING_P2P->RUNNING, at > which point it resumes its RUNNING state. > > Does this assume the virtio device is not a DMA target for another > device? If so, how can we make such an assumption? Otherwise, what > happens on a DMA write to the stopped virtio device? I was told the virtio spec says that during VFIO STOP it only stops doing outgoing DMA; it still must accept incoming operations. It was a point of debate whether the additional step (stop everything vs stop outgoing only) was necessary, and the virtio folks felt that stopping outgoing only was good enough. > If the virtio spec doesn't support partial contexts, what makes it > beneficial here? It still lets the receiver 'warm up', like allocating memory and approximately sizing things. > If it is beneficial, why is it beneficial to send initial data more than > once? I guess because it is allowed to change and the benefit is highest when the pre-copy data closely matches the final data. Rate limiting does seem better to me. Jason
On Mon, 28 Oct 2024 13:23:54 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > If the virtio spec doesn't support partial contexts, what makes it > > beneficial here? > > It stil lets the receiver 'warm up', like allocating memory and > approximately sizing things. > > > If it is beneficial, why is it beneficial to send initial data more than > > once? > > I guess because it is allowed to change and the benefit is highest > when the pre copy data closely matches the final data.. It would be useful to see actual data here. For instance, what is the latency advantage to allocating anything in the warm-up and what's the probability that allocation is simply refreshed versus starting over? Re-sending the initial data up to some arbitrary cap sounds more like we're making a policy decision in the driver to consume more migration bandwidth for some unknown latency trade-off at stop-copy. I wonder if that advantage disappears if the pre-copy data is at all stale relative to the current device state. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Monday, October 28, 2024 10:24 PM > > On Mon, 28 Oct 2024 13:23:54 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > If the virtio spec doesn't support partial contexts, what makes it > > > beneficial here? > > > > It stil lets the receiver 'warm up', like allocating memory and > > approximately sizing things. > > > > > If it is beneficial, why is it beneficial to send initial data more than > > > once? > > > > I guess because it is allowed to change and the benefit is highest > > when the pre copy data closely matches the final data.. > > It would be useful to see actual data here. For instance, what is the latency > advantage to allocating anything in the warm-up and what's the probability > that allocation is simply refreshed versus starting over? > Allocating everything during the warm-up phase, compared to no allocation, reduced the total VM downtime from 439 ms to 128 ms. This was tested using two PCI VF hardware devices per VM. The benefit comes from the device state staying mostly the same. We tested with configurations ranging from 1 to 4 devices per VM, varying the vCPUs and memory. Also, more detailed test results are captured in Figure-2 on page 6 at [1]. The commit log of patch 7 should have included the perf summary table showing the value of that patch. Yishai, if you are planning to send the next revision, please add it. > Re-sending the initial data up to some arbitrary cap sounds more like we're > making a policy decision in the driver to consume more migration bandwidth > for some unknown latency trade-off at stop-copy. I wonder if that advantage > disappears if the pre-copy data is at all stale relative to the current device > state. Thanks, > You're right. If the pre-copy data differs significantly from the current device state, the benefits might be lost. However, this can also depend on the device's design. A more advanced device could apply a low-pass filter to avoid unnecessary refreshes. > Alex [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf
On Sun, 27 Oct 2024 12:07:44 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > This series enhances the vfio-virtio driver to support live migration > for virtio-net Virtual Functions (VFs) that are migration-capable. What's the status of making virtio-net VFs in QEMU migration capable? There would be some obvious benefits for the vfio migration ecosystem if we could validate migration of a functional device (ie. not mtty) in an L2 guest with no physical hardware dependencies. Thanks, Alex
On 28/10/2024 20:17, Alex Williamson wrote: > On Sun, 27 Oct 2024 12:07:44 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > >> This series enhances the vfio-virtio driver to support live migration >> for virtio-net Virtual Functions (VFs) that are migration-capable. > > What's the status of making virtio-net VFs in QEMU migration capable? Currently, we don’t have plans to make virtio-net VFs in QEMU migration-capable. > > There would be some obvious benefits for the vfio migration ecosystem > if we could validate migration of a functional device (ie. not mtty) in > an L2 guest with no physical hardware dependencies. Thanks, > > Alex > Right, software testing could be beneficial; however, we are not the ones who own or perform the software simulation and its related tasks inside QEMU. So this task could be considered in the future, once the host side is actually accepted. Yishai
On 28/10/2024 18:23, Jason Gunthorpe wrote: > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: >> On Sun, 27 Oct 2024 12:07:44 +0200 >> Yishai Hadas <yishaih@nvidia.com> wrote: >> If the virtio spec doesn't support partial contexts, what makes it >> beneficial here? > > It stil lets the receiver 'warm up', like allocating memory and > approximately sizing things. > >> If it is beneficial, why is it beneficial to send initial data more than >> once? > > I guess because it is allowed to change and the benefit is highest > when the pre copy data closely matches the final data.. > > Rate limiting does seem better to me > > Jason Right, given that the device state is likely to remain mostly unchanged over a certain period, using a rate limiter could be a sensible approach. So, in V1, I plan to replace the hard-coded limit value of 128 with a rate limiter that reports no data available for some time interval after the previous call. I would start with a one-second interval, which seems reasonable for this kind of device. Yishai
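As a rough picture of the rate-limited PRE_COPY flow proposed above, the sketch below models the decision a driver could make each time userspace polls for pre-copy data: expose a fresh full device context only if the interval has elapsed since the previous read, otherwise report zero bytes remaining. This is a minimal userspace-style sketch, not the vfio-virtio code; the structure, constant, and function names are assumptions made for illustration.

```c
/*
 * Minimal sketch, not the vfio-virtio driver source: a time-based rate
 * limiter for PRE_COPY reads, replacing a fixed cap of 128 reads.
 * All names here are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define PRE_COPY_MIN_INTERVAL_NS (1ULL * 1000 * 1000 * 1000)	/* 1 second */

struct precopy_state {
	uint64_t last_read_ns;	/* 0 until the first full context read */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/*
 * Called when userspace asks how much PRE_COPY data is pending: return
 * true to re-read and expose a fresh full device context, false to
 * report zero bytes remaining for now.
 */
static bool precopy_should_refresh(struct precopy_state *s)
{
	uint64_t now = now_ns();

	if (s->last_read_ns && now - s->last_read_ns < PRE_COPY_MIN_INTERVAL_NS)
		return false;

	s->last_read_ns = now;
	return true;
}

int main(void)
{
	struct precopy_state s = { 0 };

	/* First query refreshes; an immediate second one is throttled. */
	printf("%d %d\n", precopy_should_refresh(&s), precopy_should_refresh(&s));
	return 0;
}
```

With a one-second interval, the number of full device-context reads scales with how long pre-copy actually runs rather than being capped at a fixed count of 128.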
On Mon, 28 Oct 2024 17:46:57 +0000 Parav Pandit <parav@nvidia.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Monday, October 28, 2024 10:24 PM > > > > On Mon, 28 Oct 2024 13:23:54 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > If the virtio spec doesn't support partial contexts, what makes it > > > > beneficial here? > > > > > > It stil lets the receiver 'warm up', like allocating memory and > > > approximately sizing things. > > > > > > > If it is beneficial, why is it beneficial to send initial data more than > > > > once? > > > > > > I guess because it is allowed to change and the benefit is highest > > > when the pre copy data closely matches the final data.. > > > > It would be useful to see actual data here. For instance, what is the latency > > advantage to allocating anything in the warm-up and what's the probability > > that allocation is simply refreshed versus starting over? > > > > Allocating everything during the warm-up phase, compared to no > allocation, reduced the total VM downtime from 439 ms to 128 ms. This > was tested using two PCI VF hardware devices per VM. > > The benefit comes from the device state staying mostly the same. > > We tested with different configurations from 1 to 4 devices per VM, > varied with vcpus and memory. Also, more detailed test results are > captured in Figure-2 on page 6 at [1]. Those numbers seem to correspond to column 1 of Figure 2 in the referenced document, but that's looking only at downtime. To me that chart seems to show a step function where there's ~400ms of downtime per device, which suggests we're serializing device resume in the stop-copy phase on the target without pre-copy. Figure 3 appears to look at total VM migration time, where pre-copy tends to show marginal improvements in smaller configurations, but up to 60% worse overall migration time as the vCPU, device, and VM memory size increase. The paper comes to the conclusion: It can be concluded that either of increasing the VM memory or device configuration has equal effect on the VM total migration time, but no effect on the VM downtime due to pre-copy enablement. Noting specifically "downtime" here ignores that the overall migration time actually got worse with pre-copy. Between columns 10 & 11 the device count is doubled. With pre-copy enabled, the migration time increases by 135% while with pre-copy disabled we only see a 113% increase. Between columns 11 & 12 the VM memory is further doubled. This results in another 33% increase in migration time with pre-copy enabled and only a 3% increase with pre-copy disabled. For the most part this entire figure shows that overall migration time with pre-copy enabled is either on par with or worse than the same with pre-copy disabled. We then move on to Tables 1 & 2, which are again back to specifically showing timing of operations related to downtime rather than overall migration time. The notable thing here seems to be that we've amortized the 300ms per device load time across the pre-copy phase, leaving only 11ms per device contributing to downtime. However, the paper also goes into this tangent: Our observations indicate that enabling device-level pre-copy results in more pre-copy operations of the system RAM and device state. This leads to a 50% reduction in memory (RAM) copy time in the device pre-copy method in the micro-benchmark results, saving 100 milliseconds of downtime.
I'd argue that this is an anti-feature. A less generous interpretation is that pre-copy extended the migration time, likely resulting in more RAM transfer during pre-copy, potentially to the point that the VM undershot its prescribed downtime. Further analysis should also look at the total data transferred for the migration and adherence to the configured VM downtime, rather than just the absolute downtime. At the end of the paper, I think we come to the same conclusion shown in Figure 1, where device load seems to be serialized and therefore significantly limits scalability. That could be parallelized, but even 300-400ms for loading all devices is still too much contribution to downtime. I'd therefore agree that pre-loading the device during pre-copy improves the scaling by an order of magnitude, but it doesn't solve the scaling problem. Also, it should not come with the cost of drawing out pre-copy and thus the overall migration time to this extent. The reduction in downtime related to RAM copy time should be evidence that the pre-copy behavior here has exceeded its scope and is interfering with the balance between pre- and post- copy elsewhere. Thanks, Alex > > [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf >
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, October 30, 2024 1:58 AM > > On Mon, 28 Oct 2024 17:46:57 +0000 > Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Monday, October 28, 2024 10:24 PM > > > > > > On Mon, 28 Oct 2024 13:23:54 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > > > If the virtio spec doesn't support partial contexts, what makes > > > > > it beneficial here? > > > > > > > > It stil lets the receiver 'warm up', like allocating memory and > > > > approximately sizing things. > > > > > > > > > If it is beneficial, why is it beneficial to send initial data > > > > > more than once? > > > > > > > > I guess because it is allowed to change and the benefit is highest > > > > when the pre copy data closely matches the final data.. > > > > > > It would be useful to see actual data here. For instance, what is > > > the latency advantage to allocating anything in the warm-up and > > > what's the probability that allocation is simply refreshed versus starting over? > > > > > > > Allocating everything during the warm-up phase, compared to no > > allocation, reduced the total VM downtime from 439 ms to 128 ms. This > > was tested using two PCI VF hardware devices per VM. > > > > The benefit comes from the device state staying mostly the same. > > > > We tested with different configurations from 1 to 4 devices per VM, > > varied with vcpus and memory. Also, more detailed test results are > > captured in Figure-2 on page 6 at [1]. > > Those numbers seems to correspond to column 1 of Figure 2 in the > referenced document, but that's looking only at downtime. Yes. What do you mean by only looking at the downtime? The intention was to measure the downtime in various configurations. Do you mean we should have looked at migration bandwidth, the amount of migrated data, and migration time too? If so, yes, some of them were not considered, as the focus was on two things: (a) total VM downtime and (b) total migration time. But with the recent tests, we looked at more things; this is explained in more detail below. > To me that chart > seems to show a step function where there's ~400ms of downtime per > device, which suggests we're serializing device resume in the stop-copy > phase on the target without pre-copy. > Yes. Even without serialization, the same bottleneck can be observed with a single device. And your orthogonal suggestion of using parallelism is very useful. The paper captures this aspect in the text on page 7, after Table 2. > Figure 3 appears to look at total VM migration time, where pre-copy tends to > show marginal improvements in smaller configurations, but up to 60% worse > overall migration time as the vCPU, device, and VM memory size increase. > The paper comes to the conclusion: > > It can be concluded that either of increasing the VM memory or > device configuration has equal effect on the VM total migration > time, but no effect on the VM downtime due to pre-copy > enablement. > > Noting specifically "downtime" here ignores that the overall migration time > actually got worse with pre-copy. > > Between columns 10 & 11 the device count is doubled. With pre-copy > enabled, the migration time increases by 135% while with pre-copy disabled > we only only see a 113% increase. Between columns 11 & 12 the VM > memory is further doubled. This results in another 33% increase in > migration time with pre-copy enabled and only a 3% increase with pre-copy > disabled. For the most part this entire figure shows that overall migration > time with pre-copy enabled is either on par with or worse than the same > with pre-copy disabled. > I will answer this part in more detail towards the end of the email. > > We then move on to Tables 1 & 2, which are again back to specifically > showing timing of operations related to downtime rather than overall > migration time. Yes, because the objective was to analyze the effects on, and improvements in, downtime across the various device, VM, and pre-copy configurations. > The notable thing here seems to be that we've amortized > the 300ms per device load time across the pre-copy phase, leaving only 11ms > per device contributing to downtime. > Correct. > However, the paper also goes into this tangent: > > Our observations indicate that enabling device-level pre-copy > results in more pre-copy operations of the system RAM and > device state. This leads to a 50% reduction in memory (RAM) > copy time in the device pre-copy method in the micro-benchmark > results, saving 100 milliseconds of downtime. > > I'd argue that this is an anti-feature. A less generous interpretation is that > pre-copy extended the migration time, likely resulting in more RAM transfer > during pre-copy, potentially to the point that the VM undershot its > prescribed downtime. VM downtime was close to the configured downtime, on the slightly higher side. > Further analysis should also look at the total data > transferred for the migration and adherence to the configured VM > downtime, rather than just the absolute downtime. > We did look at the device-side total data transferred to see how many iterations of pre-copy were done. > At the end of the paper, I think we come to the same conclusion shown in > Figure 1, where device load seems to be serialized and therefore significantly > limits scalability. That could be parallelized, but even 300-400ms for loading > all devices is still too much contribution to downtime. I'd therefore agree > that pre-loading the device during pre-copy improves the scaling by an order > of magnitude, Yep. > but it doesn't solve the scaling problem. Yes, your suggestion is very valid. Parallel operation from QEMU would make the downtime even smaller. The paper also highlighted this on page 7, after Table 2. > Also, it should not > come with the cost of drawing out pre-copy and thus the overall migration > time to this extent. Right, you pointed this out correctly. So we did several more tests in the last two days based on the insights you provided, and found an interesting outcome. We collected 30+ samples each for (a) pre-copy enabled and (b) pre-copy disabled. This was done for columns 10 and 11. The VM total migration time varied in the range of 13 to 60 seconds. Most noticeably, it varied across this large range even with pre-copy off. In the paper it was pure coincidence that every time pre-copy=on had a higher migration time compared to pre-copy=on. This led us to wrongly conclude that pre-copy influenced the higher migration time. After some investigation, we found a QEMU anomaly which was fixed/overcome by the "avail-switchover-bandwidth" knob. Basically, the bandwidth calculation was not accurate, due to which the migration time fluctuated a lot. This problem and solution are described in [2]. Following the solution in [2], we ran the exact same tests for columns 10 and 11 with "avail-switchover-bandwidth" configured. With that, for both modes (pre-copy=on and off) the total migration time stayed constant at 14-15 seconds. And this conclusion aligns with your analysis that "pre-copy should not extend the migration time this much". Great finding, proving that Figure 3 was incomplete in the paper. > The reduction in downtime related to RAM copy time > should be evidence that the pre-copy behavior here has exceeded its scope > and is interfering with the balance between pre- and post- copy elsewhere. As I explained above, pre-copy did its job; it didn't interfere. We just did not have enough of the right samples to analyze back then. Now it is resolved. Thanks a lot for the direction. > Thanks, > > Alex > > [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf [2] https://lore.kernel.org/qemu-devel/20231010221922.40638-1-peterx@redhat.com/
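The role of "avail-switchover-bandwidth" described above can be pictured with a small model. The sketch below is not QEMU source; it is a hedged illustration, with invented names, of the switchover decision being discussed: the migration proceeds to the stop-copy phase only when the data still pending is estimated to fit within the configured downtime limit, so an inaccurate bandwidth estimate defers switchover and inflates total migration time, while a trusted override restores a stable decision.

```c
/*
 * Minimal sketch, not QEMU source: the switchover decision discussed in
 * the thread. All structure and field names are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mig_params {
	uint64_t downtime_limit_ms;          /* configured max blackout time         */
	uint64_t measured_bandwidth;         /* bytes/s estimated during pre-copy    */
	uint64_t avail_switchover_bandwidth; /* bytes/s supplied by admin, 0 = unset */
};

static bool should_switchover(const struct mig_params *p, uint64_t pending_bytes)
{
	uint64_t bw = p->avail_switchover_bandwidth ?
		      p->avail_switchover_bandwidth : p->measured_bandwidth;

	/* Bytes we could still push during the allowed downtime window. */
	uint64_t threshold = bw * p->downtime_limit_ms / 1000;

	return pending_bytes <= threshold;
}

int main(void)
{
	/* 1 GiB still pending, 300 ms downtime limit. */
	struct mig_params p = {
		.downtime_limit_ms = 300,
		.measured_bandwidth = 1ULL << 30,	/* underestimated: ~1 GiB/s */
		.avail_switchover_bandwidth = 0,
	};
	uint64_t pending = 1ULL << 30;

	printf("measured only: %d\n", should_switchover(&p, pending)); /* 0: deferred */
	p.avail_switchover_bandwidth = 10ULL << 30;	/* trusted ~10 GiB/s link */
	printf("with override: %d\n", should_switchover(&p, pending)); /* 1: proceeds */
	return 0;
}
```

This is only a model of the trade-off; the real decision in QEMU takes more inputs and bookkeeping into account than shown here.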
On Thu, 31 Oct 2024 15:04:51 +0000 Parav Pandit <parav@nvidia.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, October 30, 2024 1:58 AM > > > > On Mon, 28 Oct 2024 17:46:57 +0000 > > Parav Pandit <parav@nvidia.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Monday, October 28, 2024 10:24 PM > > > > > > > > On Mon, 28 Oct 2024 13:23:54 -0300 > > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > > > > > If the virtio spec doesn't support partial contexts, what makes > > > > > > it beneficial here? > > > > > > > > > > It stil lets the receiver 'warm up', like allocating memory and > > > > > approximately sizing things. > > > > > > > > > > > If it is beneficial, why is it beneficial to send initial data > > > > > > more than once? > > > > > > > > > > I guess because it is allowed to change and the benefit is highest > > > > > when the pre copy data closely matches the final data.. > > > > > > > > It would be useful to see actual data here. For instance, what is > > > > the latency advantage to allocating anything in the warm-up and > > > > what's the probability that allocation is simply refreshed versus starting > > over? > > > > > > > > > > Allocating everything during the warm-up phase, compared to no > > > allocation, reduced the total VM downtime from 439 ms to 128 ms. This > > > was tested using two PCI VF hardware devices per VM. > > > > > > The benefit comes from the device state staying mostly the same. > > > > > > We tested with different configurations from 1 to 4 devices per VM, > > > varied with vcpus and memory. Also, more detailed test results are > > > captured in Figure-2 on page 6 at [1]. > > > > Those numbers seems to correspond to column 1 of Figure 2 in the > > referenced document, but that's looking only at downtime. > Yes. > What do you mean by only looking at the downtime? It's just a prelude to my interpretation that the paper is focusing mostly on the benefits to downtime and downplaying the apparent longer overall migration time while rationalizing the effect on RAM migration downtime. > The intention was to measure the downtime in various configurations. > Do you mean, we should have looked at migration bandwidth, migration amount of data, migration time too? > If so, yes, some of them were not considered as the focus was on two things: > a. total VM downtime > b. total migration time > > But with recent tests, we looked at more things. Explained more below. Good. Yes, there should be a more holistic approach, improving the thing we intend to improve without degrading other aspects. > > To me that chart > > seems to show a step function where there's ~400ms of downtime per > > device, which suggests we're serializing device resume in the stop-copy > > phase on the target without pre-copy. > > > Yes. even without serialization, when there is single device, same bottleneck can be observed. > And your orthogonal suggestion of using parallelism is very useful. > The paper captures this aspect in text on page 7 after the Table 2. > > > Figure 3 appears to look at total VM migration time, where pre-copy tends to > > show marginal improvements in smaller configurations, but up to 60% worse > > overall migration time as the vCPU, device, and VM memory size increase. 
> > The paper comes to the conclusion: > > > > It can be concluded that either of increasing the VM memory or > > device configuration has equal effect on the VM total migration > > time, but no effect on the VM downtime due to pre-copy > > enablement. > > > > Noting specifically "downtime" here ignores that the overall migration time > > actually got worse with pre-copy. > > > > Between columns 10 & 11 the device count is doubled. With pre-copy > > enabled, the migration time increases by 135% while with pre-copy disabled > > we only only see a 113% increase. Between columns 11 & 12 the VM > > memory is further doubled. This results in another 33% increase in > > migration time with pre-copy enabled and only a 3% increase with pre-copy > > disabled. For the most part this entire figure shows that overall migration > > time with pre-copy enabled is either on par with or worse than the same > > with pre-copy disabled. > > > I will answer this part in more detail towards the end of the email. > > > We then move on to Tables 1 & 2, which are again back to specifically > > showing timing of operations related to downtime rather than overall > > migration time. > Yes, because the objective was to analyze the effects and improvements on downtime of various configurations of device, VM, pre-copy. > > > The notable thing here seems to be that we've amortized > > the 300ms per device load time across the pre-copy phase, leaving only 11ms > > per device contributing to downtime. > > > Correct. > > > However, the paper also goes into this tangent: > > > > Our observations indicate that enabling device-level pre-copy > > results in more pre-copy operations of the system RAM and > > device state. This leads to a 50% reduction in memory (RAM) > > copy time in the device pre-copy method in the micro-benchmark > > results, saving 100 milliseconds of downtime. > > > > I'd argue that this is an anti-feature. A less generous interpretation is that > > pre-copy extended the migration time, likely resulting in more RAM transfer > > during pre-copy, potentially to the point that the VM undershot its > > prescribed downtime. > VM downtime was close to the configured downtime, on slightly higher side. > > > Further analysis should also look at the total data > > transferred for the migration and adherence to the configured VM > > downtime, rather than just the absolute downtime. > > > We did look the device side total data transferred to see how many iterations of pre-copy done. > > > At the end of the paper, I think we come to the same conclusion shown in > > Figure 1, where device load seems to be serialized and therefore significantly > > limits scalability. That could be parallelized, but even 300-400ms for loading > > all devices is still too much contribution to downtime. I'd therefore agree > > that pre-loading the device during pre-copy improves the scaling by an order > > of magnitude, > Yep. > > but it doesn't solve the scaling problem. > Yes, your suggestion is very valid. > Parallel operation from the qemu would make the downtime even smaller. > The paper also highlighted this in page 7 after Table-2. > > > Also, it should not > > come with the cost of drawing out pre-copy and thus the overall migration > > time to this extent. > Right. You pointed out rightly. > So we did several more tests in last 2 days for insights you provided. > And found an interesting outcome. > > In 30+ samples, we collected for each, > (a) pre-copy enabled and > (b) pre-copy disabled. > > This was done for column 10 and 11. 
> > The VM total migration time varied in range of 13 seconds to 60 seconds. > Most noticeably with pre-copy off also it varied in such large range. > > In the paper it was pure co-incidence that every time pre-copy=on had > higher migration time compared to pre-copy=on. This led us to Assuming typo here, =on vs =off. > misguide that pre-copy influenced the higher migration time. > > After some reason, we found the QEMU anomaly which was fixed/overcome > by the knob " avail-switchover-bandwidth". Basically the bandwidth > calculation was not accurate, due to which the migration time > fluctuated a lot. This problem and solution are described in [2]. > > Following the solution_2, > We ran exact same tests of column 10 and 11, with " > avail-switchover-bandwidth" configured. With that for both the modes > pre-copy=on and off the total migration time stayed constant to 14-15 > seconds. > > And this conclusion aligns with your analysis that "pre-copy should > not extent the migration time to this much". Great finding, proving > that figure_3 was incomplete in the paper. Great! So with this the difference in downtime related to RAM migration in the trailing tables of the paper becomes negligible? Is this using the originally proposed algorithm of migrating device data up to 128 consecutive times or is it using rate-limiting of device data in pre-copy? Any notable differences between those algorithms? > > The reduction in downtime related to RAM copy time > > should be evidence that the pre-copy behavior here has exceeded its > > scope and is interfering with the balance between pre- and post- > > copy elsewhere. > As I explained above, pre-copy did its job, it didn't interfere. It > was just not enough and right samples to analyze back then. Now it is > resolved. Thanks a lot for the direction. Glad we could arrive at a better understanding overall. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, November 1, 2024 9:55 PM > > On Thu, 31 Oct 2024 15:04:51 +0000 > Parav Pandit <parav@nvidia.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, October 30, 2024 1:58 AM > > > > > > On Mon, 28 Oct 2024 17:46:57 +0000 > > > Parav Pandit <parav@nvidia.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Monday, October 28, 2024 10:24 PM > > > > > > > > > > On Mon, 28 Oct 2024 13:23:54 -0300 Jason Gunthorpe > > > > > <jgg@nvidia.com> wrote: > > > > > > > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote: > > > > > > > > > > > > > If the virtio spec doesn't support partial contexts, what > > > > > > > makes it beneficial here? > > > > > > > > > > > > It stil lets the receiver 'warm up', like allocating memory > > > > > > and approximately sizing things. > > > > > > > > > > > > > If it is beneficial, why is it beneficial to send initial > > > > > > > data more than once? > > > > > > > > > > > > I guess because it is allowed to change and the benefit is > > > > > > highest when the pre copy data closely matches the final data.. > > > > > > > > > > It would be useful to see actual data here. For instance, what > > > > > is the latency advantage to allocating anything in the warm-up > > > > > and what's the probability that allocation is simply refreshed > > > > > versus starting > > > over? > > > > > > > > > > > > > Allocating everything during the warm-up phase, compared to no > > > > allocation, reduced the total VM downtime from 439 ms to 128 ms. > > > > This was tested using two PCI VF hardware devices per VM. > > > > > > > > The benefit comes from the device state staying mostly the same. > > > > > > > > We tested with different configurations from 1 to 4 devices per > > > > VM, varied with vcpus and memory. Also, more detailed test results > > > > are captured in Figure-2 on page 6 at [1]. > > > > > > Those numbers seems to correspond to column 1 of Figure 2 in the > > > referenced document, but that's looking only at downtime. > > Yes. > > What do you mean by only looking at the downtime? > > It's just a prelude to my interpretation that the paper is focusing mostly on the > benefits to downtime and downplaying the apparent longer overall migration > time while rationalizing the effect on RAM migration downtime. > Now after the new debug we shared, we know the other areas too. > > The intention was to measure the downtime in various configurations. > > Do you mean, we should have looked at migration bandwidth, migration > amount of data, migration time too? > > If so, yes, some of them were not considered as the focus was on two > things: > > a. total VM downtime > > b. total migration time > > > > But with recent tests, we looked at more things. Explained more below. > > Good. Yes, there should be a more holistic approach, improving the thing we > intend to improve without degrading other aspects. > Yes. > > > To me that chart > > > seems to show a step function where there's ~400ms of downtime per > > > device, which suggests we're serializing device resume in the > > > stop-copy phase on the target without pre-copy. > > > > > Yes. even without serialization, when there is single device, same bottleneck > can be observed. > > And your orthogonal suggestion of using parallelism is very useful. > > The paper captures this aspect in text on page 7 after the Table 2. 
> > > > > Figure 3 appears to look at total VM migration time, where pre-copy > > > tends to show marginal improvements in smaller configurations, but > > > up to 60% worse overall migration time as the vCPU, device, and VM > memory size increase. > > > The paper comes to the conclusion: > > > > > > It can be concluded that either of increasing the VM memory or > > > device configuration has equal effect on the VM total migration > > > time, but no effect on the VM downtime due to pre-copy > > > enablement. > > > > > > Noting specifically "downtime" here ignores that the overall > > > migration time actually got worse with pre-copy. > > > > > > Between columns 10 & 11 the device count is doubled. With pre-copy > > > enabled, the migration time increases by 135% while with pre-copy > > > disabled we only only see a 113% increase. Between columns 11 & 12 > > > the VM memory is further doubled. This results in another 33% > > > increase in migration time with pre-copy enabled and only a 3% > > > increase with pre-copy disabled. For the most part this entire > > > figure shows that overall migration time with pre-copy enabled is > > > either on par with or worse than the same with pre-copy disabled. > > > > > I will answer this part in more detail towards the end of the email. > > > > > We then move on to Tables 1 & 2, which are again back to > > > specifically showing timing of operations related to downtime rather than > overall > > > migration time. > > Yes, because the objective was to analyze the effects and improvements on > downtime of various configurations of device, VM, pre-copy. > > > > > The notable thing here seems to be that we've amortized the 300ms > > > per device load time across the pre-copy phase, leaving only 11ms > > > per device contributing to downtime. > > > > > Correct. > > > > > However, the paper also goes into this tangent: > > > > > > Our observations indicate that enabling device-level pre-copy > > > results in more pre-copy operations of the system RAM and > > > device state. This leads to a 50% reduction in memory (RAM) > > > copy time in the device pre-copy method in the micro-benchmark > > > results, saving 100 milliseconds of downtime. > > > > > > I'd argue that this is an anti-feature. A less generous > > > interpretation is that pre-copy extended the migration time, likely > > > resulting in more RAM transfer during pre-copy, potentially to the point > that the VM undershot its > > > prescribed downtime. > > VM downtime was close to the configured downtime, on slightly higher side. > > > > > Further analysis should also look at the total data transferred for > > > the migration and adherence to the configured VM downtime, rather > > > than just the absolute downtime. > > > > > We did look the device side total data transferred to see how many iterations > of pre-copy done. > > > > > At the end of the paper, I think we come to the same conclusion > > > shown in Figure 1, where device load seems to be serialized and > > > therefore significantly limits scalability. That could be > > > parallelized, but even 300-400ms for loading all devices is still > > > too much contribution to downtime. I'd therefore agree that pre-loading > the device during pre-copy improves the scaling by an order > > > of magnitude, > > Yep. > > > but it doesn't solve the scaling problem. > > Yes, your suggestion is very valid. > > Parallel operation from the qemu would make the downtime even smaller. > > The paper also highlighted this in page 7 after Table-2. 
> > > > > Also, it should not > > > come with the cost of drawing out pre-copy and thus the overall migration > > > time to this extent. > > Right. You pointed out rightly. > > So we did several more tests in last 2 days for insights you provided. > > And found an interesting outcome. > > > > In 30+ samples, we collected for each, > > (a) pre-copy enabled and > > (b) pre-copy disabled. > > > > This was done for column 10 and 11. > > > > The VM total migration time varied in range of 13 seconds to 60 seconds. > > Most noticeably with pre-copy off also it varied in such large range. > > > > In the paper it was pure co-incidence that every time pre-copy=on had > > higher migration time compared to pre-copy=on. This led us to > > Assuming typo here, =on vs =off. > Correct it is pre-copy=off. > > misguide that pre-copy influenced the higher migration time. > > > > After some reason, we found the QEMU anomaly which was fixed/overcome > > by the knob " avail-switchover-bandwidth". Basically the bandwidth > > calculation was not accurate, due to which the migration time > > fluctuated a lot. This problem and solution are described in [2]. > > > > Following the solution_2, > > We ran exact same tests of column 10 and 11, with " > > avail-switchover-bandwidth" configured. With that for both the modes > > pre-copy=on and off the total migration time stayed constant to 14-15 > > seconds. > > > > And this conclusion aligns with your analysis that "pre-copy should > > not extent the migration time to this much". Great finding, proving > > that figure_3 was incomplete in the paper. > > Great! So with this the difference in downtime related to RAM migration in the > trailing tables of the paper becomes negligible? Yes. > Is this using the originally > proposed algorithm of migrating device data up to 128 consecutive times or is > it using rate-limiting of device data in pre-copy? Both. Yishai has new rate limiting based algorithm which also has similar results. > Any notable differences > between those algorithms? > No significant differences. Vfio level data transfer size is less now, as the frequency is reduced with your suggested algorithm. > > > The reduction in downtime related to RAM copy time should be > > > evidence that the pre-copy behavior here has exceeded its scope and > > > is interfering with the balance between pre- and post- copy > > > elsewhere. > > As I explained above, pre-copy did its job, it didn't interfere. It > > was just not enough and right samples to analyze back then. Now it is > > resolved. Thanks a lot for the direction. > > Glad we could arrive at a better understanding overall. Thanks, > > Alex