[V2,vfio,02/14] vfio: Extend the device migration protocol with PRE_COPY

Message ID 20221201152931.47913-3-yishaih@nvidia.com
State New, archived
Series Add migration PRE_COPY support for mlx5 driver

Commit Message

Yishai Hadas Dec. 1, 2022, 3:29 p.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

The optional PRE_COPY states open the saving data transfer FD before
reaching STOP_COPY and allow the device to dirty track internal state
changes, with the general idea of reducing the volume of data transferred
in the STOP_COPY stage.

While in PRE_COPY the device remains RUNNING, but the saving FD is open.

Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
which halts P2P transfers while continuing the saving FD.

PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
and exists as an optional FSM branch between RUNNING and STOP_COPY:
    RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
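
As a rough illustration of how this branch fits into the FSM, the sketch
below walks a local copy of the next-step table from this patch. It assumes a
driver that supports every optional feature; the enum names, the table name
next_step and the helper mig_path() are made up for the example, and the
ERROR column and ERROR row entries that fall outside the visible diff hunks
are assumed to be ERROR.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for enum vfio_device_mig_state (hypothetical short names). */
enum mig_state {
	M_ERR = 0, M_STOP = 1, M_RUN = 2, M_SCOPY = 3,
	M_RESUME = 4, M_RUNP2P = 5, M_PRE = 6, M_PREP2P = 7, M_NUM
};

/* next_step[cur][target]: next arc to take, copied from the patch's table. */
static const unsigned char next_step[M_NUM][M_NUM] = {
	/* target:     ERR    STOP      RUN       SCOPY     RESUME    RUNP2P    PRE       PREP2P */
	[M_ERR]    = { M_ERR, M_ERR,    M_ERR,    M_ERR,    M_ERR,    M_ERR,    M_ERR,    M_ERR },
	[M_STOP]   = { M_ERR, M_STOP,   M_RUNP2P, M_SCOPY,  M_RESUME, M_RUNP2P, M_RUNP2P, M_RUNP2P },
	[M_RUN]    = { M_ERR, M_RUNP2P, M_RUN,    M_RUNP2P, M_RUNP2P, M_RUNP2P, M_PRE,    M_RUNP2P },
	[M_SCOPY]  = { M_ERR, M_STOP,   M_STOP,   M_SCOPY,  M_STOP,   M_STOP,   M_ERR,    M_ERR },
	[M_RESUME] = { M_ERR, M_STOP,   M_STOP,   M_STOP,   M_RESUME, M_STOP,   M_STOP,   M_STOP },
	[M_RUNP2P] = { M_ERR, M_STOP,   M_RUN,    M_STOP,   M_STOP,   M_RUNP2P, M_RUN,    M_PREP2P },
	[M_PRE]    = { M_ERR, M_RUN,    M_RUN,    M_PREP2P, M_RUN,    M_RUN,    M_PRE,    M_PREP2P },
	[M_PREP2P] = { M_ERR, M_RUNP2P, M_RUNP2P, M_SCOPY,  M_RUNP2P, M_RUNP2P, M_PRE,    M_PREP2P },
};

/*
 * Record the arcs taken from cur to target into path[]. Returns the number
 * of arcs, or -1 if the transition is blocked or path[] is too small.
 */
int mig_path(enum mig_state cur, enum mig_state target,
	     enum mig_state *path, int max)
{
	int n = 0;

	assert(path != NULL);
	assert(cur < M_NUM && target < M_NUM);
	while (cur != target) {
		enum mig_state next = (enum mig_state)next_step[cur][target];

		if (next == M_ERR || n == max)
			return -1;
		path[n++] = next;
		cur = next;
	}
	return n;
}
```

With this table, asking for STOP_COPY from PRE_COPY passes through
PRE_COPY_P2P, while asking for STOP_COPY from RUNNING goes through
RUNNING_P2P and STOP and never enters PRE_COPY on its own. A request for
PRE_COPY from STOP_COPY is reported as blocked.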

A new ioctl VFIO_MIG_GET_PRECOPY_INFO is provided to allow userspace to
query the progress of the precopy operation in the driver. The idea is that
userspace will decide to move to STOP_COPY only once the initial data set
has been transferred, and possibly only after the dirty size has shrunk
appropriately.

This ioctl is valid only in PRE_COPY states and the kernel driver should
return -EINVAL from any other migration state.

Compared to the v1 clarification, STOP_COPY -> PRE_COPY is blocked
and is to be defined in the future.
We also split the pending_bytes report into the initial and sustaining
values, i.e. initial_bytes and dirty_bytes:
  initial_bytes: Amount of initial precopy data.
  dirty_bytes: Device state changes relative to data previously retrieved.
These fields are not required to have any bearing on the STOP_COPY phase.

It is recommended to leave PRE_COPY for STOP_COPY only after the
initial_bytes field reaches zero. Leaving PRE_COPY earlier might make
things slower.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio_main.c  |  74 ++++++++++++++++++++++-
 include/uapi/linux/vfio.h | 122 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 190 insertions(+), 6 deletions(-)

Comments

Alex Williamson Dec. 1, 2022, 10:43 p.m. UTC | #1
On Thu, 1 Dec 2022 17:29:19 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The optional PRE_COPY states open the saving data transfer FD before
> reaching STOP_COPY and allows the device to dirty track internal state
> changes with the general idea to reduce the volume of data transferred
> in the STOP_COPY stage.
> 
> While in PRE_COPY the device remains RUNNING, but the saving FD is open.
> 
> Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
> which halts P2P transfers while continuing the saving FD.
> 
> PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
> and exists as an optional FSM branch between RUNNING and STOP_COPY:
>     RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> 
> A new ioctl VFIO_MIG_GET_PRECOPY_INFO is provided to allow userspace to
> query the progress of the precopy operation in the driver with the idea it
> will judge to move to STOP_COPY at least once the initial data set is
> transferred, and possibly after the dirty size has shrunk appropriately.
> 
> This ioctl is valid only in PRE_COPY states and kernel driver should
> return -EINVAL from any other migration state.
> 
> Compared to the v1 clarification, STOP_COPY -> PRE_COPY is blocked
> and to be defined in future.
> We also split the pending_bytes report into the initial and sustaining
> values, e.g.: initial_bytes and dirty_bytes.
> initial_bytes: Amount of initial precopy data.
> dirty_bytes: Device state changes relative to data previously retrieved.
> These fields are not required to have any bearing to STOP_COPY phase.
> 
> It is recommended to leave PRE_COPY for STOP_COPY only after the
> initial_bytes field reaches zero. Leaving PRE_COPY earlier might make
> things slower.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/vfio_main.c  |  74 ++++++++++++++++++++++-
>  include/uapi/linux/vfio.h | 122 ++++++++++++++++++++++++++++++++++++--
>  2 files changed, 190 insertions(+), 6 deletions(-)

This looks ok to me, so if you want to provide a branch for the first
patch we can move forward with the rest through the vfio tree as was
mentioned.

Comments and reviews still welcome, particularly I expect Shameer has
already reviewed this for the hisi-acc implementation.  Thanks,

Alex
Tian, Kevin Dec. 2, 2022, 8:48 a.m. UTC | #2
> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Thursday, December 1, 2022 11:29 PM
> 
> +/**
> + * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
> + *
> + * This ioctl is used on the migration data FD in the precopy phase of the
> + * migration data transfer. It returns an estimate of the current data sizes
> + * remaining to be transferred. It allows the user to judge when it is
> + * appropriate to leave PRE_COPY for STOP_COPY.
> + *
> + * This ioctl is valid only in PRE_COPY states and kernel driver should
> + * return -EINVAL from any other migration state.
> + *
> + * The vfio_precopy_info data structure returned by this ioctl provides
> + * estimates of data available from the device during the PRE_COPY states.
> + * This estimate is split into two categories, initial_bytes and
> + * dirty_bytes.
> + *
> + * The initial_bytes field indicates the amount of initial precopy
> + * data available from the device. This field should have a non-zero initial
> + * value and decrease as migration data is read from the device.
> + * It is recommended to leave PRE_COPY for STOP_COPY only after this field
> + * reaches zero. Leaving PRE_COPY earlier might make things slower.

'slower' because partially transferred initial state is wasted and a full
state transfer is still required in STOP_COPY?

> + *
> + * The dirty_bytes field tracks device state changes relative to data
> + * previously retrieved.  This field starts at zero and may increase as
> + * the internal device state is modified or decrease as that modified
> + * state is read from the device.
> + *
> + * Userspace may use the combination of these fields to estimate the
> + * potential data size available during the PRE_COPY phases, as well as
> + * trends relative to the rate the device is dirtying its internal
> + * state, but these fields are not required to have any bearing relative
> + * to the data size available during the STOP_COPY phase.

I didn't get what the last sentence is trying to say. By definition those
fields have nothing to do with the transferred data in STOP_COPY.

Is there an example of what a silly driver might do w/o this caveat?

Except for the above, this looks good to me:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Shameerali Kolothum Thodi Dec. 2, 2022, 9:38 a.m. UTC | #3
> -----Original Message-----
> From: Yishai Hadas [mailto:yishaih@nvidia.com]
> Sent: 01 December 2022 15:29
> To: alex.williamson@redhat.com; jgg@nvidia.com
> Cc: kvm@vger.kernel.org; kevin.tian@intel.com; joao.m.martins@oracle.com;
> leonro@nvidia.com; shayd@nvidia.com; yishaih@nvidia.com;
> maorg@nvidia.com; avihaih@nvidia.com; cohuck@redhat.com
> Subject: [PATCH V2 vfio 02/14] vfio: Extend the device migration protocol
> with PRE_COPY

[...]
 
> +/**
> + * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
> + *
> + * This ioctl is used on the migration data FD in the precopy phase of the
> + * migration data transfer. It returns an estimate of the current data sizes
> + * remaining to be transferred. It allows the user to judge when it is
> + * appropriate to leave PRE_COPY for STOP_COPY.
> + *
> + * This ioctl is valid only in PRE_COPY states and kernel driver should
> + * return -EINVAL from any other migration state.
> + *
> + * The vfio_precopy_info data structure returned by this ioctl provides
> + * estimates of data available from the device during the PRE_COPY states.
> + * This estimate is split into two categories, initial_bytes and
> + * dirty_bytes.
> + *
> + * The initial_bytes field indicates the amount of initial precopy
> + * data available from the device. This field should have a non-zero initial
> + * value and decrease as migration data is read from the device.
> + * It is recommended to leave PRE_COPY for STOP_COPY only after this field
> + * reaches zero. Leaving PRE_COPY earlier might make things slower.
> + *
> + * The dirty_bytes field tracks device state changes relative to data
> + * previously retrieved.  This field starts at zero and may increase as
> + * the internal device state is modified or decrease as that modified
> + * state is read from the device.
> + *
> + * Userspace may use the combination of these fields to estimate the
> + * potential data size available during the PRE_COPY phases, as well as
> + * trends relative to the rate the device is dirtying its internal
> + * state, but these fields are not required to have any bearing relative
> + * to the data size available during the STOP_COPY phase.
> + *
> + * Drivers have a lot of flexibility in when and what they transfer during the
> + * PRE_COPY phase, and how they report this from
> VFIO_MIG_GET_PRECOPY_INFO.
> + *
> + * During pre-copy the migration data FD has a temporary "end of stream"
> that is
> + * reached when both initial_bytes and dirty_byte are zero. For instance,
> this
> + * may indicate that the device is idle and not currently dirtying any internal
> + * state. When read() is done on this temporary end of stream the kernel
> driver
> + * should return ENOMSG from read(). Userspace can wait for more data
> (which may
> + * never come) by using poll.
> + *
> + * Once in STOP_COPY the migration data FD has a permanent end of
> stream
> + * signaled in the usual way by read() always returning 0 and poll always
> + * returning readable. ENOMSG may not be returned in STOP_COPY.
> Support
> + * for this ioctl is optional.

Isn't it mandatory if the driver claims support for VFIO_MIGRATION_PRE_COPY?

Other than that looks fine to me.

Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Leon Romanovsky Dec. 4, 2022, 7:38 a.m. UTC | #4
On Thu, Dec 01, 2022 at 03:43:46PM -0700, Alex Williamson wrote:
> On Thu, 1 Dec 2022 17:29:19 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>

<...>

> >  drivers/vfio/vfio_main.c  |  74 ++++++++++++++++++++++-
> >  include/uapi/linux/vfio.h | 122 ++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 190 insertions(+), 6 deletions(-)
> 
> This looks ok to me, so if you want to provide a branch for the first
> patch we can move forward with the rest through the vfio tree as was
> mentioned.

Alex, feel free to take first patch too.
We are in -rc8 tomorrow and I don't expect any merge conflicts.

Thanks
Yishai Hadas Dec. 4, 2022, 8:35 a.m. UTC | #5
On 02/12/2022 10:48, Tian, Kevin wrote:
>> From: Yishai Hadas <yishaih@nvidia.com>
>> Sent: Thursday, December 1, 2022 11:29 PM
>>
>> +/**
>> + * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
>> + *
>> + * This ioctl is used on the migration data FD in the precopy phase of the
>> + * migration data transfer. It returns an estimate of the current data sizes
>> + * remaining to be transferred. It allows the user to judge when it is
>> + * appropriate to leave PRE_COPY for STOP_COPY.
>> + *
>> + * This ioctl is valid only in PRE_COPY states and kernel driver should
>> + * return -EINVAL from any other migration state.
>> + *
>> + * The vfio_precopy_info data structure returned by this ioctl provides
>> + * estimates of data available from the device during the PRE_COPY states.
>> + * This estimate is split into two categories, initial_bytes and
>> + * dirty_bytes.
>> + *
>> + * The initial_bytes field indicates the amount of initial precopy
>> + * data available from the device. This field should have a non-zero initial
>> + * value and decrease as migration data is read from the device.
>> + * It is recommended to leave PRE_COPY for STOP_COPY only after this field
>> + * reaches zero. Leaving PRE_COPY earlier might make things slower.
> 'slower' because partially transferred initial state is wasted and a full
> state transfer is still required in STOP_COPY?

Not only that; initial_bytes can serve any driver for its specific
needs to reduce downtime.

For example, mlx5 uses it to pass some metadata about the state that
allows the target to prepare ahead of STOP_COPY.

This data can be used by the FW to allocate host pages in advance,
reorganize its internal data structures accordingly, etc.

Leaving PRE_COPY for STOP_COPY earlier might not give the target the
chance to benefit from that information, and things might be slower as part
of STOP_COPY.

>
>> + *
>> + * The dirty_bytes field tracks device state changes relative to data
>> + * previously retrieved.  This field starts at zero and may increase as
>> + * the internal device state is modified or decrease as that modified
>> + * state is read from the device.
>> + *
>> + * Userspace may use the combination of these fields to estimate the
>> + * potential data size available during the PRE_COPY phases, as well as
>> + * trends relative to the rate the device is dirtying its internal
>> + * state, but these fields are not required to have any bearing relative
>> + * to the data size available during the STOP_COPY phase.
> I didn't get what the last sentence is trying to say. By definition those
> fields have nothing to do with the transferred data in STOP_COPY.
>
> is there an example what a silly driver might do w/o this caveat?

It is there to say that user space can't assume anything about the size of
the trailing STOP_COPY data set; this is why it's part of the UAPI header.

I believe it is better to keep it, as it clarifies things and prevents any
mistake.
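
One consequence for userspace is that the STOP_COPY drain has to run until
end of stream rather than to a byte count derived from the PRE_COPY
estimates. A minimal sketch follows, based on the read() semantics in the
UAPI comment (ENOMSG for the temporary end of stream in PRE_COPY, a return
of 0 for the permanent one in STOP_COPY). The names stream_event,
classify_read() and drain_stop_copy() are hypothetical, and a real consumer
would forward the data instead of discarding it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

enum stream_event { STREAM_DATA, STREAM_PAUSED, STREAM_DONE, STREAM_FAILED };

/* Interpret one read() result on the migration data FD. */
enum stream_event classify_read(ssize_t ret, int err)
{
	if (ret > 0)
		return STREAM_DATA;
	if (ret == 0)
		return STREAM_DONE;	/* permanent end of stream */
	if (err == ENOMSG)
		return STREAM_PAUSED;	/* temporary end of stream in PRE_COPY */
	return STREAM_FAILED;
}

/*
 * STOP_COPY drain: read until end of stream instead of to a byte count
 * predicted from initial_bytes/dirty_bytes. Returns bytes read, or -1.
 */
int64_t drain_stop_copy(int fd)
{
	char buf[4096];
	int64_t total = 0;

	assert(fd >= 0);
	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));
		int err = n < 0 ? errno : 0;

		if (n < 0 && err == EINTR)
			continue;	/* interrupted, retry the read */
		switch (classify_read(n, err)) {
		case STREAM_DATA:
			total += n;	/* a real consumer would forward buf here */
			break;
		case STREAM_DONE:
			return total;
		default:
			return -1;	/* ENOMSG is not expected in STOP_COPY */
		}
	}
}
```

During PRE_COPY the same classifier would treat STREAM_PAUSED as a cue to
poll the FD or to query the precopy info again, not as an error.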

> Except above this looks good to me:
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks Kevin, will add your Reviewed-by as part of V3.

Yishai
Yishai Hadas Dec. 5, 2022, 1:54 p.m. UTC | #6
On 02/12/2022 11:38, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Yishai Hadas [mailto:yishaih@nvidia.com]
>> Sent: 01 December 2022 15:29
>> To: alex.williamson@redhat.com; jgg@nvidia.com
>> Cc: kvm@vger.kernel.org; kevin.tian@intel.com; joao.m.martins@oracle.com;
>> leonro@nvidia.com; shayd@nvidia.com; yishaih@nvidia.com;
>> maorg@nvidia.com; avihaih@nvidia.com; cohuck@redhat.com
>> Subject: [PATCH V2 vfio 02/14] vfio: Extend the device migration protocol
>> with PRE_COPY
> [...]
>   
>> +/**
>> + * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
>> + *
>> + * This ioctl is used on the migration data FD in the precopy phase of the
>> + * migration data transfer. It returns an estimate of the current data sizes
>> + * remaining to be transferred. It allows the user to judge when it is
>> + * appropriate to leave PRE_COPY for STOP_COPY.
>> + *
>> + * This ioctl is valid only in PRE_COPY states and kernel driver should
>> + * return -EINVAL from any other migration state.
>> + *
>> + * The vfio_precopy_info data structure returned by this ioctl provides
>> + * estimates of data available from the device during the PRE_COPY states.
>> + * This estimate is split into two categories, initial_bytes and
>> + * dirty_bytes.
>> + *
>> + * The initial_bytes field indicates the amount of initial precopy
>> + * data available from the device. This field should have a non-zero initial
>> + * value and decrease as migration data is read from the device.
>> + * It is recommended to leave PRE_COPY for STOP_COPY only after this field
>> + * reaches zero. Leaving PRE_COPY earlier might make things slower.
>> + *
>> + * The dirty_bytes field tracks device state changes relative to data
>> + * previously retrieved.  This field starts at zero and may increase as
>> + * the internal device state is modified or decrease as that modified
>> + * state is read from the device.
>> + *
>> + * Userspace may use the combination of these fields to estimate the
>> + * potential data size available during the PRE_COPY phases, as well as
>> + * trends relative to the rate the device is dirtying its internal
>> + * state, but these fields are not required to have any bearing relative
>> + * to the data size available during the STOP_COPY phase.
>> + *
>> + * Drivers have a lot of flexibility in when and what they transfer during the
>> + * PRE_COPY phase, and how they report this from
>> VFIO_MIG_GET_PRECOPY_INFO.
>> + *
>> + * During pre-copy the migration data FD has a temporary "end of stream"
>> that is
>> + * reached when both initial_bytes and dirty_byte are zero. For instance,
>> this
>> + * may indicate that the device is idle and not currently dirtying any internal
>> + * state. When read() is done on this temporary end of stream the kernel
>> driver
>> + * should return ENOMSG from read(). Userspace can wait for more data
>> (which may
>> + * never come) by using poll.
>> + *
>> + * Once in STOP_COPY the migration data FD has a permanent end of
>> stream
>> + * signaled in the usual way by read() always returning 0 and poll always
>> + * returning readable. ENOMSG may not be returned in STOP_COPY.
>> Support
>> + * for this ioctl is optional.
> Isn't mandatory if the driver claims support for VFIO_MIGRATION_PRE_COPY?

It seems reasonable to make it mandatory once the driver claims to
support PRE_COPY.

This will also simplify things from QEMU's point of view, as it can expect
the ioctl to be supported.

Will add a note here as part of V3.
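
With that rule, userspace could gate its use of the ioctl purely on the
advertised migration flags. The sketch below is an assumption-laden
illustration: the MIG_* macros mirror the VFIO_MIGRATION_* bit values from
this patch, while opt_state, required_flags, state_available() and
can_query_precopy_info() are hypothetical helpers that restate the flag
combinations documented in the UAPI header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit values as VFIO_MIGRATION_STOP_COPY, _P2P and _PRE_COPY. */
#define MIG_STOP_COPY	(1u << 0)
#define MIG_P2P		(1u << 1)
#define MIG_PRE_COPY	(1u << 2)

enum opt_state { OPT_RUNNING_P2P, OPT_PRE_COPY, OPT_PRE_COPY_P2P, OPT_NUM };

/* Flags that must all be advertised for each optional state to exist. */
static const uint64_t required_flags[OPT_NUM] = {
	[OPT_RUNNING_P2P]  = MIG_STOP_COPY | MIG_P2P,
	[OPT_PRE_COPY]     = MIG_STOP_COPY | MIG_PRE_COPY,
	[OPT_PRE_COPY_P2P] = MIG_STOP_COPY | MIG_P2P | MIG_PRE_COPY,
};

bool state_available(uint64_t flags, enum opt_state st)
{
	assert(st < OPT_NUM);
	return (flags & required_flags[st]) == required_flags[st];
}

/* Under the proposed rule, PRE_COPY support implies the ioctl is usable. */
bool can_query_precopy_info(uint64_t flags)
{
	return state_available(flags, OPT_PRE_COPY);
}
```

The flags value would be the one reported through
VFIO_DEVICE_FEATURE_MIGRATION.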

>
> Other than that looks fine to me.
>
> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
>
Thanks, will add your Reviewed-by as part of V3.

Yishai

Patch

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 662e267a3e13..9c4a752dad4e 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1042,7 +1042,7 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_PRE_COPY_P2P + 1 };
 	/*
 	 * The coding in this table requires the driver to implement the
 	 * following FSM arcs:
@@ -1057,30 +1057,65 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 	 *         RUNNING_P2P -> RUNNING
 	 *         RUNNING_P2P -> STOP
 	 *         STOP -> RUNNING_P2P
-	 * Without P2P the driver must implement:
+	 *
+	 * If precopy is supported then the driver must support these additional
+	 * FSM arcs:
+	 *         RUNNING -> PRE_COPY
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> STOP_COPY
+	 * However, if precopy and P2P are supported together then the driver
+	 * must support these additional arcs beyond the P2P arcs above:
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> PRE_COPY_P2P
+	 *         PRE_COPY_P2P -> PRE_COPY
+	 *         PRE_COPY_P2P -> RUNNING_P2P
+	 *         PRE_COPY_P2P -> STOP_COPY
+	 *         RUNNING -> PRE_COPY
+	 *         RUNNING_P2P -> PRE_COPY_P2P
+	 *
+	 * Without P2P and precopy the driver must implement:
 	 *         RUNNING -> STOP
 	 *         STOP -> RUNNING
 	 *
 	 * The coding will step through multiple states for some combination
 	 * transitions; if all optional features are supported, this means the
 	 * following ones:
+	 *         PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RUNNING -> RUNNING_P2P -> STOP
 	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         STOP -> RUNNING_P2P -> RUNNING
+	 *         STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         STOP_COPY -> STOP -> RESUMING
 	 *         STOP_COPY -> STOP -> RUNNING_P2P
 	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
+	 *
+	 *  The following transitions are blocked:
+	 *         STOP_COPY -> PRE_COPY
+	 *         STOP_COPY -> PRE_COPY_P2P
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1089,14 +1124,38 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
+		[VFIO_DEVICE_STATE_PRE_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1105,6 +1164,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RESUMING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1113,6 +1174,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1121,6 +1184,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_ERROR] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
@@ -1131,6 +1196,11 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_MIGRATION_STOP_COPY |
+						   VFIO_MIGRATION_P2P |
+						   VFIO_MIGRATION_PRE_COPY,
 		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] =
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 3e45dbaf190e..a2efd04e506b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -819,12 +819,20 @@  struct vfio_device_feature {
  * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
  * is supported in addition to the STOP_COPY states.
  *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
  * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
 #define VFIO_MIGRATION_P2P		(1 << 1)
+#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -875,8 +883,13 @@  struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
  *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY - The device is running normally but tracking internal state
+ *             changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
  *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
@@ -908,20 +921,48 @@  struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
+ * PRE_COPY -> RUNNING
  * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
  * RUNNING -> RUNNING_P2P
  * STOP -> RUNNING_P2P
  *   While in RUNNING_P2P the device is partially running in the P2P quiescent
  *   state defined below.
  *
+ *   The PRE_COPY_P2P arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
  * STOP -> STOP_COPY
- *   This arc begin the process of saving the device state and will return a
- *   new data_fd.
+ *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *   which share a data transfer session. Moving between these states alters
+ *   what is streamed in session, but does not terminate or otherwise affect
+ *   the associated fd.
+ *
+ *   These arcs begin the process of saving the device state and will return a
+ *   new data_fd. The migration driver may perform actions such as enabling
+ *   dirty logging of device state when entering PRE_COPY or PRE_COPY_P2P.
  *
+ *   Each arc does not change the device operation, the device remains
+ *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
+ *   in PRE_COPY_P2P -> STOP_COPY.
+ *
+ * PRE_COPY -> PRE_COPY_P2P
+ *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *   However, while in the PRE_COPY_P2P state, the device is partially running
+ *   in the P2P quiescent state defined below, like RUNNING_P2P.
+ *
+ * PRE_COPY_P2P -> PRE_COPY
+ *   This arc allows returning the device to a full RUNNING behavior while
+ *   continuing all the behaviors of PRE_COPY.
+ *
+ * PRE_COPY_P2P -> STOP_COPY
  *   While in the STOP_COPY state the device has the same behavior as STOP
  *   with the addition that the data transfers session continues to stream the
  *   migration state. End of stream on the FD indicates the entire device
@@ -939,6 +980,13 @@  struct vfio_device_feature_mig_state {
  *   device state for this arc if required to prepare the device to receive the
  *   migration data.
  *
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ *   These arcs are not permitted and return error if requested. Future
+ *   revisions of this API may define behaviors for these arcs, in this case
+ *   support will be discoverable by a new flag in
+ *   VFIO_DEVICE_FEATURE_MIGRATION.
+ *
  * any -> ERROR
  *   ERROR cannot be specified as a device state, however any transition request
  *   can be failed with an errno return and may then move the device_state into
@@ -950,7 +998,7 @@  struct vfio_device_feature_mig_state {
  * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
  * state for the device for the purposes of managing multiple devices within a
  * user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
  * any new P2P DMA transactions. If the device can identify P2P transactions
  * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
  * driver must complete any such outstanding operations prior to completing the
@@ -963,6 +1011,8 @@  struct vfio_device_feature_mig_state {
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
  *   - Select the shortest path.
+ *   - The path cannot have saving group states as interior arcs, only
+ *     starting/end states.
  * Refer to vfio_mig_get_next_state() for the result of the algorithm.
  *
  * The automatic transit through the FSM arcs that make up the combination
@@ -976,6 +1026,9 @@  struct vfio_device_feature_mig_state {
  * support them. The user can discover if these states are supported by using
  * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
  * avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -984,8 +1037,69 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+	VFIO_DEVICE_STATE_PRE_COPY = 6,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
+};
+
+/**
+ * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * This ioctl is used on the migration data FD in the precopy phase of the
+ * migration data transfer. It returns an estimate of the current data sizes
+ * remaining to be transferred. It allows the user to judge when it is
+ * appropriate to leave PRE_COPY for STOP_COPY.
+ *
+ * This ioctl is valid only in PRE_COPY states and kernel driver should
+ * return -EINVAL from any other migration state.
+ *
+ * The vfio_precopy_info data structure returned by this ioctl provides
+ * estimates of data available from the device during the PRE_COPY states.
+ * This estimate is split into two categories, initial_bytes and
+ * dirty_bytes.
+ *
+ * The initial_bytes field indicates the amount of initial precopy
+ * data available from the device. This field should have a non-zero initial
+ * value and decrease as migration data is read from the device.
+ * It is recommended to leave PRE_COPY for STOP_COPY only after this field
+ * reaches zero. Leaving PRE_COPY earlier might make things slower.
+ *
+ * The dirty_bytes field tracks device state changes relative to data
+ * previously retrieved.  This field starts at zero and may increase as
+ * the internal device state is modified or decrease as that modified
+ * state is read from the device.
+ *
+ * Userspace may use the combination of these fields to estimate the
+ * potential data size available during the PRE_COPY phases, as well as
+ * trends relative to the rate the device is dirtying its internal
+ * state, but these fields are not required to have any bearing relative
+ * to the data size available during the STOP_COPY phase.
+ *
+ * Drivers have a lot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_MIG_GET_PRECOPY_INFO.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_bytes are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll.
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY. Support
+ * for this ioctl is optional.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_precopy_info {
+	__u32 argsz;
+	__u32 flags;
+	__aligned_u64 initial_bytes;
+	__aligned_u64 dirty_bytes;
 };
 
+#define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power