Message ID | 20220130160826.32449-9-yishaih@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Add mlx5 live migration driver and v2 migration protocol | expand |
On Sun, 30 Jan 2022 18:08:19 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index ef33ea002b0b..d9162702973a 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid { > > struct vfio_device_migration_info { > __u32 device_state; /* VFIO device state */ > -#define VFIO_DEVICE_STATE_STOP (0) > -#define VFIO_DEVICE_STATE_RUNNING (1 << 0) > -#define VFIO_DEVICE_STATE_SAVING (1 << 1) > -#define VFIO_DEVICE_STATE_RESUMING (1 << 2) > +#define VFIO_DEVICE_STATE_V1_STOP (0) > +#define VFIO_DEVICE_STATE_V1_RUNNING (1 << 0) > +#define VFIO_DEVICE_STATE_V1_SAVING (1 << 1) > +#define VFIO_DEVICE_STATE_V1_RESUMING (1 << 2) I assume the below is kept until we rip out all the references, but I'm not sure why we're bothering to define V1 that's not used anywhere versus just deleting the above to avoid collision with the new enum. > #define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \ > VFIO_DEVICE_STATE_SAVING | \ > VFIO_DEVICE_STATE_RESUMING) > @@ -1002,6 +1002,162 @@ struct vfio_device_feature { > */ > #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN (0) > > +/* > + * Indicates the device can support the migration API. See enum > + * vfio_device_mig_state for details. If present flags must be non-zero and > + * VFIO_DEVICE_MIG_SET_STATE is supported. > + * > + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and > + * RESUMING are supported. > + */ > +struct vfio_device_feature_migration { > + __aligned_u64 flags; > +#define VFIO_MIGRATION_STOP_COPY (1 << 0) > +}; > +#define VFIO_DEVICE_FEATURE_MIGRATION 1 > + > +/* > + * The device migration Finite State Machine is described by the enum > + * vfio_device_mig_state. Some of the FSM arcs will create a migration data > + * transfer session by returning a FD, in this case the migration data will > + * flow over the FD using read() and write() as discussed below. > + * > + * There are 5 states to support VFIO_MIGRATION_STOP_COPY: > + * RUNNING - The device is running normally > + * STOP - The device does not change the internal or external state > + * STOP_COPY - The device internal state can be read out > + * RESUMING - The device is stopped and is loading a new internal state > + * ERROR - The device has failed and must be reset > + * > + * The FSM takes actions on the arcs between FSM states. The driver implements > + * the following behavior for the FSM arcs: > + * > + * RUNNING -> STOP > + * STOP_COPY -> STOP > + * While in STOP the device must stop the operation of the device. The > + * device must not generate interrupts, DMA, or advance its internal > + * state. When stopped the device and kernel migration driver must accept > + * and respond to interaction to support external subsystems in the STOP > + * state, for example PCI MSI-X and PCI config pace. Failure by the user to > + * restrict device access while in STOP must not result in error conditions > + * outside the user context (ex. host system faults). > + * > + * The STOP_COPY arc will terminate a data transfer session. > + * > + * RESUMING -> STOP > + * Leaving RESUMING terminates a data transfer session and indicates the > + * device should complete processing of the data delivered by write(). The > + * kernel migration driver should complete the incorporation of data written > + * to the data transfer FD into the device internal state and perform > + * final validity and consistency checking of the new device state. If the > + * user provided data is found to be incomplete, inconsistent, or otherwise > + * invalid, the migration driver must fail the SET_STATE ioctl and > + * optionally go to the ERROR state as described below. > + * > + * While in STOP the device has the same behavior as other STOP states > + * described above. > + * > + * To abort a RESUMING session the device must be reset. > + * > + * STOP -> RUNNING > + * While in RUNNING the device is fully operational, the device may generate > + * interrupts, DMA, respond to MMIO, all vfio device regions are functional, > + * and the device may advance its internal state. > + * > + * STOP -> STOP_COPY > + * This arc begin the process of saving the device state and will return a > + * new data_fd. > + * > + * While in the STOP_COPY state the device has the same behavior as STOP > + * with the addition that the data transfers session continues to stream the > + * migration state. End of stream on the FD indicates the entire device > + * state has been transferred. > + * > + * The user should take steps to restrict access to vfio device regions while > + * the device is in STOP_COPY or risk corruption of the device migration data > + * stream. > + * > + * STOP -> RESUMING > + * Entering the RESUMING state starts a process of restoring the device > + * state and will return a new data_fd. The data stream fed into the data_fd > + * should be taken from the data transfer output of the saving group states > + * from a compatible device. The migration driver may alter/reset the > + * internal device state for this arc if required to prepare the device to > + * receive the migration data. > + * > + * any -> ERROR > + * ERROR cannot be specified as a device state, however any transition request > + * can be failed with an errno return and may then move the device_state into > + * ERROR. In this case the device was unable to execute the requested arc and > + * was also unable to restore the device to any valid device_state. The ERROR > + * state will be returned as described below in VFIO_DEVICE_MIG_SET_STATE. To > + * recover from ERROR VFIO_DEVICE_RESET must be used to return the > + * device_state back to RUNNING. > + * > + * The remaining possible transitions are interpreted as combinations of the > + * above FSM arcs. As there are multiple paths through the FSM arcs the path > + * should be selected based on the following rules: > + * - Select the shortest path. > + * Refer to vfio_mig_get_next_state() for the result of the algorithm. > + * > + * The automatic transit through the FSM arcs that make up the combination > + * transition is invisible to the user. When working with combination arcs the > + * user may see any step along the path in the device_state if SET_STATE > + * fails. When handling these types of errors users should anticipate future > + * revisions of this protocol using new states and those states becoming > + * visible in this case. > + */ > +enum vfio_device_mig_state { > + VFIO_DEVICE_STATE_ERROR = 0, > + VFIO_DEVICE_STATE_STOP = 1, > + VFIO_DEVICE_STATE_RUNNING = 2, > + VFIO_DEVICE_STATE_STOP_COPY = 3, > + VFIO_DEVICE_STATE_RESUMING = 4, > +}; > + > +/** > + * VFIO_DEVICE_MIG_SET_STATE - _IO(VFIO_TYPE, VFIO_BASE + 21) > + * > + * Execute a migration state change command on the VFIO device. The new state is > + * supplied in device_state. > + * > + * The kernel migration driver must fully transition the device to the new state > + * value before the write(2) operation returns to the user. > + * > + * The kernel migration driver must not generate asynchronous device state > + * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET > + * ioctl as described above. > + * > + * If this function fails and returns -1 then the device_state is updated with > + * the current state the device is in. This may be the original operating state > + * or some other state along the combination transition path. The user can then > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the > + * original state, or attempt to return to some other state such as RUNNING or > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state > + * output is not reliable. I haven't made it through the full series yet, but it's not clear to me why these specific errnos are being masked above. > + * > + * If the new_state starts a new data transfer session then the FD associated > + * with that session is returned in data_fd. The user is responsible to close > + * this FD when it is finished. The user must consider the migration data > + * segments carried over the FD to be opaque and non-fungible. During RESUMING, > + * the data segments must be written in the same order they came out of the > + * saving side FD. The lifecycle of this FD is a little sketchy. The user is responsible to close the FD, are they required to? ie. should the migration driver fail transitions if there's an outstanding FD? Should the core code mangle the f_ops or force and EOF or in some other way disconnect the FD to avoid driver bugs/exploits with users poking stale FDs? Should we be bumping a reference on the device FD such that we can't have outstanding migration FDs with the device closed (and re-assigned to a new user)? > + * > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL, > + * and take no action. However the device_state will be updated with the current > + * value. > + * > + * Return: 0 on success, -1 and errno set on failure. > + */ > +struct vfio_device_mig_set_state { > + __u32 argsz; > + __u32 device_state; > + __s32 data_fd; > + __u32 flags; > +}; argsz and flags layout is inconsistent with all other vfio ioctls. > + > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21) Did you consider whether this could also be implemented as a VFIO_DEVICE_FEATURE? Seems the feature struct would just be device_state and data_fd. Perhaps there's a use case for GET as well. Thanks, Alex
On Mon, Jan 31, 2022 at 04:43:18PM -0700, Alex Williamson wrote: > On Sun, 30 Jan 2022 18:08:19 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index ef33ea002b0b..d9162702973a 100644 > > +++ b/include/uapi/linux/vfio.h > > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid { > > > > struct vfio_device_migration_info { > > __u32 device_state; /* VFIO device state */ > > -#define VFIO_DEVICE_STATE_STOP (0) > > -#define VFIO_DEVICE_STATE_RUNNING (1 << 0) > > -#define VFIO_DEVICE_STATE_SAVING (1 << 1) > > -#define VFIO_DEVICE_STATE_RESUMING (1 << 2) > > +#define VFIO_DEVICE_STATE_V1_STOP (0) > > +#define VFIO_DEVICE_STATE_V1_RUNNING (1 << 0) > > +#define VFIO_DEVICE_STATE_V1_SAVING (1 << 1) > > +#define VFIO_DEVICE_STATE_V1_RESUMING (1 << 2) > > I assume the below is kept until we rip out all the references, but I'm > not sure why we're bothering to define V1 that's not used anywhere > versus just deleting the above to avoid collision with the new enum. I felt adding the deletion made this patch too big so I shoved it into its own patch after the v2 stuff is described. The rename here is only because we end up with a naming conflict with the enum below. > > + * If this function fails and returns -1 then the device_state is updated with > > + * the current state the device is in. This may be the original operating state > > + * or some other state along the combination transition path. The user can then > > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the > > + * original state, or attempt to return to some other state such as RUNNING or > > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state > > + * output is not reliable. > > I haven't made it through the full series yet, but it's not clear to me > why these specific errnos are being masked above. Basically, we can't return the device_state unless we properly process the ioctl. Eg old kernels that do not support this will return ENOTTY and will not update it. If userspace messed up the pointer EFAULT will be return and it will not be updated, finally EOPNOTSUPP is a generic escape for any future reason the kernel might not want to update it. In practice, I found no use for using the device_state in the error path in qemu, but it seemed useful for debugging. > > + * If the new_state starts a new data transfer session then the FD associated > > + * with that session is returned in data_fd. The user is responsible to close > > + * this FD when it is finished. The user must consider the migration data > > + * segments carried over the FD to be opaque and non-fungible. During RESUMING, > > + * the data segments must be written in the same order they came out of the > > + * saving side FD. > > The lifecycle of this FD is a little sketchy. The user is responsible > to close the FD, are they required to? No. Detecting this in the kernel would be notable added complexity to the drivers. Let's clarify it: "close this FD when it no longer has data to read/write. data_fds are not re-used, every data transfer session gets a new FD." ? > ie. should the migration driver fail transitions if there's an > outstanding FD? No, the driver should orphan that FD and use a fresh new one the next cycle. mlx5 will sanitize the FD, free all the memory, and render it inoperable which I'd view as best practice. > Should the core code mangle the f_ops or force and EOF or in some > other way disconnect the FD to avoid driver bugs/exploits with users > poking stale FDs? We looked at swapping f_ops of a running fd for the iommufd project and decided it was not allowed/desired. It needs locking. Here the driver should piggy back the force EOF using its own existing locking protecting concurrent read/write, like mlx5 did. It is straightforward. > Should we be bumping a reference on the device FD such that we can't > have outstanding migration FDs with the device closed (and > re-assigned to a new user)? The driver must ensure any activity triggered by the migration FD against the vfio_device is halted before close_device() returns, just like basically everything else connected to open/close_device(). mlx5 does this by using the same EOF sanitizing the FSM logic uses. Once sanitized the f_ops should not be touching the vfio_device, or even have a pointer to it, so there is no reason to connect the two FDs together. I'd say it is a red flag if a driver proposes to do this, likely it means it has a problem with the open/close_device() lifetime model. > > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL, > > + * and take no action. However the device_state will be updated with the current > > + * value. > > + * > > + * Return: 0 on success, -1 and errno set on failure. > > + */ > > +struct vfio_device_mig_set_state { > > + __u32 argsz; > > + __u32 device_state; > > + __s32 data_fd; > > + __u32 flags; > > +}; > > argsz and flags layout is inconsistent with all other vfio ioctls. OK > > > + > > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21) > > Did you consider whether this could also be implemented as a > VFIO_DEVICE_FEATURE? Seems the feature struct would just be > device_state and data_fd. Perhaps there's a use case for GET as well. > Thanks, Only briefly.. I'm not sure what the overall VFIO vision is here.. Are we abandoning traditional ioctls in favour of a multiplexer? Calling the multiplexer ioctl "feature" is a bit odd.. It complicates the user code a bit, it is more complicated to invoke the VFIO_DEVICE_FEATURE (check the qemu patch to see the difference). Either way I don't have a strong opinion, please have a think and let us know which you'd like to follow. Thanks, Jason
On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote: > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device, > return -EINVAL; > > switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { > + case VFIO_DEVICE_FEATURE_MIGRATION: > + return vfio_ioctl_device_feature_migration( > + device, feature.flags, arg->data, > + feature.argsz - minsz); > default: > if (unlikely(!device->ops->device_feature)) > return -EINVAL; > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep, > struct vfio_device *device = filep->private_data; > > switch (cmd) { > + case VFIO_DEVICE_MIG_SET_STATE: > + return vfio_ioctl_mig_set_state(device, (void __user *)arg); > case VFIO_DEVICE_FEATURE: > return vfio_ioctl_device_feature(device, (void __user *)arg); > default: Not really a critique of this patch, but have we considered how mediated devices will implement migration? I.e. what parts of the ops will need to be looped through the mdev ops? Do we need to consider the scope of some queries/operations (whole device vs subdivisions etc.)? Not trying to distract from the whole new interface here, but I think we should have at least an idea.
On Tue, Feb 01, 2022 at 01:06:51PM +0100, Cornelia Huck wrote: > On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote: > > > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device, > > return -EINVAL; > > > > switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { > > + case VFIO_DEVICE_FEATURE_MIGRATION: > > + return vfio_ioctl_device_feature_migration( > > + device, feature.flags, arg->data, > > + feature.argsz - minsz); > > default: > > if (unlikely(!device->ops->device_feature)) > > return -EINVAL; > > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep, > > struct vfio_device *device = filep->private_data; > > > > switch (cmd) { > > + case VFIO_DEVICE_MIG_SET_STATE: > > + return vfio_ioctl_mig_set_state(device, (void __user *)arg); > > case VFIO_DEVICE_FEATURE: > > return vfio_ioctl_device_feature(device, (void __user *)arg); > > default: > > Not really a critique of this patch, but have we considered how mediated > devices will implement migration? Yes > I.e. what parts of the ops will need to be looped through the mdev > ops? I've deleted mdev ops in every driver except the intel vgpu, once Christoph's patch there is merged mdev ops will be almost gone completely. mdev drivers now implement normal vfio_device_ops and require nothing special for migration. > Do we need to consider the scope of some queries/operations (whole > device vs subdivisions etc.)? Not trying to distract from the whole new > interface here, but I think we should have at least an idea. All vfio operations on the device FD operate on whatever the struct vfio_device is. Jason
On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Feb 01, 2022 at 01:06:51PM +0100, Cornelia Huck wrote: >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote: >> >> > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device, >> > return -EINVAL; >> > >> > switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { >> > + case VFIO_DEVICE_FEATURE_MIGRATION: >> > + return vfio_ioctl_device_feature_migration( >> > + device, feature.flags, arg->data, >> > + feature.argsz - minsz); >> > default: >> > if (unlikely(!device->ops->device_feature)) >> > return -EINVAL; >> > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep, >> > struct vfio_device *device = filep->private_data; >> > >> > switch (cmd) { >> > + case VFIO_DEVICE_MIG_SET_STATE: >> > + return vfio_ioctl_mig_set_state(device, (void __user *)arg); >> > case VFIO_DEVICE_FEATURE: >> > return vfio_ioctl_device_feature(device, (void __user *)arg); >> > default: >> >> Not really a critique of this patch, but have we considered how mediated >> devices will implement migration? > > Yes > >> I.e. what parts of the ops will need to be looped through the mdev >> ops? > > I've deleted mdev ops in every driver except the intel vgpu, once > Christoph's patch there is merged mdev ops will be almost gone > completely. Ok, if there's nothing left to do, that's fine. (I'm assuming that the Intel vgpu patch is on its way in? I usually don't keep track of things I'm not directly involved with.)
On Tue, Feb 01, 2022 at 01:18:29PM +0100, Cornelia Huck wrote: > On Tue, Feb 01 2022, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Feb 01, 2022 at 01:06:51PM +0100, Cornelia Huck wrote: > >> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote: > >> > >> > @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device, > >> > return -EINVAL; > >> > > >> > switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { > >> > + case VFIO_DEVICE_FEATURE_MIGRATION: > >> > + return vfio_ioctl_device_feature_migration( > >> > + device, feature.flags, arg->data, > >> > + feature.argsz - minsz); > >> > default: > >> > if (unlikely(!device->ops->device_feature)) > >> > return -EINVAL; > >> > @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep, > >> > struct vfio_device *device = filep->private_data; > >> > > >> > switch (cmd) { > >> > + case VFIO_DEVICE_MIG_SET_STATE: > >> > + return vfio_ioctl_mig_set_state(device, (void __user *)arg); > >> > case VFIO_DEVICE_FEATURE: > >> > return vfio_ioctl_device_feature(device, (void __user *)arg); > >> > default: > >> > >> Not really a critique of this patch, but have we considered how mediated > >> devices will implement migration? > > > > Yes > > > >> I.e. what parts of the ops will need to be looped through the mdev > >> ops? > > > > I've deleted mdev ops in every driver except the intel vgpu, once > > Christoph's patch there is merged mdev ops will be almost gone > > completely. > > Ok, if there's nothing left to do, that's fine. (I'm assuming that the > Intel vgpu patch is on its way in? I usually don't keep track of things > I'm not directly involved with.) It is awaiting some infrastructure patches Intel is working on, but progressing slowly. In any event, it doesn't block other mdev drivers from using the new ops scheme, it only blocks us from deleting the core code supporting it. Jason
On Mon, 31 Jan 2022 20:31:24 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Jan 31, 2022 at 04:43:18PM -0700, Alex Williamson wrote: > > On Sun, 30 Jan 2022 18:08:19 +0200 > > Yishai Hadas <yishaih@nvidia.com> wrote: > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > > index ef33ea002b0b..d9162702973a 100644 > > > +++ b/include/uapi/linux/vfio.h > > > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid { > > > > > > struct vfio_device_migration_info { > > > __u32 device_state; /* VFIO device state */ > > > -#define VFIO_DEVICE_STATE_STOP (0) > > > -#define VFIO_DEVICE_STATE_RUNNING (1 << 0) > > > -#define VFIO_DEVICE_STATE_SAVING (1 << 1) > > > -#define VFIO_DEVICE_STATE_RESUMING (1 << 2) > > > +#define VFIO_DEVICE_STATE_V1_STOP (0) > > > +#define VFIO_DEVICE_STATE_V1_RUNNING (1 << 0) > > > +#define VFIO_DEVICE_STATE_V1_SAVING (1 << 1) > > > +#define VFIO_DEVICE_STATE_V1_RESUMING (1 << 2) > > > > I assume the below is kept until we rip out all the references, but I'm > > not sure why we're bothering to define V1 that's not used anywhere > > versus just deleting the above to avoid collision with the new enum. > > I felt adding the deletion made this patch too big so I shoved it into > its own patch after the v2 stuff is described. The rename here is only > because we end up with a naming conflict with the enum below. Right, but we could just as easily delete the above 4 lines here to avoid the conflict rather than renaming them to V1. > > > + * If this function fails and returns -1 then the device_state is updated with > > > + * the current state the device is in. This may be the original operating state > > > + * or some other state along the combination transition path. The user can then > > > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the > > > + * original state, or attempt to return to some other state such as RUNNING or > > > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state > > > + * output is not reliable. > > > > I haven't made it through the full series yet, but it's not clear to me > > why these specific errnos are being masked above. > > Basically, we can't return the device_state unless we properly process > the ioctl. Eg old kernels that do not support this will return ENOTTY > and will not update it. If userspace messed up the pointer EFAULT will > be return and it will not be updated, finally EOPNOTSUPP is a generic > escape for any future reason the kernel might not want to update it. > > In practice, I found no use for using the device_state in the error > path in qemu, but it seemed useful for debugging. Ok, let me parrot back to see if I understand. -ENOTTY will be returned if the ioctl doesn't exist, in which case device_state is untouched and cannot be trusted. At the same time, we expect the user to use the feature ioctl to make sure the ioctl exists, so it would seem that we've reclaimed that errno if we believe the user should follow the protocol. -EOPNOTSUPP is returned both if the driver doesn't support migration (which should be invalid based on the protocol). ie. this: + if (!device->ops->migration_set_state) + return -EOPNOTSUPP; Should return -ENOTTY, just as the feature does. But it's also for future unsupported ops, but couldn't we also specify that the driver must fill final_state with the current device state for any such case. We also have this: + if (set_state.argsz < minsz || set_state.flags) + return -EOPNOTSUPP; Which I think should be -EINVAL. That leaves -EFAULT, for example: + if (copy_from_user(&set_state, arg, minsz)) + return -EFAULT; Should we be able to know the current device state in core code such that we can fill in device state here? I think those changes would go a ways towards fully specified behavior instead of these wishy washy unreliable return values. Then we could also get rid of this paranoia protection of those errnos: + if (IS_ERR(filp)) { + if (WARN_ON(PTR_ERR(filp) == -EOPNOTSUPP || + PTR_ERR(filp) == -ENOTTY || + PTR_ERR(filp) == -EFAULT)) + filp = ERR_PTR(-EINVAL); + goto out_copy; + } Also, the original text of this uapi paragraph reads: "If this function fails and returns -1 then..." Could we clarify that to s/function/ioctl/? It caused me a moment of confusion for the returned -errnos. > > > + * If the new_state starts a new data transfer session then the FD associated > > > + * with that session is returned in data_fd. The user is responsible to close > > > + * this FD when it is finished. The user must consider the migration data > > > + * segments carried over the FD to be opaque and non-fungible. During RESUMING, > > > + * the data segments must be written in the same order they came out of the > > > + * saving side FD. > > > > The lifecycle of this FD is a little sketchy. The user is responsible > > to close the FD, are they required to? > > No. Detecting this in the kernel would be notable added complexity to > the drivers. > > Let's clarify it: > > "close this FD when it no longer has data to > read/write. data_fds are not re-used, every data transfer session gets > a new FD." > > ? Better > > ie. should the migration driver fail transitions if there's an > > outstanding FD? > > No, the driver should orphan that FD and use a fresh new one the next > cycle. mlx5 will sanitize the FD, free all the memory, and render it > inoperable which I'd view as best practice. Agreed, can we add a second sentence to the above clarification to outline those driver responsibilities? > > Should the core code mangle the f_ops or force and EOF or in some > > other way disconnect the FD to avoid driver bugs/exploits with users > > poking stale FDs? > > We looked at swapping f_ops of a running fd for the iommufd project > and decided it was not allowed/desired. It needs locking. > > Here the driver should piggy back the force EOF using its own existing > locking protecting concurrent read/write, like mlx5 did. It is > straightforward. Right, sounded ugly but I thought I'd toss it out. If we define it as the driver's responsibility, I think I'm ok. > > Should we be bumping a reference on the device FD such that we can't > > have outstanding migration FDs with the device closed (and > > re-assigned to a new user)? > > The driver must ensure any activity triggered by the migration FD > against the vfio_device is halted before close_device() returns, just > like basically everything else connected to open/close_device(). mlx5 > does this by using the same EOF sanitizing the FSM logic uses. > > Once sanitized the f_ops should not be touching the vfio_device, or > even have a pointer to it, so there is no reason to connect the two > FDs together. I'd say it is a red flag if a driver proposes to do > this, likely it means it has a problem with the open/close_device() > lifetime model. Maybe we just need a paragraph somewhere to describe the driver responsibilities and expectations in managing the migration FD, including disconnecting it after end of stream and access relative to the open state of the vfio_device. Seems an expanded descriptions somewhere near the declaration in vfio_device_ops would be appropriate. > > > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL, > > > + * and take no action. However the device_state will be updated with the current > > > + * value. > > > + * > > > + * Return: 0 on success, -1 and errno set on failure. > > > + */ > > > +struct vfio_device_mig_set_state { > > > + __u32 argsz; > > > + __u32 device_state; > > > + __s32 data_fd; > > > + __u32 flags; > > > +}; > > > > argsz and flags layout is inconsistent with all other vfio ioctls. > > OK > > > > > > + > > > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21) > > > > Did you consider whether this could also be implemented as a > > VFIO_DEVICE_FEATURE? Seems the feature struct would just be > > device_state and data_fd. Perhaps there's a use case for GET as well. > > Thanks, > > Only briefly.. > > I'm not sure what the overall VFIO vision is here.. Are we abandoning > traditional ioctls in favour of a multiplexer? Calling the multiplexer > ioctl "feature" is a bit odd.. Is it really? VF Token support is a feature that a device might have and we can use the same interface to probe that it exists as well as set the UUID token. We're using it to manipulate the state of a device feature. If we're only looking for a means to expose that a device has support for something, our options are a flag bit on the vfio_device_info or a capability on that ioctl. It's arguable that the latter might be a better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is only to return a flags field, ie. we're not interacting with a feature, we're exposing a capability with fixed properties. However as we move to MIG_SET_SET, well now we are interacting with a feature of the device and there's really nothing unique about the calling convention that would demand that we define a stand alone ioctl. > It complicates the user code a bit, it is more complicated to invoke the > VFIO_DEVICE_FEATURE (check the qemu patch to see the difference). Is it really any more than some wrapper code? Are there objections to this sort of multiplexer? As I was working on the VF Token support, it felt like a fairly small device feature and I didn't want to set a precedent of cluttering our ioctl space with every niche little feature. The s390 folks have some proposals on list for using features and I'm tempted to suggest it to Abhishek as well for their implementation of D3cold support. > Either way I don't have a strong opinion, please have a think and let > us know which you'd like to follow. I'm leaning towards a capability for migration support flags and a feature for setting the state, but let me know if this looks like a bad idea for some reason. Thanks, Alex
On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote: > Ok, let me parrot back to see if I understand. -ENOTTY will be > returned if the ioctl doesn't exist, in which case device_state is > untouched and cannot be trusted. At the same time, we expect the user > to use the feature ioctl to make sure the ioctl exists, so it would > seem that we've reclaimed that errno if we believe the user should > follow the protocol. I don't follow - the documentation says what the code does, if you get ENOTTY returned then you don't get the device_state too. Saying the user shouldn't have called it in the first place is completely correct, but doesn't change the device_state output. > + if (!device->ops->migration_set_state) > + return -EOPNOTSUPP; > > Should return -ENOTTY, just as the feature does. As far as I know the kernel 'standard' is: - ENOTTY if the ioctl cmd # itself is not understood - E2BIG if the ioctl arg is longer than the kernel understands - EOPNOTSUPP if the ioctl arg contains data the kernel doesn't understand (eg flags the kernel doesn't know about), or the kernel understands the request but cannot support it for some reason. - EINVAL if the ioctl arg contains data the kernel knows about but rejects (ie invalid combinations of flags) VFIO kind of has its own thing, but I'm not entirely sure what the rules are, eg you asked for EOPNOTSUPP in the other patch, and here we are asking for ENOTTY? But sure, lets make it ENOTTY. > But it's also for future unsupported ops, but couldn't we also > specify that the driver must fill final_state with the current > device state for any such case. We also have this: > > + if (set_state.argsz < minsz || set_state.flags) > + return -EOPNOTSUPP; > > Which I think should be -EINVAL. That would match the majority of other VFIO tests. > That leaves -EFAULT, for example: > > + if (copy_from_user(&set_state, arg, minsz)) > + return -EFAULT; > > Should we be able to know the current device state in core code such > that we can fill in device state here? There is no point in doing a copy_to_user() to the same memory if a copy_from_user() failed, so device_state will still not be returned. We don't know the device_state in the core code because it can only be read under locking that is controlled by the driver. I hope when we get another driver merged that we can hoist the locking, but right now I'm not really sure - it is a complicated lock. > I think those changes would go a ways towards fully specified behavior > instead of these wishy washy unreliable return values. Then we could Huh? It is fully specified already. These changes just removed EOPNOTSUPP from the list where device_state isn't filled in. It is OK, but it is not really different... > "If this function fails and returns -1 then..." > > Could we clarify that to s/function/ioctl/? It caused me a moment of > confusion for the returned -errnos. Sure. > > > Should we be bumping a reference on the device FD such that we can't > > > have outstanding migration FDs with the device closed (and > > > re-assigned to a new user)? > > > > The driver must ensure any activity triggered by the migration FD > > against the vfio_device is halted before close_device() returns, just > > like basically everything else connected to open/close_device(). mlx5 > > does this by using the same EOF sanitizing the FSM logic uses. > > > > Once sanitized the f_ops should not be touching the vfio_device, or > > even have a pointer to it, so there is no reason to connect the two > > FDs together. I'd say it is a red flag if a driver proposes to do > > this, likely it means it has a problem with the open/close_device() > > lifetime model. > > Maybe we just need a paragraph somewhere to describe the driver > responsibilities and expectations in managing the migration FD, > including disconnecting it after end of stream and access relative to > the open state of the vfio_device. Seems an expanded descriptions > somewhere near the declaration in vfio_device_ops would be appropriate. Yes that is probably better than in the uapi header. > > I'm not sure what the overall VFIO vision is here.. Are we abandoning > > traditional ioctls in favour of a multiplexer? Calling the multiplexer > > ioctl "feature" is a bit odd.. > > Is it really? VF Token support is a feature that a device might have > and we can use the same interface to probe that it exists as well as > set the UUID token. We're using it to manipulate the state of a device > feature. > > If we're only looking for a means to expose that a device has support > for something, our options are a flag bit on the vfio_device_info or a > capability on that ioctl. It's arguable that the latter might be a > better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is > only to return a flags field, ie. we're not interacting with a feature, > we're exposing a capability with fixed properties. I looked at this, and decided against it on practical reasons. I've organized this so the core code can do more work for the driver, which means the core code supplies the support info back to userspace. VFIO_DEVICE_INFO is currently open coded in every single driver and lifting that to get the same support looks like a huge pain. Even if we try to work it backwards somehow, we'd need to re-organize vfio-pci so other drivers can contribute to the cap chain - which is another ugly looking thing. On top of that, qemu becomes much less straightforward as we have to piggy back on the existing vfio code instead of just doing a simple ioctl to get back the small support info back. There is even an unpleasing mandatory user/kernel memory allocation and double ioctl in the caps path. The feature approach is much better, it has a much cleaner implementation in user/kernel. I think we should focus on it going forward and freeze caps. > > It complicates the user code a bit, it is more complicated to invoke the > > VFIO_DEVICE_FEATURE (check the qemu patch to see the difference). > > Is it really any more than some wrapper code? Are there objections to > this sort of multiplexer? There isn't too much reason to do this kind of stuff. Each subsystem gets something like 4 million ioctl numbers within its type, we will never run out of unique ioctls. Normal ioctls have a nice simplicity to them, adding layers creates complexity, feature is defiantly more complex to implement, and cap is a whole other level of more complex. None of this is necessary. I don't know what "cluttering" means here, I'd prefer we focus on things that give clean code and simple implementations than arbitary aesthetics. > > Either way I don't have a strong opinion, please have a think and let > > us know which you'd like to follow. > > I'm leaning towards a capability for migration support flags and a > feature for setting the state, but let me know if this looks like a bad > idea for some reason. Thanks, I don't want to touch capabilities, but we can try to use feature for set state. Please confirm this is what you want. You'll want the same for the PRE_COPY related information too? If we are into these very minor nitpicks does this mean you are OK with all the big topics now? Jason
On Tue, 1 Feb 2022 14:36:20 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote: > > > Ok, let me parrot back to see if I understand. -ENOTTY will be > > returned if the ioctl doesn't exist, in which case device_state is > > untouched and cannot be trusted. At the same time, we expect the user > > to use the feature ioctl to make sure the ioctl exists, so it would > > seem that we've reclaimed that errno if we believe the user should > > follow the protocol. > > I don't follow - the documentation says what the code does, if you get > ENOTTY returned then you don't get the device_state too. Saying the > user shouldn't have called it in the first place is completely > correct, but doesn't change the device_state output. The documentation says "...the device state output is not reliable", and I have to question whether this qualifies as a well specified, interoperable spec with such language. We're essentially asking users to keep track that certain errnos result in certain fields of the structure _maybe_ being invalid. > > + if (!device->ops->migration_set_state) > > + return -EOPNOTSUPP; > > > > Should return -ENOTTY, just as the feature does. > > As far as I know the kernel 'standard' is: > - ENOTTY if the ioctl cmd # itself is not understood > - E2BIG if the ioctl arg is longer than the kernel understands > - EOPNOTSUPP if the ioctl arg contains data the kernel doesn't > understand (eg flags the kernel doesn't know about), or the > kernel understands the request but cannot support it for some > reason. > - EINVAL if the ioctl arg contains data the kernel knows about but > rejects (ie invalid combinations of flags) > > VFIO kind of has its own thing, but I'm not entirely sure what the > rules are, eg you asked for EOPNOTSUPP in the other patch, and here we > are asking for ENOTTY? > > But sure, lets make it ENOTTY. I'd move your first example of EOPNOTSUPP to EINVAL. To me, the user providing bits/fields/values that are undefined in an invalid argument. I've typically steered away from the extended errnos in favor of things in the base set, so as you noted, there are currently no instances of EOPNOTSUPP in vfio. In the case we discussed of a user trying to do SET/GET on a feature that only supports GET/SET could go either way, it's an invalid argument for the feature and in this case the user can determine the supported arguments via the PROBE interface. But when I start seeing multiple tests that all result in an EINVAL return, then I wonder if a different errno might help user debugging. EINVAL is acceptable in the case I noted, but maybe another errno could be more descriptive. In the immediate example here, userspace really has no reason to see a difference in the ioctl between lack of kernel support for migration altogether and lack of device support for migration. So I'd fall back to the ioctl is not known "for this device", -ENOTTY. Now you're making me wonder how much I care to invest in semantic arguments over extended errnos :-\ > > But it's also for future unsupported ops, but couldn't we also > > specify that the driver must fill final_state with the current > > device state for any such case. We also have this: > > > > + if (set_state.argsz < minsz || set_state.flags) > > + return -EOPNOTSUPP; > > > > Which I think should be -EINVAL. > > That would match the majority of other VFIO tests. > > > That leaves -EFAULT, for example: > > > > + if (copy_from_user(&set_state, arg, minsz)) > > + return -EFAULT; > > > > Should we be able to know the current device state in core code such > > that we can fill in device state here? > > There is no point in doing a copy_to_user() to the same memory if a > copy_from_user() failed, so device_state will still not be returned. Duh, good point. > We don't know the device_state in the core code because it can only be > read under locking that is controlled by the driver. I hope when we > get another driver merged that we can hoist the locking, but right now > I'm not really sure - it is a complicated lock. The device cannot self transition to a new state, so if the core were to serialize this ioctl then the device_state provided by the driver is valid, regardless of its internal locking. Whether this ioctl should be serialized anyway is probably another good topic to breach. Should a user be able to have concurrent ioctls setting conflicting states? > > I think those changes would go a ways towards fully specified behavior > > instead of these wishy washy unreliable return values. Then we could > > Huh? It is fully specified already. These changes just removed > EOPNOTSUPP from the list where device_state isn't filled in. It is OK, > but it is not really different... Hmm, "output is not reliable" is fully specified? We can't really make use of return flags to identify valid fields either since the copy-out might fault. I'd suggest that ioctl return structure is only valid at all on success and we add a GET interface to return the current device state on errno given the argument above that driver locking is irrelevant because the device cannot self transition. > > "If this function fails and returns -1 then..." > > > > Could we clarify that to s/function/ioctl/? It caused me a moment of > > confusion for the returned -errnos. > > Sure. > > > > > Should we be bumping a reference on the device FD such that we can't > > > > have outstanding migration FDs with the device closed (and > > > > re-assigned to a new user)? > > > > > > The driver must ensure any activity triggered by the migration FD > > > against the vfio_device is halted before close_device() returns, just > > > like basically everything else connected to open/close_device(). mlx5 > > > does this by using the same EOF sanitizing the FSM logic uses. > > > > > > Once sanitized the f_ops should not be touching the vfio_device, or > > > even have a pointer to it, so there is no reason to connect the two > > > FDs together. I'd say it is a red flag if a driver proposes to do > > > this, likely it means it has a problem with the open/close_device() > > > lifetime model. > > > > Maybe we just need a paragraph somewhere to describe the driver > > responsibilities and expectations in managing the migration FD, > > including disconnecting it after end of stream and access relative to > > the open state of the vfio_device. Seems an expanded descriptions > > somewhere near the declaration in vfio_device_ops would be appropriate. > > Yes that is probably better than in the uapi header. > > > > I'm not sure what the overall VFIO vision is here.. Are we abandoning > > > traditional ioctls in favour of a multiplexer? Calling the multiplexer > > > ioctl "feature" is a bit odd.. > > > > Is it really? VF Token support is a feature that a device might have > > and we can use the same interface to probe that it exists as well as > > set the UUID token. We're using it to manipulate the state of a device > > feature. > > > > If we're only looking for a means to expose that a device has support > > for something, our options are a flag bit on the vfio_device_info or a > > capability on that ioctl. It's arguable that the latter might be a > > better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is > > only to return a flags field, ie. we're not interacting with a feature, > > we're exposing a capability with fixed properties. > > I looked at this, and decided against it on practical reasons. > > I've organized this so the core code can do more work for the driver, > which means the core code supplies the support info back to > userspace. VFIO_DEVICE_INFO is currently open coded in every single > driver and lifting that to get the same support looks like a huge > pain. Even if we try to work it backwards somehow, we'd need to > re-organize vfio-pci so other drivers can contribute to the cap chain - > which is another ugly looking thing. > > On top of that, qemu becomes much less straightforward as we have to > piggy back on the existing vfio code instead of just doing a simple > ioctl to get back the small support info back. There is even an > unpleasing mandatory user/kernel memory allocation and double ioctl in > the caps path. > > The feature approach is much better, it has a much cleaner > implementation in user/kernel. I think we should focus on it going > forward and freeze caps. Ok, I'm not demanding a capability interface. > > > It complicates the user code a bit, it is more complicated to invoke the > > > VFIO_DEVICE_FEATURE (check the qemu patch to see the difference). > > > > Is it really any more than some wrapper code? Are there objections to > > this sort of multiplexer? > > There isn't too much reason to do this kind of stuff. Each subsystem > gets something like 4 million ioctl numbers within its type, we will > never run out of unique ioctls. > > Normal ioctls have a nice simplicity to them, adding layers creates > complexity, feature is defiantly more complex to implement, and cap > is a whole other level of more complex. None of this is necessary. > > I don't know what "cluttering" means here, I'd prefer we focus on > things that give clean code and simple implementations than arbitary > aesthetics. It's entirely possible that I'm overly averse to ioctl proliferation, but for every new ioctl we need to take a critical look at the proposed API, use case, applicability, and extensibility. That isn't entirely removed when we use something like this generic feature ioctl, but I consider it substantially reduced since we're working within an existing framework. A direct ioctl might be able to slightly streamline the interface (I don't think that significantly matters in this case), but on the other hand, defining this as a feature within an existing interface provides consistency and compartmentalization. > > > Either way I don't have a strong opinion, please have a think and let > > > us know which you'd like to follow. > > > > I'm leaning towards a capability for migration support flags and a > > feature for setting the state, but let me know if this looks like a bad > > idea for some reason. Thanks, > > I don't want to touch capabilities, but we can try to use feature for > set state. Please confirm this is what you want. It's a team sport, but to me it seems like it fits well both in my mental model of interacting with a device feature, without significantly altering the uAPI you're defining anyway. > You'll want the same for the PRE_COPY related information too? I hadn't gotten there yet. It seems like a discontinuity to me that we're handing out new FDs for data transfer sessions, but then we require the user to come back to the device to query about the data its reading through that other FD. Should that be an ioctl on the data stream FD itself? Is there a use case for also having it on the STOP_COPY FD? > If we are into these very minor nitpicks does this mean you are OK > with all the big topics now? I'm not hating it, but I'd like to see buy-in from others who have a vested interest in supporting migration. I don't see Intel or Huawei on the Cc list and the original collaborators of the v1 interface from NVIDIA have been silent through this redesign. Thanks, Alex
On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote: > On Tue, 1 Feb 2022 14:36:20 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote: > > > > > Ok, let me parrot back to see if I understand. -ENOTTY will be > > > returned if the ioctl doesn't exist, in which case device_state is > > > untouched and cannot be trusted. At the same time, we expect the user > > > to use the feature ioctl to make sure the ioctl exists, so it would > > > seem that we've reclaimed that errno if we believe the user should > > > follow the protocol. > > > > I don't follow - the documentation says what the code does, if you get > > ENOTTY returned then you don't get the device_state too. Saying the > > user shouldn't have called it in the first place is completely > > correct, but doesn't change the device_state output. > > The documentation says "...the device state output is not reliable", and > I have to question whether this qualifies as a well specified, > interoperable spec with such language. We're essentially asking users > to keep track that certain errnos result in certain fields of the > structure _maybe_ being invalid. So you are asking to remove "is not reliable" and just phrase is as: "device_state is updated to the current value when -1 is returned, except when these XXX errnos are returned? (actually userspace can tell directly without checking the errno - as if -1 is returned the device_state cannot be the requested target state anyhow) > Now you're making me wonder how much I care to invest in semantic > arguments over extended errnos :-\ Well, I know I don't :) We don't have consistency in the kernel and userspace is hard pressed to make any sense of it most of the time, IMHO. It just doesn't practically matter.. > > We don't know the device_state in the core code because it can only be > > read under locking that is controlled by the driver. I hope when we > > get another driver merged that we can hoist the locking, but right now > > I'm not really sure - it is a complicated lock. > > The device cannot self transition to a new state, so if the core were > to serialize this ioctl then the device_state provided by the driver is > valid, regardless of its internal locking. It is allowed to transition to RUNNING due to reset events it captures and since we capture the reset through the PCI hook, not from VFIO, the core code doesn't synchronize well. See patch 14 > Whether this ioctl should be serialized anyway is probably another good > topic to breach. Should a user be able to have concurrent ioctls > setting conflicting states? The driver is required to serialize, the core code doesn't touch any global state and doesn't need serializing. > I'd suggest that ioctl return structure is only valid at all on > success and we add a GET interface to return the current device We can do this too, but it is a bunch of code to achieve this and I don't have any use case to read back the device_state beyond debugging and debugging is fine with this. IMHO > It's entirely possible that I'm overly averse to ioctl proliferation, > but for every new ioctl we need to take a critical look at the proposed > API, use case, applicability, and extensibility. This is all basicly the same no matter where it is put, the feature multiplexer is just an ioctl in some semi-standard format, but the vfio pattern of argsz/flags is also a standard format that is basically the same thing. We still need to think about extensibility, alignment, etc.. The problem I usually see with ioctls is not proliferation, but ending up with too many choices and a big ?? when it comes to adding something new. Clear rules where things should go and why is the best, it matters less what the rules actually are IMHO. > > I don't want to touch capabilities, but we can try to use feature for > > set state. Please confirm this is what you want. > > It's a team sport, but to me it seems like it fits well both in my > mental model of interacting with a device feature, without > significantly altering the uAPI you're defining anyway. Well, my advice is that ioctls are fine, and a bit easier all around. eg strace and syzkaller are a bit easier if everything neatly maps into one struct per ioctl - their generator tools are optimized for this common case. Simple multiplexors are next-best-fine, but there should be a clear idea when to use the multiplexer, or not. Things like the cap chains enter a whole world of adventure for strace/syzkaller :) > > You'll want the same for the PRE_COPY related information too? > > I hadn't gotten there yet. It seems like a discontinuity to me that > we're handing out new FDs for data transfer sessions, but then we > require the user to come back to the device to query about the data its > reading through that other FD. An earlier draft of this put it on the data FD, but v6 made it fully optional with no functional impact on the data FD. The values decrease as the data FD progresses and increases as the VM dirties data - ie it is 50/50 data_fd/device behavior. It doesn't matter which way, but it feels quite weird to have the main state function is a FEATURE and the precopy query is an ioctl. > Should that be an ioctl on the data stream FD itself? I can be. Implementation wise it is about a wash. > Is there a use case for also having it on the STOP_COPY FD? I didn't think of one worthwhile enough to mandate implementing it in every driver. > > If we are into these very minor nitpicks does this mean you are OK > > with all the big topics now? > > I'm not hating it, but I'd like to see buy-in from others who have a > vested interest in supporting migration. I don't see Intel or Huawei > on the Cc list and the original collaborators of the v1 interface > from That is an oversight, I'll ping them. I think people have been staying away until the dust settles. > NVIDIA have been silent through this redesign. We've reviewed this internally with them. They reserve judgement on the data transfer performance until they work on it, but functionally it has all the necessary semantics. They have the same P2P issue mlx5 does, and are happy with the solution under the same general provisions as already discussed for the Huawei device - RUNNING_P2P is sustainable only while the device is not touched - ie the VCPU is halted. The f_ops implemenation we used for mlx5 is basic, the full performance version would want to use the read/write_iter() fop with async completions to support the modern zero-copy iouring based data motion in userspace. This is all part of the standard FD abstraction and why it is appealing to use it. Thanks, Jason
On Tue, 1 Feb 2022 20:24:59 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote: > > On Tue, 1 Feb 2022 14:36:20 -0400 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote: > > > > > > > Ok, let me parrot back to see if I understand. -ENOTTY will be > > > > returned if the ioctl doesn't exist, in which case device_state is > > > > untouched and cannot be trusted. At the same time, we expect the user > > > > to use the feature ioctl to make sure the ioctl exists, so it would > > > > seem that we've reclaimed that errno if we believe the user should > > > > follow the protocol. > > > > > > I don't follow - the documentation says what the code does, if you get > > > ENOTTY returned then you don't get the device_state too. Saying the > > > user shouldn't have called it in the first place is completely > > > correct, but doesn't change the device_state output. > > > > The documentation says "...the device state output is not reliable", and > > I have to question whether this qualifies as a well specified, > > interoperable spec with such language. We're essentially asking users > > to keep track that certain errnos result in certain fields of the > > structure _maybe_ being invalid. > > So you are asking to remove "is not reliable" and just phrase is as: > > "device_state is updated to the current value when -1 is returned, > except when these XXX errnos are returned? > > (actually userspace can tell directly without checking the errno - as > if -1 is returned the device_state cannot be the requested target > state anyhow) If we decide to keep the existing code, then yes the spec should indicate the device_state is invalid, not just unreliable for those errnos, but I'm also of the opinion that returning an error condition AND providing valid data in the return structure for all but a few errnos and expecting userspace to get this correct is not a good API. > > Now you're making me wonder how much I care to invest in semantic > > arguments over extended errnos :-\ > > Well, I know I don't :) We don't have consistency in the kernel and > userspace is hard pressed to make any sense of it most of the time, > IMHO. It just doesn't practically matter.. > > > > We don't know the device_state in the core code because it can only be > > > read under locking that is controlled by the driver. I hope when we > > > get another driver merged that we can hoist the locking, but right now > > > I'm not really sure - it is a complicated lock. > > > > The device cannot self transition to a new state, so if the core were > > to serialize this ioctl then the device_state provided by the driver is > > valid, regardless of its internal locking. > > It is allowed to transition to RUNNING due to reset events it captures > and since we capture the reset through the PCI hook, not from VFIO, > the core code doesn't synchronize well. See patch 14 Looking... your .reset_done() function sets a deferred_reset flag and attempts to grab the state_mutex. If there's contention on that mutex, exit since the lock holder will perform the state transition when dropping that mutex, otherwise reset_done will itself drop the mutex to do that state change. The reset_lock assures that we cannot race as the state_mutex is being released. So the scenario is that the user MUST be performing a reset coincident to accessing the device_state and the solution is that the user's SET_STATE returns success and a new device state that's already bogus due to the reset. Why wouldn't the solution here be to return -EAGAIN to the user or reattempt the SET_STATE since the user is clearly now disconnected from the actual device_state? > > Whether this ioctl should be serialized anyway is probably another good > > topic to breach. Should a user be able to have concurrent ioctls > > setting conflicting states? > > The driver is required to serialize, the core code doesn't touch any > global state and doesn't need serializing. > > > I'd suggest that ioctl return structure is only valid at all on > > success and we add a GET interface to return the current device > > We can do this too, but it is a bunch of code to achieve this and I > don't have any use case to read back the device_state beyond debugging > and debugging is fine with this. IMHO A bunch of code? If we use a FEATURE ioctl, it just extends the existing implementation to add GET support. That looks rather trivial. That seems like a selling point for using the FEATURE ioctl TBH. > > It's entirely possible that I'm overly averse to ioctl proliferation, > > but for every new ioctl we need to take a critical look at the proposed > > API, use case, applicability, and extensibility. > > This is all basicly the same no matter where it is put, the feature > multiplexer is just an ioctl in some semi-standard format, but the > vfio pattern of argsz/flags is also a standard format that is > basically the same thing. > > We still need to think about extensibility, alignment, etc.. > > The problem I usually see with ioctls is not proliferation, but ending > up with too many choices and a big ?? when it comes to adding > something new. > > Clear rules where things should go and why is the best, it matters > less what the rules actually are IMHO. > > > > I don't want to touch capabilities, but we can try to use feature for > > > set state. Please confirm this is what you want. > > > > It's a team sport, but to me it seems like it fits well both in my > > mental model of interacting with a device feature, without > > significantly altering the uAPI you're defining anyway. > > Well, my advice is that ioctls are fine, and a bit easier all around. > eg strace and syzkaller are a bit easier if everything neatly maps > into one struct per ioctl - their generator tools are optimized for > this common case. > > Simple multiplexors are next-best-fine, but there should be a clear > idea when to use the multiplexer, or not. > > Things like the cap chains enter a whole world of adventure for > strace/syzkaller :) vfio's argsz/flags is not only a standard framework, but it's one that promotes extensions. We were able to add capability chains with backwards compatibility because of this design. IMO, that's avoided ioctl sprawl; we've been able to maintain a fairly small set of core ioctls rather than add add a new ioctl every time we want to describe some new property of a device or region or IOMMU. I think that improves the usability of the uAPI. I certainly wouldn't want to program to a uAPI with a million ioctls. A counter argument is that we're making the interface more complex, but at the same time we're adding shared infrastructure for dealing with that complexity. Of course we do continue to add new ioctls as necessary, including this FEATURE ioctl, and I recognize that with such a generic multiplexer we run the risk of over using it, ie. everything looks like a nail. You initially did not see the fit for setting device state as interacting with a device feature, but it doesn't seem like you had a strong objection to my explanation of it in that context. So I think if the FEATURE ioctl has an ongoing place in our uAPI (using it to expose migration flags would seem to be a point in that direction) and it doesn't require too many contortions to think of the operation we're trying to perform on the device as interacting with a device FEATURE, and there are no functional or performance implications of it, I would think we should use it. To do otherwise would suggest that we should consider the FEATURE ioctl a failed experiment and not continue to expand its use. I'd be interested to hear more input on this from the community. > > > You'll want the same for the PRE_COPY related information too? > > > > I hadn't gotten there yet. It seems like a discontinuity to me that > > we're handing out new FDs for data transfer sessions, but then we > > require the user to come back to the device to query about the data its > > reading through that other FD. > > An earlier draft of this put it on the data FD, but v6 made it fully > optional with no functional impact on the data FD. The values decrease > as the data FD progresses and increases as the VM dirties data - ie it > is 50/50 data_fd/device behavior. > > It doesn't matter which way, but it feels quite weird to have the main > state function is a FEATURE and the precopy query is an ioctl. If the main state function were a FEATURE ioctl on the device and the data transfer query was an ioctl on the FD returned from that feature ioctl, I don't see how that's weird at all. Different FDs, different interfaces. To me, the device has provided a separate FD for data transfer, so the fact that we consume the data via that FD, but monitor our progress in consuming that data back on the device FD is a bit strange. > > Should that be an ioctl on the data stream FD itself? > > I can be. Implementation wise it is about a wash. > > > Is there a use case for also having it on the STOP_COPY FD? > > I didn't think of one worthwhile enough to mandate implementing it in > every driver. Can the user perform an lseek(2) on the migration FD? Maybe that would be the difference between what we need for PRE_COPY vs STOP_COPY. In the latter case the data should be a fixes size and perhaps we don't need another interface to know how much data to expect. One use case would be that we want to be able to detect whether we can meet service guarantees as quickly as possible with the minimum resource consumption and downtime. If we can determine from the device that we can't possibly transfer its state in the required time, we can abort immediately without waiting for a downtime exception or flooding the migration link. Thanks, Alex
On Wed, Feb 02, 2022 at 04:36:56PM -0700, Alex Williamson wrote: > > So you are asking to remove "is not reliable" and just phrase is as: > > > > "device_state is updated to the current value when -1 is returned, > > except when these XXX errnos are returned? > > > > (actually userspace can tell directly without checking the errno - as > > if -1 is returned the device_state cannot be the requested target > > state anyhow) > > If we decide to keep the existing code, then yes the spec should > indicate the device_state is invalid, not just unreliable for those > errnos, but I'm also of the opinion that returning an error condition > AND providing valid data in the return structure for all but a few > errnos and expecting userspace to get this correct is not a good API. It was done this way because we didn't see any use case for the reading the device_state except debugging, and adding another ioctl and driver op just to get the device_state without a real user looked like overkill. As you already analyzed, despite the scary label in the comment, this return is actually fully reliable so long as the userspace is operating the API correctly - eg checking the feature flag and so on. So, it is not as scary as you are making it out to be - and yes maybe GET on FEATURE is cleaner. > > It is allowed to transition to RUNNING due to reset events it captures > > and since we capture the reset through the PCI hook, not from VFIO, > > the core code doesn't synchronize well. See patch 14 > > Looking... your .reset_done() function sets a deferred_reset flag and > attempts to grab the state_mutex. If there's contention on that mutex, > exit since the lock holder will perform the state transition when > dropping that mutex, otherwise reset_done will itself drop the mutex to > do that state change. The reset_lock assures that we cannot race as the > state_mutex is being released. > > So the scenario is that the user MUST be performing a reset coincident > to accessing the device_state and the solution is that the user's > SET_STATE returns success and a new device state that's already bogus > due to the reset. Er, no, you suggested the core code could just cache the return since it cannot change and then use that cached value as though it is correct. As reset happens outside the core call chain's view it means any core cache becomes out of sync. It is not a race, it just means we can't cache the value in the core. > Why wouldn't the solution here be to return -EAGAIN to the user or > reattempt the SET_STATE since the user is clearly now disconnected > from the actual device_state? This is just a race that the user inflicted on themselves. We protect kernel integrity and choose to resolve the race as though the set_state happened first in time and the reset happened second in time. The API is not designed to be used concurrently, so it is a user error if they hit this. > > We can do this too, but it is a bunch of code to achieve this and I > > don't have any use case to read back the device_state beyond debugging > > and debugging is fine with this. IMHO > > A bunch of code? If we use a FEATURE ioctl, it just extends the > existing implementation to add GET support. That looks rather trivial. > That seems like a selling point for using the FEATURE ioctl TBH. We didn't even define a driver op to return the current state, trivial code yes, but code nonetheless. > > Things like the cap chains enter a whole world of adventure for > > strace/syzkaller :) > > vfio's argsz/flags is not only a standard framework, but it's one that > promotes extensions. We were able to add capability chains with > backwards compatibility because of this design. IHMO the formal cap chains in the INFO ioctls were a mistake. The argsz/flags already provide enough extension capability to return the few extra fields directly by growing the main struct through argsz and that handles most of what is in the caps. The few variable size caps, like iova ranges, would have been simpler as system calls that return only that data. This avoids userspace from having to do all the memory allocation stuff just to read a single u32 when they don't have an interest in, say, ranges. > initially did not see the fit for setting device state as interacting > with a device feature, but it doesn't seem like you had a strong > objection to my explanation of it in that context. I don't have a strong feeling here. I think as the maintainer you should just set a clear philosophy for ioctls in VFIO and communicate it. There are many choices, most are reasonable. We tried the FEATURE path, and it is OK of course, but it looks weird as set_state is in/out due to the data_fd but it being used with SET. I can't say that it is any better, and diffstate says it is more code. > > > Should that be an ioctl on the data stream FD itself? > > > > I can be. Implementation wise it is about a wash. > > > > > Is there a use case for also having it on the STOP_COPY FD? > > > > I didn't think of one worthwhile enough to mandate implementing it in > > every driver. > > Can the user perform an lseek(2) on the migration FD? Maybe that would > be the difference between what we need for PRE_COPY vs STOP_COPY. In > the latter case the data should be a fixes size and perhaps we don't > need another interface to know how much data to expect. I'm leary to abuse the FD interface this way, we setup the FD as noseek, like a pipe, and the core fd code has some understanding of this. > One use case would be that we want to be able to detect whether we can > meet service guarantees as quickly as possible with the minimum > resource consumption and downtime. If we can determine from the device > that we can't possibly transfer its state in the required time, we can > abort immediately without waiting for a downtime exception or flooding > the migration link. Thanks, It is an idea, but I don't know how to translate bytes to time, we don't know how fast the device can generate the data for instance. Jason
On 2/2/2022 5:54 AM, Jason Gunthorpe wrote: > External email: Use caution opening links or attachments > > > On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote: >> On Tue, 1 Feb 2022 14:36:20 -0400 >> Jason Gunthorpe <jgg@nvidia.com> wrote: >> >>> On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote: >>> >>>> Ok, let me parrot back to see if I understand. -ENOTTY will be >>>> returned if the ioctl doesn't exist, in which case device_state is >>>> untouched and cannot be trusted. At the same time, we expect the user >>>> to use the feature ioctl to make sure the ioctl exists, so it would >>>> seem that we've reclaimed that errno if we believe the user should >>>> follow the protocol. >>> >>> I don't follow - the documentation says what the code does, if you get >>> ENOTTY returned then you don't get the device_state too. Saying the >>> user shouldn't have called it in the first place is completely >>> correct, but doesn't change the device_state output. >> >> The documentation says "...the device state output is not reliable", and >> I have to question whether this qualifies as a well specified, >> interoperable spec with such language. We're essentially asking users >> to keep track that certain errnos result in certain fields of the >> structure _maybe_ being invalid. > > So you are asking to remove "is not reliable" and just phrase is as: > > "device_state is updated to the current value when -1 is returned, > except when these XXX errnos are returned? > > (actually userspace can tell directly without checking the errno - as > if -1 is returned the device_state cannot be the requested target > state anyhow) > >> Now you're making me wonder how much I care to invest in semantic >> arguments over extended errnos :-\ > > Well, I know I don't :) We don't have consistency in the kernel and > userspace is hard pressed to make any sense of it most of the time, > IMHO. It just doesn't practically matter.. > >>> We don't know the device_state in the core code because it can only be >>> read under locking that is controlled by the driver. I hope when we >>> get another driver merged that we can hoist the locking, but right now >>> I'm not really sure - it is a complicated lock. >> >> The device cannot self transition to a new state, so if the core were >> to serialize this ioctl then the device_state provided by the driver is >> valid, regardless of its internal locking. > > It is allowed to transition to RUNNING due to reset events it captures > and since we capture the reset through the PCI hook, not from VFIO, > the core code doesn't synchronize well. See patch 14 > >> Whether this ioctl should be serialized anyway is probably another good >> topic to breach. Should a user be able to have concurrent ioctls >> setting conflicting states? > > The driver is required to serialize, the core code doesn't touch any > global state and doesn't need serializing. > >> I'd suggest that ioctl return structure is only valid at all on >> success and we add a GET interface to return the current device > > We can do this too, but it is a bunch of code to achieve this and I > don't have any use case to read back the device_state beyond debugging > and debugging is fine with this. IMHO > >> It's entirely possible that I'm overly averse to ioctl proliferation, >> but for every new ioctl we need to take a critical look at the proposed >> API, use case, applicability, and extensibility. > > This is all basicly the same no matter where it is put, the feature > multiplexer is just an ioctl in some semi-standard format, but the > vfio pattern of argsz/flags is also a standard format that is > basically the same thing. > > We still need to think about extensibility, alignment, etc.. > > The problem I usually see with ioctls is not proliferation, but ending > up with too many choices and a big ?? when it comes to adding > something new. > > Clear rules where things should go and why is the best, it matters > less what the rules actually are IMHO. > >>> I don't want to touch capabilities, but we can try to use feature for >>> set state. Please confirm this is what you want. >> >> It's a team sport, but to me it seems like it fits well both in my >> mental model of interacting with a device feature, without >> significantly altering the uAPI you're defining anyway. > > Well, my advice is that ioctls are fine, and a bit easier all around. > eg strace and syzkaller are a bit easier if everything neatly maps > into one struct per ioctl - their generator tools are optimized for > this common case. > > Simple multiplexors are next-best-fine, but there should be a clear > idea when to use the multiplexer, or not. > > Things like the cap chains enter a whole world of adventure for > strace/syzkaller :) > >>> You'll want the same for the PRE_COPY related information too? >> >> I hadn't gotten there yet. It seems like a discontinuity to me that >> we're handing out new FDs for data transfer sessions, but then we >> require the user to come back to the device to query about the data its >> reading through that other FD. > > An earlier draft of this put it on the data FD, but v6 made it fully > optional with no functional impact on the data FD. The values decrease > as the data FD progresses and increases as the VM dirties data - ie it > is 50/50 data_fd/device behavior. > > It doesn't matter which way, but it feels quite weird to have the main > state function is a FEATURE and the precopy query is an ioctl. > >> Should that be an ioctl on the data stream FD itself? > > I can be. Implementation wise it is about a wash. > >> Is there a use case for also having it on the STOP_COPY FD? > > I didn't think of one worthwhile enough to mandate implementing it in > every driver. > >>> If we are into these very minor nitpicks does this mean you are OK >>> with all the big topics now? >> >> I'm not hating it, but I'd like to see buy-in from others who have a >> vested interest in supporting migration. I don't see Intel or Huawei >> on the Cc list and the original collaborators of the v1 interface >> from > > That is an oversight, I'll ping them. I think people have been staying > away until the dust settles. > >> NVIDIA have been silent through this redesign. > > We've reviewed this internally with them. They reserve judgement on > the data transfer performance until they work on it, but functionally > it has all the necessary semantics. > Yes, we're reviewing the proposal from vGPU point of view and will update here once we have it figured out for vGPU. Thanks, Tarun > They have the same P2P issue mlx5 does, and are happy with the > solution under the same general provisions as already discussed for > the Huawei device - RUNNING_P2P is sustainable only while the device > is not touched - ie the VCPU is halted. > > The f_ops implemenation we used for mlx5 is basic, the full > performance version would want to use the read/write_iter() fop with > async completions to support the modern zero-copy iouring based data > motion in userspace. This is all part of the standard FD abstraction > and why it is appealing to use it. > > Thanks, > Jason
On Wed, Feb 02 2022, Alex Williamson <alex.williamson@redhat.com> wrote: > On Tue, 1 Feb 2022 20:24:59 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > >> On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote: >> > On Tue, 1 Feb 2022 14:36:20 -0400 >> > Jason Gunthorpe <jgg@nvidia.com> wrote: >> > > I don't want to touch capabilities, but we can try to use feature for >> > > set state. Please confirm this is what you want. >> > >> > It's a team sport, but to me it seems like it fits well both in my >> > mental model of interacting with a device feature, without >> > significantly altering the uAPI you're defining anyway. >> >> Well, my advice is that ioctls are fine, and a bit easier all around. >> eg strace and syzkaller are a bit easier if everything neatly maps >> into one struct per ioctl - their generator tools are optimized for >> this common case. >> >> Simple multiplexors are next-best-fine, but there should be a clear >> idea when to use the multiplexer, or not. >> >> Things like the cap chains enter a whole world of adventure for >> strace/syzkaller :) > > vfio's argsz/flags is not only a standard framework, but it's one that > promotes extensions. We were able to add capability chains with > backwards compatibility because of this design. IMO, that's avoided > ioctl sprawl; we've been able to maintain a fairly small set of core > ioctls rather than add add a new ioctl every time we want to describe > some new property of a device or region or IOMMU. I think that > improves the usability of the uAPI. I certainly wouldn't want to > program to a uAPI with a million ioctls. A counter argument is that > we're making the interface more complex, but at the same time we're > adding shared infrastructure for dealing with that complexity. > > Of course we do continue to add new ioctls as necessary, including this > FEATURE ioctl, and I recognize that with such a generic multiplexer we > run the risk of over using it, ie. everything looks like a nail. You > initially did not see the fit for setting device state as interacting > with a device feature, but it doesn't seem like you had a strong > objection to my explanation of it in that context. > > So I think if the FEATURE ioctl has an ongoing place in our uAPI (using > it to expose migration flags would seem to be a point in that > direction) and it doesn't require too many contortions to think of the > operation we're trying to perform on the device as interacting with a > device FEATURE, and there are no functional or performance implications > of it, I would think we should use it. To do otherwise would suggest > that we should consider the FEATURE ioctl a failed experiment and not > continue to expand its use. > > I'd be interested to hear more input on this from the community. My personal take would be: a new ioctl is more suitable for things that may be implemented by different backends, but in a non-generic way, and for mandatory functionality; the FEATURE ioctl is more suitable for things that either are very specific to a certain backend (i.e. don't reserve an ioctl for something that will only ever be used on one platform), or for things that have a lot of commonality for the backends that implement them (i.e. you are using a familiar scheme to interact with them.) From staring at the code and the discussion here for a bit (I have not yet made my way through all of this except in a superficial way), I'd lean more towards using FEATURE here.
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 71763e2ac561..b12be212d048 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1557,6 +1557,184 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep) return 0; } +/* + * vfio_mig_get_next_state - Compute the next step in the FSM + * @cur_fsm - The current state the device is in + * @new_fsm - The target state to reach + * + * Return the next step in the state progression between cur_fsm and new_fsm. + * This breaks down requests for combination transitions into smaller steps and + * returns the next step to get to new_fsm. The function may need to be called + * multiple times before reaching new_fsm. + * + * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed. + */ +u32 vfio_mig_get_next_state(struct vfio_device *device, + enum vfio_device_mig_state cur_fsm, + enum vfio_device_mig_state new_fsm) +{ + enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 }; + /* + * The coding in this table requires the driver to implement 6 + * FSM arcs: + * RESUMING -> STOP + * RUNNING -> STOP + * STOP -> RESUMING + * STOP -> RUNNING + * STOP -> STOP_COPY + * STOP_COPY -> STOP + * + * The coding will step through multiple states for these combination + * transitions: + * RESUMING -> STOP -> RUNNING + * RESUMING -> STOP -> STOP_COPY + * RUNNING -> STOP -> RESUMING + * RUNNING -> STOP -> STOP_COPY + * STOP_COPY -> STOP -> RESUMING + * STOP_COPY -> STOP -> RUNNING + */ + static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = { + [VFIO_DEVICE_STATE_STOP] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RUNNING] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_STOP_COPY] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RESUMING] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_ERROR] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + }; + if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) || + new_fsm >= ARRAY_SIZE(vfio_from_fsm_table)) + return VFIO_DEVICE_STATE_ERROR; + + return vfio_from_fsm_table[cur_fsm][new_fsm]; +} +EXPORT_SYMBOL_GPL(vfio_mig_get_next_state); + +/* + * Convert the drivers's struct file into a FD number and return it to userspace + */ +static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg, + struct vfio_device_mig_set_state *set_state) +{ + int ret; + int fd; + + fd = get_unused_fd_flags(O_CLOEXEC); + if (fd < 0) { + ret = fd; + goto out_fput; + } + + set_state->data_fd = fd; + if (copy_to_user(arg, set_state, sizeof(*set_state))) { + ret = -EFAULT; + goto out_put_unused; + } + fd_install(fd, filp); + return 0; + +out_put_unused: + put_unused_fd(fd); +out_fput: + fput(filp); + return ret; +} + +static int vfio_ioctl_mig_set_state(struct vfio_device *device, + void __user *arg) +{ + size_t minsz = + offsetofend(struct vfio_device_mig_set_state, flags); + enum vfio_device_mig_state final_state = VFIO_DEVICE_STATE_ERROR; + struct vfio_device_mig_set_state set_state; + struct file *filp; + + if (!device->ops->migration_set_state) + return -EOPNOTSUPP; + + if (copy_from_user(&set_state, arg, minsz)) + return -EFAULT; + + if (set_state.argsz < minsz || set_state.flags) + return -EOPNOTSUPP; + + /* + * It is tempting to try to validate set_state.device_state here, but + * then we can't return final_state. The validation is done in + * vfio_mig_get_next_state(). + */ + filp = device->ops->migration_set_state(device, set_state.device_state, + &final_state); + set_state.device_state = final_state; + if (IS_ERR(filp)) { + if (WARN_ON(PTR_ERR(filp) == -EOPNOTSUPP || + PTR_ERR(filp) == -ENOTTY || + PTR_ERR(filp) == -EFAULT)) + filp = ERR_PTR(-EINVAL); + goto out_copy; + } + + if (!filp) + goto out_copy; + return vfio_ioct_mig_return_fd(filp, arg, &set_state); +out_copy: + set_state.data_fd = -1; + if (copy_to_user(arg, &set_state, sizeof(set_state))) + return -EFAULT; + if (IS_ERR(filp)) + return PTR_ERR(filp); + return 0; +} + +static int vfio_ioctl_device_feature_migration(struct vfio_device *device, + u32 flags, void __user *arg, + size_t argsz) +{ + struct vfio_device_feature_migration mig = { + .flags = VFIO_MIGRATION_STOP_COPY, + }; + int ret; + + if (!device->ops->migration_set_state) + return -ENOTTY; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(mig)); + if (ret != 1) + return ret; + if (copy_to_user(arg, &mig, sizeof(mig))) + return -EFAULT; + return 0; +} + static int vfio_ioctl_device_feature(struct vfio_device *device, struct vfio_device_feature __user *arg) { @@ -1582,6 +1760,10 @@ static int vfio_ioctl_device_feature(struct vfio_device *device, return -EINVAL; switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { + case VFIO_DEVICE_FEATURE_MIGRATION: + return vfio_ioctl_device_feature_migration( + device, feature.flags, arg->data, + feature.argsz - minsz); default: if (unlikely(!device->ops->device_feature)) return -EINVAL; @@ -1597,6 +1779,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep, struct vfio_device *device = filep->private_data; switch (cmd) { + case VFIO_DEVICE_MIG_SET_STATE: + return vfio_ioctl_mig_set_state(device, (void __user *)arg); case VFIO_DEVICE_FEATURE: return vfio_ioctl_device_feature(device, (void __user *)arg); default: diff --git a/include/linux/vfio.h b/include/linux/vfio.h index ca69516f869d..697790ec4065 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -56,6 +56,8 @@ struct vfio_device { * match, -errno for abort (ex. match with insufficient or incorrect * additional args) * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl + * @migration_set_state: Optional callback to change the migration + * state for devices that support migration. */ struct vfio_device_ops { char *name; @@ -72,6 +74,10 @@ struct vfio_device_ops { int (*match)(struct vfio_device *vdev, char *buf); int (*device_feature)(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz); + struct file *(*migration_set_state)( + struct vfio_device *device, + enum vfio_device_mig_state new_state, + enum vfio_device_mig_state *final_state); }; /** @@ -114,6 +120,10 @@ extern void vfio_device_put(struct vfio_device *device); int vfio_assign_device_set(struct vfio_device *device, void *set_id); +u32 vfio_mig_get_next_state(struct vfio_device *device, + enum vfio_device_mig_state cur_fsm, + enum vfio_device_mig_state new_fsm); + /* * External user API */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index ef33ea002b0b..d9162702973a 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid { struct vfio_device_migration_info { __u32 device_state; /* VFIO device state */ -#define VFIO_DEVICE_STATE_STOP (0) -#define VFIO_DEVICE_STATE_RUNNING (1 << 0) -#define VFIO_DEVICE_STATE_SAVING (1 << 1) -#define VFIO_DEVICE_STATE_RESUMING (1 << 2) +#define VFIO_DEVICE_STATE_V1_STOP (0) +#define VFIO_DEVICE_STATE_V1_RUNNING (1 << 0) +#define VFIO_DEVICE_STATE_V1_SAVING (1 << 1) +#define VFIO_DEVICE_STATE_V1_RESUMING (1 << 2) #define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \ VFIO_DEVICE_STATE_SAVING | \ VFIO_DEVICE_STATE_RESUMING) @@ -1002,6 +1002,162 @@ struct vfio_device_feature { */ #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN (0) +/* + * Indicates the device can support the migration API. See enum + * vfio_device_mig_state for details. If present flags must be non-zero and + * VFIO_DEVICE_MIG_SET_STATE is supported. + * + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and + * RESUMING are supported. + */ +struct vfio_device_feature_migration { + __aligned_u64 flags; +#define VFIO_MIGRATION_STOP_COPY (1 << 0) +}; +#define VFIO_DEVICE_FEATURE_MIGRATION 1 + +/* + * The device migration Finite State Machine is described by the enum + * vfio_device_mig_state. Some of the FSM arcs will create a migration data + * transfer session by returning a FD, in this case the migration data will + * flow over the FD using read() and write() as discussed below. + * + * There are 5 states to support VFIO_MIGRATION_STOP_COPY: + * RUNNING - The device is running normally + * STOP - The device does not change the internal or external state + * STOP_COPY - The device internal state can be read out + * RESUMING - The device is stopped and is loading a new internal state + * ERROR - The device has failed and must be reset + * + * The FSM takes actions on the arcs between FSM states. The driver implements + * the following behavior for the FSM arcs: + * + * RUNNING -> STOP + * STOP_COPY -> STOP + * While in STOP the device must stop the operation of the device. The + * device must not generate interrupts, DMA, or advance its internal + * state. When stopped the device and kernel migration driver must accept + * and respond to interaction to support external subsystems in the STOP + * state, for example PCI MSI-X and PCI config pace. Failure by the user to + * restrict device access while in STOP must not result in error conditions + * outside the user context (ex. host system faults). + * + * The STOP_COPY arc will terminate a data transfer session. + * + * RESUMING -> STOP + * Leaving RESUMING terminates a data transfer session and indicates the + * device should complete processing of the data delivered by write(). The + * kernel migration driver should complete the incorporation of data written + * to the data transfer FD into the device internal state and perform + * final validity and consistency checking of the new device state. If the + * user provided data is found to be incomplete, inconsistent, or otherwise + * invalid, the migration driver must fail the SET_STATE ioctl and + * optionally go to the ERROR state as described below. + * + * While in STOP the device has the same behavior as other STOP states + * described above. + * + * To abort a RESUMING session the device must be reset. + * + * STOP -> RUNNING + * While in RUNNING the device is fully operational, the device may generate + * interrupts, DMA, respond to MMIO, all vfio device regions are functional, + * and the device may advance its internal state. + * + * STOP -> STOP_COPY + * This arc begin the process of saving the device state and will return a + * new data_fd. + * + * While in the STOP_COPY state the device has the same behavior as STOP + * with the addition that the data transfers session continues to stream the + * migration state. End of stream on the FD indicates the entire device + * state has been transferred. + * + * The user should take steps to restrict access to vfio device regions while + * the device is in STOP_COPY or risk corruption of the device migration data + * stream. + * + * STOP -> RESUMING + * Entering the RESUMING state starts a process of restoring the device + * state and will return a new data_fd. The data stream fed into the data_fd + * should be taken from the data transfer output of the saving group states + * from a compatible device. The migration driver may alter/reset the + * internal device state for this arc if required to prepare the device to + * receive the migration data. + * + * any -> ERROR + * ERROR cannot be specified as a device state, however any transition request + * can be failed with an errno return and may then move the device_state into + * ERROR. In this case the device was unable to execute the requested arc and + * was also unable to restore the device to any valid device_state. The ERROR + * state will be returned as described below in VFIO_DEVICE_MIG_SET_STATE. To + * recover from ERROR VFIO_DEVICE_RESET must be used to return the + * device_state back to RUNNING. + * + * The remaining possible transitions are interpreted as combinations of the + * above FSM arcs. As there are multiple paths through the FSM arcs the path + * should be selected based on the following rules: + * - Select the shortest path. + * Refer to vfio_mig_get_next_state() for the result of the algorithm. + * + * The automatic transit through the FSM arcs that make up the combination + * transition is invisible to the user. When working with combination arcs the + * user may see any step along the path in the device_state if SET_STATE + * fails. When handling these types of errors users should anticipate future + * revisions of this protocol using new states and those states becoming + * visible in this case. + */ +enum vfio_device_mig_state { + VFIO_DEVICE_STATE_ERROR = 0, + VFIO_DEVICE_STATE_STOP = 1, + VFIO_DEVICE_STATE_RUNNING = 2, + VFIO_DEVICE_STATE_STOP_COPY = 3, + VFIO_DEVICE_STATE_RESUMING = 4, +}; + +/** + * VFIO_DEVICE_MIG_SET_STATE - _IO(VFIO_TYPE, VFIO_BASE + 21) + * + * Execute a migration state change command on the VFIO device. The new state is + * supplied in device_state. + * + * The kernel migration driver must fully transition the device to the new state + * value before the write(2) operation returns to the user. + * + * The kernel migration driver must not generate asynchronous device state + * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET + * ioctl as described above. + * + * If this function fails and returns -1 then the device_state is updated with + * the current state the device is in. This may be the original operating state + * or some other state along the combination transition path. The user can then + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the + * original state, or attempt to return to some other state such as RUNNING or + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state + * output is not reliable. + * + * If the new_state starts a new data transfer session then the FD associated + * with that session is returned in data_fd. The user is responsible to close + * this FD when it is finished. The user must consider the migration data + * segments carried over the FD to be opaque and non-fungible. During RESUMING, + * the data segments must be written in the same order they came out of the + * saving side FD. + * + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL, + * and take no action. However the device_state will be updated with the current + * value. + * + * Return: 0 on success, -1 and errno set on failure. + */ +struct vfio_device_mig_set_state { + __u32 argsz; + __u32 device_state; + __s32 data_fd; + __u32 flags; +}; + +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21) + /* -------- API for Type1 VFIO IOMMU -------- */ /**