Message ID | 20230716081541.27900-2-avihaih@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfio/migration: Add P2P support for VFIO migration | expand |
On 7/16/23 10:15, Avihai Horon wrote: > Changing the device state from STOP_COPY to STOP can take time as the > device may need to free resources and do other operations as part of the > transition. Currently, this is done in vfio_save_complete_precopy() and > therefore it is counted in the migration downtime. > > To avoid this, change the device state from STOP_COPY to STOP in > vfio_save_cleanup(), which is called after migration has completed and > thus is not part of migration downtime. > > Signed-off-by: Avihai Horon <avihaih@nvidia.com> Have you tried this series with vGPUs ? If so, are there any improvement to report ? Thanks, C. > --- > hw/vfio/migration.c | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 2674f4bc47..8acd182a8b 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque) > VFIODevice *vbasedev = opaque; > VFIOMigration *migration = vbasedev->migration; > > + /* > + * Changing device state from STOP_COPY to STOP can take time. Do it here, > + * after migration has completed, so it won't increase downtime. > + */ > + if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) { > + /* > + * If setting the device in STOP state fails, the device should be > + * reset. To do so, use ERROR state as a recover state. > + */ > + vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, > + VFIO_DEVICE_STATE_ERROR); > + } > + > g_free(migration->data_buffer); > migration->data_buffer = NULL; > migration->precopy_init_size = 0; > @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) > return ret; > } > > - /* > - * If setting the device in STOP state fails, the device should be reset. > - * To do so, use ERROR state as a recover state. > - */ > - ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, > - VFIO_DEVICE_STATE_ERROR); > trace_vfio_save_complete_precopy(vbasedev->name, ret); > > return ret;
On 30/07/2023 19:25, Cédric Le Goater wrote: > External email: Use caution opening links or attachments > > > On 7/16/23 10:15, Avihai Horon wrote: >> Changing the device state from STOP_COPY to STOP can take time as the >> device may need to free resources and do other operations as part of the >> transition. Currently, this is done in vfio_save_complete_precopy() and >> therefore it is counted in the migration downtime. >> >> To avoid this, change the device state from STOP_COPY to STOP in >> vfio_save_cleanup(), which is called after migration has completed and >> thus is not part of migration downtime. >> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com> > > Have you tried this series with vGPUs ? If so, are there any improvement > to report ? > Not with a vGPU. But I tried it with a ConnectX-7 VF and I could see up to 6% downtime improvement. I mentioned this in the cover letter. Do you want to mention it also in the commit message? Thanks. > >> --- >> hw/vfio/migration.c | 19 +++++++++++++------ >> 1 file changed, 13 insertions(+), 6 deletions(-) >> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c >> index 2674f4bc47..8acd182a8b 100644 >> --- a/hw/vfio/migration.c >> +++ b/hw/vfio/migration.c >> @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque) >> VFIODevice *vbasedev = opaque; >> VFIOMigration *migration = vbasedev->migration; >> >> + /* >> + * Changing device state from STOP_COPY to STOP can take time. >> Do it here, >> + * after migration has completed, so it won't increase downtime. >> + */ >> + if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) { >> + /* >> + * If setting the device in STOP state fails, the device >> should be >> + * reset. To do so, use ERROR state as a recover state. >> + */ >> + vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, >> + VFIO_DEVICE_STATE_ERROR); >> + } >> + >> g_free(migration->data_buffer); >> migration->data_buffer = NULL; >> migration->precopy_init_size = 0; >> @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile >> *f, void *opaque) >> return ret; >> } >> >> - /* >> - * If setting the device in STOP state fails, the device should >> be reset. >> - * To do so, use ERROR state as a recover state. >> - */ >> - ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, >> - VFIO_DEVICE_STATE_ERROR); >> trace_vfio_save_complete_precopy(vbasedev->name, ret); >> >> return ret; >
On 7/31/23 08:32, Avihai Horon wrote: > > On 30/07/2023 19:25, Cédric Le Goater wrote: >> External email: Use caution opening links or attachments >> >> >> On 7/16/23 10:15, Avihai Horon wrote: >>> Changing the device state from STOP_COPY to STOP can take time as the >>> device may need to free resources and do other operations as part of the >>> transition. Currently, this is done in vfio_save_complete_precopy() and >>> therefore it is counted in the migration downtime. >>> >>> To avoid this, change the device state from STOP_COPY to STOP in >>> vfio_save_cleanup(), which is called after migration has completed and >>> thus is not part of migration downtime. >>> >>> Signed-off-by: Avihai Horon <avihaih@nvidia.com> >> >> Have you tried this series with vGPUs ? If so, are there any improvement >> to report ? >> > Not with a vGPU. > But I tried it with a ConnectX-7 VF and I could see up to 6% downtime improvement. > I mentioned this in the cover letter. > Do you want to mention it also in the commit message? This is not necessary. Thanks, C.
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 2674f4bc47..8acd182a8b 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque) VFIODevice *vbasedev = opaque; VFIOMigration *migration = vbasedev->migration; + /* + * Changing device state from STOP_COPY to STOP can take time. Do it here, + * after migration has completed, so it won't increase downtime. + */ + if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) { + /* + * If setting the device in STOP state fails, the device should be + * reset. To do so, use ERROR state as a recover state. + */ + vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, + VFIO_DEVICE_STATE_ERROR); + } + g_free(migration->data_buffer); migration->data_buffer = NULL; migration->precopy_init_size = 0; @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } - /* - * If setting the device in STOP state fails, the device should be reset. - * To do so, use ERROR state as a recover state. - */ - ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP, - VFIO_DEVICE_STATE_ERROR); trace_vfio_save_complete_precopy(vbasedev->name, ret); return ret;
Changing the device state from STOP_COPY to STOP can take time as the device may need to free resources and do other operations as part of the transition. Currently, this is done in vfio_save_complete_precopy() and therefore it is counted in the migration downtime. To avoid this, change the device state from STOP_COPY to STOP in vfio_save_cleanup(), which is called after migration has completed and thus is not part of migration downtime. Signed-off-by: Avihai Horon <avihaih@nvidia.com> --- hw/vfio/migration.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-)