Message ID | 20190729154112.7644-1-ross.lagerwall@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Series | xen: Avoid calling device suspend/resume callbacks |
On 29.07.2019 17:41, Ross Lagerwall wrote:
> When suspending/resuming or migrating under Xen, there isn't much need
> for suspending and resuming all the attached devices since Xen/QEMU
> should correctly maintain the hardware state. Drop these calls and
> replace with more specific calls to ensure Xen frontend devices are
> properly reconnected.

Is this true for the general pass-through case as well? While migration
may not be (fully) compatible with pass-through, iirc save/restore is.
Would qemu restore state of physical PCI devices?

Jan
On 29/07/2019 16:57, Jan Beulich wrote:
> On 29.07.2019 17:41, Ross Lagerwall wrote:
>> When suspending/resuming or migrating under Xen, there isn't much need
>> for suspending and resuming all the attached devices since Xen/QEMU
>> should correctly maintain the hardware state. Drop these calls and
>> replace with more specific calls to ensure Xen frontend devices are
>> properly reconnected.
> Is this true for the general pass-through case as well? While migration
> may not be (fully) compatible with pass-through, iirc save/restore is.

What gives you this impression?

Migration and save/restore are *literally* the same thing, except that
in one case you're piping the data to/from disk, and in the other you're
piping it to the destination and restoring it immediately.

If you look at the toolstack code, it is all in terms of reading/writing
an fd (including libxl's API) which is either a network socket or a
file, as chosen by xl.

> Would qemu restore state of physical PCI devices?

What state would Qemu be in a position to know about, which isn't
already present in Qemu's datablob?

What we do with graphics cards is to merge Xen's logdirty bitmap with a
dirty list provided by the card itself. This needs device-specific
knowledge. In addition, there is an opaque blob of data produced by the
source card, which is handed to the destination card. That also lives
in the stream.

Intel's Scalable IOV spec is attempting to rationalise this by having a
standard way of getting logdirty and "internal state" information out
of a device, but for the moment, it requires custom device-driver
specific code to do anything migration related with real hardware.

As for why it's safe to do it like this, the best argument is that this
is how all other vendors do migration, including KVM. Xen is the
odd-one-out using the full S3 path.

~Andrew
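To illustrate the fd-based toolstack interface mentioned above: the sketch
below shows how "save to a file" and "migrate over a socket" can funnel into
the same libxl call, with only the origin of the file descriptor differing.
The helper is hypothetical and error handling is omitted; the exact
libxl_domain_suspend() signature and available flags can vary between Xen
releases.

#include <stdbool.h>
#include <stdint.h>
#include <libxl.h>

/*
 * Sketch only: both "xl save" and "xl migrate" boil down to writing the
 * domain's state to a file descriptor. Whether fd is a regular file or a
 * connected network socket is invisible at this layer.
 */
static int save_or_migrate(libxl_ctx *ctx, uint32_t domid, int fd, bool live)
{
	int flags = live ? LIBXL_SUSPEND_LIVE : 0;

	/* Memory pages, vCPU context and the device-model record all go to fd. */
	return libxl_domain_suspend(ctx, domid, fd, flags, NULL);
}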
On 29.07.2019 19:06, Andrew Cooper wrote:
> On 29/07/2019 16:57, Jan Beulich wrote:
>> On 29.07.2019 17:41, Ross Lagerwall wrote:
>>> When suspending/resuming or migrating under Xen, there isn't much need
>>> for suspending and resuming all the attached devices since Xen/QEMU
>>> should correctly maintain the hardware state. Drop these calls and
>>> replace with more specific calls to ensure Xen frontend devices are
>>> properly reconnected.
>> Is this true for the general pass-through case as well? While migration
>> may not be (fully) compatible with pass-through, iirc save/restore is.
>
> What gives you this impression?
>
> Migration and save/restore are *literally* the same thing, except that
> in one case you're piping the data to/from disk, and in the other you're
> piping it to the destination and restoring it immediately.
>
> If you look at the toolstack code, it is all in terms of reading/writing
> an fd (including libxl's API) which is either a network socket or a
> file, as chosen by xl.

Sure. The main difference is where the restore happens (by default):
for live migration I expect this to be on a different host, whereas
for a non-live restore I'd expect this to be the same host. And it
is only the "same host" case where one can assume the same physical
piece of hardware to be available again for passing through to this
guest. In the "different host" case restore _may_ be possible, using
identical hardware. (And yes, in the "same host" case restore may
also be impossible, if the hardware has meanwhile been assigned to
another guest. But as said, I'm talking about the default case here.)

>> Would qemu restore state of physical PCI devices?
>
> What state would Qemu be in a position to know about, which isn't
> already present in Qemu's datablob?

That's a valid (rhetorical) question, but it doesn't help answer mine.

> What we do with graphics cards is to merge Xen's logdirty bitmap with a
> dirty list provided by the card itself. This needs device-specific
> knowledge. In addition, there is an opaque blob of data produced by the
> source card, which is handed to the destination card. That also lives
> in the stream.
>
> Intel's Scalable IOV spec is attempting to rationalise this by having a
> standard way of getting logdirty and "internal state" information out
> of a device, but for the moment, it requires custom device-driver
> specific code to do anything migration related with real hardware.

Which isn't very nice, since it doesn't scale well as a model.

> As for why it's safe to do it like this, the best argument is that this
> is how all other vendors do migration, including KVM. Xen is the
> odd-one-out using the full S3 path.

So how do "all other vendors" deal with device-specific state? So
far I was under the impression that dealing with this is precisely
why we use the S3 logic in the kernel.

Jan
On Tue, Jul 30, 2019 at 07:55:02AM +0000, Jan Beulich wrote:
> On 29.07.2019 19:06, Andrew Cooper wrote:
> > On 29/07/2019 16:57, Jan Beulich wrote:
> >> On 29.07.2019 17:41, Ross Lagerwall wrote:
> >>> When suspending/resuming or migrating under Xen, there isn't much need
> >>> for suspending and resuming all the attached devices since Xen/QEMU
> >>> should correctly maintain the hardware state. Drop these calls and
> >>> replace with more specific calls to ensure Xen frontend devices are
> >>> properly reconnected.
> >> Is this true for the general pass-through case as well? While migration
> >> may not be (fully) compatible with pass-through, iirc save/restore is.
> >
> > What gives you this impression?
> >
> > Migration and save/restore are *literally* the same thing, except that
> > in one case you're piping the data to/from disk, and in the other you're
> > piping it to the destination and restoring it immediately.
> >
> > If you look at the toolstack code, it is all in terms of reading/writing
> > an fd (including libxl's API) which is either a network socket or a
> > file, as chosen by xl.
>
> Sure. The main difference is where the restore happens (by default):
> for live migration I expect this to be on a different host, whereas
> for a non-live restore I'd expect this to be the same host. And it
> is only the "same host" case where one can assume the same physical
> piece of hardware to be available again for passing through to this
> guest. In the "different host" case restore _may_ be possible, using
> identical hardware. (And yes, in the "same host" case restore may
> also be impossible, if the hardware has meanwhile been assigned to
> another guest. But as said, I'm talking about the default case here.)
>
> >> Would qemu restore state of physical PCI devices?
> >
> > What state would Qemu be in a position to know about, which isn't
> > already present in Qemu's datablob?
>
> That's a valid (rhetorical) question, but it doesn't help answer mine.
>
> > What we do with graphics cards is to merge Xen's logdirty bitmap with a
> > dirty list provided by the card itself. This needs device-specific
> > knowledge. In addition, there is an opaque blob of data produced by the
> > source card, which is handed to the destination card. That also lives
> > in the stream.
> >
> > Intel's Scalable IOV spec is attempting to rationalise this by having a
> > standard way of getting logdirty and "internal state" information out
> > of a device, but for the moment, it requires custom device-driver
> > specific code to do anything migration related with real hardware.
>
> Which isn't very nice, since it doesn't scale well as a model.
>
> > As for why it's safe to do it like this, the best argument is that this
> > is how all other vendors do migration, including KVM. Xen is the
> > odd-one-out using the full S3 path.
>
> So how do "all other vendors" deal with device-specific state? So
> far I was under the impression that dealing with this is precisely
> why we use the S3 logic in the kernel.

FWIW in Qubes we specifically S3 domUs with PCI devices assigned just
before host S3 - to let the driver save the state and later restore it.
AFAIR in some cases (but it might be PV, not HVM then) not doing so was
breaking host S3 altogether.
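For readers less familiar with the kernel side being discussed: the
dpm_suspend_start()/dpm_resume_end() calls that this patch drops walk every
registered device and invoke its power-management callbacks. Below is a
minimal, purely illustrative sketch of such callbacks for a PCI driver; the
driver and function names are made up, but SIMPLE_DEV_PM_OPS and the
dev_pm_ops mechanism are the standard Linux PM interface used on the S3 path.

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/pm.h>

/* Hypothetical driver: these are the per-device callbacks that the generic
 * dpm_* calls removed by this patch would have run on suspend/resume. */
static int example_suspend(struct device *dev)
{
	/* Save device-specific state so the hardware can be reprogrammed later. */
	return 0;
}

static int example_resume(struct device *dev)
{
	/* Restore the hardware from the state saved in example_suspend(). */
	return 0;
}

static SIMPLE_DEV_PM_OPS(example_pm_ops, example_suspend, example_resume);

static struct pci_driver example_driver = {
	.name = "example-passthrough",
	/* .id_table, .probe and .remove omitted from this sketch */
	.driver.pm = &example_pm_ops,
};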
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..53768e0e2560 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -113,21 +113,12 @@ static void do_suspend(void)
 		goto out_thaw;
 	}
 
-	err = dpm_suspend_start(PMSG_FREEZE);
-	if (err) {
-		pr_err("%s: dpm_suspend_start %d\n", __func__, err);
-		goto out_thaw;
-	}
+	xenbus_suspend_frontends();
 
 	printk(KERN_DEBUG "suspending xenstore...\n");
 	xs_suspend();
 
-	err = dpm_suspend_end(PMSG_FREEZE);
-	if (err) {
-		pr_err("dpm_suspend_end failed: %d\n", err);
-		si.cancelled = 0;
-		goto out_resume;
-	}
+	suspend_device_irqs();
 
 	xen_arch_suspend();
 
@@ -141,7 +132,7 @@ static void do_suspend(void)
 
 	raw_notifier_call_chain(&xen_resume_notifier, 0, NULL);
 
-	dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+	resume_device_irqs();
 
 	if (err) {
 		pr_err("failed to start xen_suspend: %d\n", err);
@@ -150,13 +141,12 @@ static void do_suspend(void)
 
 	xen_arch_resume();
 
-out_resume:
-	if (!si.cancelled)
+	if (!si.cancelled) {
 		xs_resume();
-	else
+		xenbus_resume_frontends();
+	} else {
 		xs_suspend_cancel();
-
-	dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+	}
 
 out_thaw:
 	thaw_processes();
diff --git a/drivers/xen/xenbus/xenbus_probe_frontend.c b/drivers/xen/xenbus/xenbus_probe_frontend.c
index a7d90a719cea..8cd836c402e1 100644
--- a/drivers/xen/xenbus/xenbus_probe_frontend.c
+++ b/drivers/xen/xenbus/xenbus_probe_frontend.c
@@ -153,6 +153,28 @@ static struct xen_bus_type xenbus_frontend = {
 	},
 };
 
+static int xenbus_suspend_one(struct device *dev, void *data)
+{
+	xenbus_dev_suspend(dev);
+	return 0;
+}
+
+void xenbus_suspend_frontends(void)
+{
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, xenbus_suspend_one);
+}
+
+static int xenbus_resume_one(struct device *dev, void *data)
+{
+	xenbus_frontend_dev_resume(dev);
+	return 0;
+}
+
+void xenbus_resume_frontends(void)
+{
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, xenbus_resume_one);
+}
+
 static void frontend_changed(struct xenbus_watch *watch,
 			     const char *path, const char *token)
 {
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 869c816d5f8c..71eeb442c375 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -233,4 +233,7 @@ extern const struct file_operations xen_xenbus_fops;
 extern struct xenstore_domain_interface *xen_store_interface;
 extern int xen_store_evtchn;
 
+void xenbus_suspend_frontends(void);
+void xenbus_resume_frontends(void);
+
 #endif /* _XEN_XENBUS_H */
When suspending/resuming or migrating under Xen, there isn't much need
for suspending and resuming all the attached devices since Xen/QEMU
should correctly maintain the hardware state. Drop these calls and
replace with more specific calls to ensure Xen frontend devices are
properly reconnected.

This change is needed to make NVIDIA vGPU migration work under Xen,
since suspending the vGPU device interferes with that working
correctly. It also has the added benefit of reducing migration
downtime - by approximately 500ms with an HVM guest in my environment.

Tested by putting an HVM guest through 1000 migration cycles. I also
tested PV guest migration (though less rigorously).

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 drivers/xen/manage.c                       | 24 +++++++---------------
 drivers/xen/xenbus/xenbus_probe_frontend.c | 22 ++++++++++++++++++++
 include/xen/xenbus.h                       |  3 +++
 3 files changed, 32 insertions(+), 17 deletions(-)
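For orientation, here is a rough sketch of the do_suspend() flow in
drivers/xen/manage.c with this patch applied. It is pieced together from the
hunks above, with error handling and unrelated surrounding code elided
(marked by comments), so it is not the literal resulting function.

static void do_suspend(void)
{
	int err;
	struct suspend_info si;

	/* ... freeze processes and kernel threads (error paths elided) ... */

	/* Was dpm_suspend_start(PMSG_FREEZE) over all devices; now only the
	 * xenbus frontend devices are quiesced. */
	xenbus_suspend_frontends();

	printk(KERN_DEBUG "suspending xenstore...\n");
	xs_suspend();

	/* Was dpm_suspend_end(PMSG_FREEZE). */
	suspend_device_irqs();

	xen_arch_suspend();

	/* ... stop_machine(xen_suspend, ...) performs the actual suspend,
	 * setting si.cancelled and err ... */

	raw_notifier_call_chain(&xen_resume_notifier, 0, NULL);

	/* Was dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE). */
	resume_device_irqs();

	/* ... error reporting for a failed xen_suspend ... */

	xen_arch_resume();

	if (!si.cancelled) {
		xs_resume();
		/* Was dpm_resume_end(); only xenbus frontends get reconnected. */
		xenbus_resume_frontends();
	} else {
		xs_suspend_cancel();
	}

out_thaw:
	thaw_processes();
	/* ... */
}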