Message ID | 20190729154112.7644-1-ross.lagerwall@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Series | xen: Avoid calling device suspend/resume callbacks |
On 29.07.2019 17:41, Ross Lagerwall wrote:
> When suspending/resuming or migrating under Xen, there isn't much need
> for suspending and resuming all the attached devices since Xen/QEMU
> should correctly maintain the hardware state. Drop these calls and
> replace with more specific calls to ensure Xen frontend devices are
> properly reconnected.

Is this true for the general pass-through case as well? While migration
may not be (fully) compatible with pass-through, iirc save/restore is.
Would qemu restore state of physical PCI devices?

Jan
On 29/07/2019 16:57, Jan Beulich wrote:
> On 29.07.2019 17:41, Ross Lagerwall wrote:
>> When suspending/resuming or migrating under Xen, there isn't much need
>> for suspending and resuming all the attached devices since Xen/QEMU
>> should correctly maintain the hardware state. Drop these calls and
>> replace with more specific calls to ensure Xen frontend devices are
>> properly reconnected.
> Is this true for the general pass-through case as well? While migration
> may not be (fully) compatible with pass-through, iirc save/restore is.

What gives you this impression?

Migration and save/restore are *literally* the same thing, except that
in one case you're piping the data to/from disk, and in the other you're
piping it to the destination and restoring it immediately.

If you look at the toolstack code, it is all in terms of reading/writing
an fd (including libxl's API) which is either a network socket or a
file, as chosen by xl.

> Would qemu restore state of physical PCI devices?

What state would Qemu be in a position to know about, which isn't
already present in Qemu's datablob?

What we do with graphics cards is to merge Xen's logdirty bitmap with a
dirty list provided by the card itself. This needs device-specific
knowledge. In addition, there is an opaque blob of data produced by the
source card, which is handed to the destination card. That also lives
in the stream.

Intel's Scalable IOV spec is attempting to rationalise this by having a
standard way of getting logdirty and "internal state" information out
of a device, but for the moment, it requires custom device-driver
specific code to do anything migration related with real hardware.

As for why it's safe to do it like this, the best argument is that this
is how all other vendors do migration, including KVM. Xen is the
odd-one-out using the full S3 path.

~Andrew
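To illustrate the fd-based toolstack interface mentioned above: the sketch
below shows how "save to a file" and "migrate over a socket" can funnel into
the same libxl call, with only the origin of the file descriptor differing.
The helper is hypothetical and error handling is omitted; the exact
libxl_domain_suspend() signature and available flags can vary between Xen
releases.

#include <stdbool.h>
#include <stdint.h>
#include <libxl.h>

/*
 * Sketch only: both "xl save" and "xl migrate" boil down to writing the
 * domain's state to a file descriptor. Whether fd is a regular file or a
 * connected network socket is invisible at this layer.
 */
static int save_or_migrate(libxl_ctx *ctx, uint32_t domid, int fd, bool live)
{
	int flags = live ? LIBXL_SUSPEND_LIVE : 0;

	/* Memory pages, vCPU context and the device-model record all go to fd. */
	return libxl_domain_suspend(ctx, domid, fd, flags, NULL);
}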
On 29.07.2019 19:06, Andrew Cooper wrote:
> On 29/07/2019 16:57, Jan Beulich wrote:
>> On 29.07.2019 17:41, Ross Lagerwall wrote:
>>> When suspending/resuming or migrating under Xen, there isn't much need
>>> for suspending and resuming all the attached devices since Xen/QEMU
>>> should correctly maintain the hardware state. Drop these calls and
>>> replace with more specific calls to ensure Xen frontend devices are
>>> properly reconnected.
>> Is this true for the general pass-through case as well? While migration
>> may not be (fully) compatible with pass-through, iirc save/restore is.
>
> What gives you this impression?
>
> Migration and save/restore are *literally* the same thing, except that
> in one case you're piping the data to/from disk, and in the other you're
> piping it to the destination and restoring it immediately.
>
> If you look at the toolstack code, it is all in terms of reading/writing
> an fd (including libxl's API) which is either a network socket or a
> file, as chosen by xl.

Sure. The main difference is where the restore happens (by default):
for live migration I expect this to be on a different host, whereas
for a non-live restore I'd expect this to be the same host. And it
is only the "same host" case where one can assume the same physical
piece of hardware to be available again for passing through to this
guest. In the "different host" case restore _may_ be possible, using
identical hardware. (And yes, in the "same host" case restore may
also be impossible, if the hardware has meanwhile been assigned to
another guest. But as said, I'm talking about the default case here.)

>> Would qemu restore state of physical PCI devices?
>
> What state would Qemu be in a position to know about, which isn't
> already present in Qemu's datablob?

That's a valid (rhetorical) question, but it doesn't help answer mine.

> What we do with graphics cards is to merge Xen's logdirty bitmap with a
> dirty list provided by the card itself. This needs device-specific
> knowledge. In addition, there is an opaque blob of data produced by the
> source card, which is handed to the destination card. That also lives
> in the stream.
>
> Intel's Scalable IOV spec is attempting to rationalise this by having a
> standard way of getting logdirty and "internal state" information out
> of a device, but for the moment, it requires custom device-driver
> specific code to do anything migration related with real hardware.

Which isn't very nice, since it doesn't scale well as a model.

> As for why it's safe to do it like this, the best argument is that this
> is how all other vendors do migration, including KVM. Xen is the
> odd-one-out using the full S3 path.

So how do "all other vendors" deal with device-specific state? So
far I was under the impression that dealing with this is precisely
why we use the S3 logic in the kernel.

Jan
On Tue, Jul 30, 2019 at 07:55:02AM +0000, Jan Beulich wrote:
> On 29.07.2019 19:06, Andrew Cooper wrote:
> > On 29/07/2019 16:57, Jan Beulich wrote:
> >> On 29.07.2019 17:41, Ross Lagerwall wrote:
> >>> When suspending/resuming or migrating under Xen, there isn't much need
> >>> for suspending and resuming all the attached devices since Xen/QEMU
> >>> should correctly maintain the hardware state. Drop these calls and
> >>> replace with more specific calls to ensure Xen frontend devices are
> >>> properly reconnected.
> >> Is this true for the general pass-through case as well? While migration
> >> may not be (fully) compatible with pass-through, iirc save/restore is.
> >
> > What gives you this impression?
> >
> > Migration and save/restore are *literally* the same thing, except that
> > in one case you're piping the data to/from disk, and in the other you're
> > piping it to the destination and restoring it immediately.
> >
> > If you look at the toolstack code, it is all in terms of reading/writing
> > an fd (including libxl's API) which is either a network socket or a
> > file, as chosen by xl.
>
> Sure. The main difference is where the restore happens (by default):
> for live migration I expect this to be on a different host, whereas
> for a non-live restore I'd expect this to be the same host. And it
> is only the "same host" case where one can assume the same physical
> piece of hardware to be available again for passing through to this
> guest. In the "different host" case restore _may_ be possible, using
> identical hardware. (And yes, in the "same host" case restore may
> also be impossible, if the hardware has meanwhile been assigned to
> another guest. But as said, I'm talking about the default case here.)
>
> >> Would qemu restore state of physical PCI devices?
> >
> > What state would Qemu be in a position to know about, which isn't
> > already present in Qemu's datablob?
>
> That's a valid (rhetorical) question, but it doesn't help answer mine.
>
> > What we do with graphics cards is to merge Xen's logdirty bitmap with a
> > dirty list provided by the card itself. This needs device-specific
> > knowledge. In addition, there is an opaque blob of data produced by the
> > source card, which is handed to the destination card. That also lives
> > in the stream.
> >
> > Intel's Scalable IOV spec is attempting to rationalise this by having a
> > standard way of getting logdirty and "internal state" information out
> > of a device, but for the moment, it requires custom device-driver
> > specific code to do anything migration related with real hardware.
>
> Which isn't very nice, since it doesn't scale well as a model.
>
> > As for why it's safe to do it like this, the best argument is that this
> > is how all other vendors do migration, including KVM. Xen is the
> > odd-one-out using the full S3 path.
>
> So how do "all other vendors" deal with device-specific state? So
> far I was under the impression that dealing with this is precisely
> why we use the S3 logic in the kernel.

FWIW in Qubes we specifically S3 domUs with PCI devices assigned just
before host S3 - to let the driver save the state and later restore it.
AFAIR in some cases (but it might be PV, not HVM then) not doing so was
breaking host S3 altogether.
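For readers less familiar with the kernel side being discussed: the
dpm_suspend_start()/dpm_resume_end() calls that this patch drops walk every
registered device and invoke its power-management callbacks. Below is a
minimal, purely illustrative sketch of such callbacks for a PCI driver; the
driver and function names are made up, but SIMPLE_DEV_PM_OPS and the
dev_pm_ops mechanism are the standard Linux PM interface used on the S3 path.

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/pm.h>

/* Hypothetical driver: these are the per-device callbacks that the generic
 * dpm_* calls removed by this patch would have run on suspend/resume. */
static int example_suspend(struct device *dev)
{
	/* Save device-specific state so the hardware can be reprogrammed later. */
	return 0;
}

static int example_resume(struct device *dev)
{
	/* Restore the hardware from the state saved in example_suspend(). */
	return 0;
}

static SIMPLE_DEV_PM_OPS(example_pm_ops, example_suspend, example_resume);

static struct pci_driver example_driver = {
	.name = "example-passthrough",
	/* .id_table, .probe and .remove omitted from this sketch */
	.driver.pm = &example_pm_ops,
};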
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..53768e0e2560 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -113,21 +113,12 @@ static void do_suspend(void)
 		goto out_thaw;
 	}
 
-	err = dpm_suspend_start(PMSG_FREEZE);
-	if (err) {
-		pr_err("%s: dpm_suspend_start %d\n", __func__, err);
-		goto out_thaw;
-	}
+	xenbus_suspend_frontends();
 
 	printk(KERN_DEBUG "suspending xenstore...\n");
 	xs_suspend();
 
-	err = dpm_suspend_end(PMSG_FREEZE);
-	if (err) {
-		pr_err("dpm_suspend_end failed: %d\n", err);
-		si.cancelled = 0;
-		goto out_resume;
-	}
+	suspend_device_irqs();
 
 	xen_arch_suspend();
 
@@ -141,7 +132,7 @@ static void do_suspend(void)
 
 	raw_notifier_call_chain(&xen_resume_notifier, 0, NULL);
 
-	dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+	resume_device_irqs();
 
 	if (err) {
 		pr_err("failed to start xen_suspend: %d\n", err);
@@ -150,13 +141,12 @@ static void do_suspend(void)
 
 	xen_arch_resume();
 
-out_resume:
-	if (!si.cancelled)
+	if (!si.cancelled) {
 		xs_resume();
-	else
+		xenbus_resume_frontends();
+	} else {
 		xs_suspend_cancel();
-
-	dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+	}
 
 out_thaw:
 	thaw_processes();
diff --git a/drivers/xen/xenbus/xenbus_probe_frontend.c b/drivers/xen/xenbus/xenbus_probe_frontend.c
index a7d90a719cea..8cd836c402e1 100644
--- a/drivers/xen/xenbus/xenbus_probe_frontend.c
+++ b/drivers/xen/xenbus/xenbus_probe_frontend.c
@@ -153,6 +153,28 @@ static struct xen_bus_type xenbus_frontend = {
 	},
 };
 
+static int xenbus_suspend_one(struct device *dev, void *data)
+{
+	xenbus_dev_suspend(dev);
+	return 0;
+}
+
+void xenbus_suspend_frontends(void)
+{
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, xenbus_suspend_one);
+}
+
+static int xenbus_resume_one(struct device *dev, void *data)
+{
+	xenbus_frontend_dev_resume(dev);
+	return 0;
+}
+
+void xenbus_resume_frontends(void)
+{
+	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, xenbus_resume_one);
+}
+
 static void frontend_changed(struct xenbus_watch *watch,
 			     const char *path, const char *token)
 {
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 869c816d5f8c..71eeb442c375 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -233,4 +233,7 @@ extern const struct file_operations xen_xenbus_fops;
 extern struct xenstore_domain_interface *xen_store_interface;
 extern int xen_store_evtchn;
 
+void xenbus_suspend_frontends(void);
+void xenbus_resume_frontends(void);
+
 #endif /* _XEN_XENBUS_H */
When suspending/resuming or migrating under Xen, there isn't much need
for suspending and resuming all the attached devices since Xen/QEMU
should correctly maintain the hardware state. Drop these calls and
replace with more specific calls to ensure Xen frontend devices are
properly reconnected.

This change is needed to make NVIDIA vGPU migration work under Xen,
since suspending the vGPU device interferes with that working
correctly. It also has the added benefit of reducing migration
downtime - by approximately 500ms with an HVM guest in my environment.

Tested by putting an HVM guest through 1000 migration cycles. I also
tested PV guest migration (though less rigorously).

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 drivers/xen/manage.c                       | 24 +++++++---------------
 drivers/xen/xenbus/xenbus_probe_frontend.c | 22 ++++++++++++++++++++
 include/xen/xenbus.h                       |  3 +++
 3 files changed, 32 insertions(+), 17 deletions(-)
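For orientation, here is a rough sketch of the do_suspend() flow in
drivers/xen/manage.c with this patch applied. It is pieced together from the
hunks above, with error handling and unrelated surrounding code elided
(marked by comments), so it is not the literal resulting function.

static void do_suspend(void)
{
	int err;
	struct suspend_info si;

	/* ... freeze processes and kernel threads (error paths elided) ... */

	/* Was dpm_suspend_start(PMSG_FREEZE) over all devices; now only the
	 * xenbus frontend devices are quiesced. */
	xenbus_suspend_frontends();

	printk(KERN_DEBUG "suspending xenstore...\n");
	xs_suspend();

	/* Was dpm_suspend_end(PMSG_FREEZE). */
	suspend_device_irqs();

	xen_arch_suspend();

	/* ... stop_machine(xen_suspend, ...) performs the actual suspend,
	 * setting si.cancelled and err ... */

	raw_notifier_call_chain(&xen_resume_notifier, 0, NULL);

	/* Was dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE). */
	resume_device_irqs();

	/* ... error reporting for a failed xen_suspend ... */

	xen_arch_resume();

	if (!si.cancelled) {
		xs_resume();
		/* Was dpm_resume_end(); only xenbus frontends get reconnected. */
		xenbus_resume_frontends();
	} else {
		xs_suspend_cancel();
	}

out_thaw:
	thaw_processes();
	/* ... */
}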