Message ID | 20161128135357.5543-1-cbosdonnat@suse.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> Resume is sometimes silently failing for HVM guests. Getting the
> xc_domain_resume() and libxl__domain_resume_device_model() in the
> reverse order than what is in the suspend code fixes the problem.
>
> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>

I think it would be nice to explain why reversing the order fixes the
problem for you. My guess is because device model needs to be ready when
the guest runs, but I'm not fully convinced by this explanation --
guests should just be trapped in the hypervisor waiting for device model
to come up.

I also CC'ed other people who are more familiar with this area so that
they can provide insight.

Wei.
On 29/11/16 08:34, Wei Liu wrote:
> On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
>> Resume is sometimes silently failing for HVM guests. Getting the
>> xc_domain_resume() and libxl__domain_resume_device_model() in the
>> reverse order than what is in the suspend code fixes the problem.
>>
>> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
>
> I think it would be nice to explain why reversing the order fixes the
> problem for you. My guess is because device model needs to be ready when
> the guest runs, but I'm not fully convinced by this explanation --
> guests should just be trapped in the hypervisor waiting for device model
> to come up.

I'm not completely sure this is true. qemu is in "stopped" state, so it
might be any emulation requests are just silently dropped. In any case
it is just weird to stop qemu in suspend case only after suspending the
domain, but let it continue _after_ resuming the domain. So I'd rather
expect an explanation (not from Cedric) why this should be okay in case
the patch isn't accepted.

> I also CC'ed other people who are more familiar with this area so that
> they can provide insight.

And adding Stefano and Anthony, too.

Juergen
On Tue, 29 Nov 2016, Juergen Gross wrote:
> On 29/11/16 08:34, Wei Liu wrote:
> > On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> >> Resume is sometimes silently failing for HVM guests. Getting the
> >> xc_domain_resume() and libxl__domain_resume_device_model() in the
> >> reverse order than what is in the suspend code fixes the problem.
> >>
> >> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
> >
> > I think it would be nice to explain why reversing the order fixes the
> > problem for you. My guess is because device model needs to be ready when
> > the guest runs, but I'm not fully convinced by this explanation --
> > guests should just be trapped in the hypervisor waiting for device model
> > to come up.
>
> I'm not completely sure this is true. qemu is in "stopped" state, so it
> might be any emulation requests are just silently dropped. In any case
> it is just weird to stop qemu in suspend case only after suspending the
> domain, but let it continue _after_ resuming the domain. So I'd rather
> expect an explanation (not from Cedric) why this should be okay in case
> the patch isn't accepted.

Calling xc_domain_resume before libxl__domain_resume_device_model seems
wrong to me. For example in libxl_domain_unpause we call
libxl__domain_resume_device_model, then xc_domain_unpause. We should get
the DM ready before resuming the VM, right?

TBH I don't know exactly what would happen if an ioreq comes in QEMU
before we send the QMP "cont" command. It could be silently dropped,
causing the issue described above, but it would be nice if somebody
instrumented QEMU with some debug printf to be sure.
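The hypothesis raised in this message can be made concrete with a toy model. The sketch below is not QEMU or libxl code: every name in it (`toy_dm`, `toy_dm_cont`, `toy_resume_guest_first`, `toy_resume_dm_first`) is hypothetical, and it simply assumes that a stopped device model drops requests, which is exactly the point nobody in the thread has verified. Under that assumption it shows why the order of the two resume steps would matter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; it does not reflect real QEMU internals. */
struct toy_dm {
    bool running;   /* false until the "cont"-like step has run */
    int handled;    /* requests serviced while running */
    int dropped;    /* requests that arrived while stopped */
};

/* ASSUMPTION under test: a stopped device model silently drops requests. */
static void toy_dm_request(struct toy_dm *dm)
{
    assert(dm != NULL);
    if (dm->running)
        dm->handled++;
    else
        dm->dropped++;
}

/* Stand-in for resuming the device model. */
static void toy_dm_cont(struct toy_dm *dm)
{
    dm->running = true;
}

/* Stand-in for resuming the guest: it issues one request immediately. */
static void toy_guest_resume(struct toy_dm *dm)
{
    toy_dm_request(dm);
}

/* Guest resumed before the device model; returns the dropped count. */
int toy_resume_guest_first(void)
{
    struct toy_dm dm = { false, 0, 0 };
    toy_guest_resume(&dm);
    toy_dm_cont(&dm);
    return dm.dropped;
}

/* Device model resumed before the guest; returns the dropped count. */
int toy_resume_dm_first(void)
{
    struct toy_dm dm = { false, 0, 0 };
    toy_dm_cont(&dm);
    toy_guest_resume(&dm);
    return dm.dropped;
}
```

In this model the guest-first order loses the early request and the device-model-first order does not. If requests are in fact held until the device model is ready, as suggested earlier in the thread, both orders would behave the same, so the model illustrates the concern without confirming it.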
On Tue, Nov 29, 2016 at 11:15:36AM -0800, Stefano Stabellini wrote:
> On Tue, 29 Nov 2016, Juergen Gross wrote:
> > On 29/11/16 08:34, Wei Liu wrote:
> > > On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> > >> Resume is sometimes silently failing for HVM guests. Getting the
> > >> xc_domain_resume() and libxl__domain_resume_device_model() in the
> > >> reverse order than what is in the suspend code fixes the problem.
> > >>
> > >> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
> > >
> > > I think it would be nice to explain why reversing the order fixes the
> > > problem for you. My guess is because device model needs to be ready when
> > > the guest runs, but I'm not fully convinced by this explanation --
> > > guests should just be trapped in the hypervisor waiting for device model
> > > to come up.
> >
> > I'm not completely sure this is true. qemu is in "stopped" state, so it
> > might be any emulation requests are just silently dropped. In any case
> > it is just weird to stop qemu in suspend case only after suspending the
> > domain, but let it continue _after_ resuming the domain. So I'd rather
> > expect an explanation (not from Cedric) why this should be okay in case
> > the patch isn't accepted.
>
> Calling xc_domain_resume before libxl__domain_resume_device_model seems
> wrong to me. For example in libxl_domain_unpause we call
> libxl__domain_resume_device_model, then xc_domain_unpause. We should get
> the DM ready before resuming the VM, right?
>

Yes, I would think so, too.

I'm inclined to accept this patch. At the end of the day, even if QEMU
doesn't drop requests now, it doesn't mean it will never drop requests
in the future.

Wei.
On Mon, Nov 28, 2016 at 02:53:57PM +0100, Cédric Bosdonnat wrote:
> Resume is sometimes silently failing for HVM guests. Getting the
> xc_domain_resume() and libxl__domain_resume_device_model() in the
> reverse order than what is in the suspend code fixes the problem.
>
> Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>

Acked + applied.

Due to the acceptance of LOG*D patch series, I need to rebase this patch
a bit. Please check the result.

Wei.
diff --git a/tools/libxl/libxl_dom_suspend.c b/tools/libxl/libxl_dom_suspend.c
index 0648919..3e29c01 100644
--- a/tools/libxl/libxl_dom_suspend.c
+++ b/tools/libxl/libxl_dom_suspend.c
@@ -456,12 +456,6 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel)
 {
     int rc = 0;
 
-    if (xc_domain_resume(CTX->xch, domid, suspend_cancel)) {
-        LOGE(ERROR, "xc_domain_resume failed for domain %u", domid);
-        rc = ERROR_FAIL;
-        goto out;
-    }
-
     libxl_domain_type type = libxl__domain_type(gc, domid);
     if (type == LIBXL_DOMAIN_TYPE_INVALID) {
         rc = ERROR_FAIL;
@@ -477,6 +471,12 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel)
         }
     }
 
+    if (xc_domain_resume(CTX->xch, domid, suspend_cancel)) {
+        LOGE(ERROR, "xc_domain_resume failed for domain %u", domid);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
     if (!xs_resume_domain(CTX->xsh, domid)) {
         LOGE(ERROR, "xs_resume_domain failed for domain %u", domid);
         rc = ERROR_FAIL;
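To make the resulting call order easier to see, here is a minimal, self-contained sketch of the control flow after the patch. It is not the libxl source: the three `stub_*` functions only record their names instead of calling the real `libxl__domain_resume_device_model()`, `xc_domain_resume()` and `xs_resume_domain()`, and they all use a "0 means success" convention for simplicity. The body of the HVM branch is not visible in the hunks above, so its shape here (resume the device model, bail out on error) is an assumption. `sketch_domain_resume`, `sketch_record` and `sketch_step_index` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_STEPS 8

/* Trace of the steps executed by the most recent sketch_domain_resume(). */
static const char *sketch_steps[SKETCH_MAX_STEPS];
static int sketch_nsteps;

/* Append a step name to the trace; the stubs always report success (0). */
static int sketch_record(const char *name)
{
    assert(sketch_nsteps < SKETCH_MAX_STEPS);
    sketch_steps[sketch_nsteps++] = name;
    return 0;
}

/* Stand-ins for the real calls named in the diff; they only log. */
static int stub_resume_device_model(void) { return sketch_record("resume_device_model"); }
static int stub_xc_domain_resume(void)    { return sketch_record("xc_domain_resume"); }
static int stub_xs_resume_domain(void)    { return sketch_record("xs_resume_domain"); }

/* Position of a step in the trace, or -1 if it did not run. */
int sketch_step_index(const char *name)
{
    for (int i = 0; i < sketch_nsteps; i++)
        if (strcmp(sketch_steps[i], name) == 0)
            return i;
    return -1;
}

/* Post-patch ordering; is_hvm selects the device-model branch. */
int sketch_domain_resume(int is_hvm)
{
    int rc = 0;

    sketch_nsteps = 0;

    if (is_hvm) {
        /* ASSUMED shape of the branch that the hunks do not show. */
        rc = stub_resume_device_model();
        if (rc)
            goto out;
    }

    /* Moved by the patch: the domain is resumed after the device model. */
    if (stub_xc_domain_resume()) {
        rc = -1;
        goto out;
    }

    if (stub_xs_resume_domain()) {
        rc = -1;
        goto out;
    }

out:
    return rc;
}
```

Calling `sketch_domain_resume(1)` leaves the trace as device model, then domain, then xenstore, which is the order the patch establishes for HVM guests. With `is_hvm` set to 0 the device-model step is skipped and the other two run in the same relative order.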
Resume is sometimes silently failing for HVM guests. Getting the
xc_domain_resume() and libxl__domain_resume_device_model() in the
reverse order than what is in the suspend code fixes the problem.

Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
---
 tools/libxl/libxl_dom_suspend.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)