Message ID | 20190807112024.19480-1-elnikety@amazon.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | evtchn: make support for different ABIs tunable | expand |
On 07/08/2019 12:20, Eslam Elnikety wrote: > Adding support for FIFO event channel ABI was first introduced in Xen 4.4 > (see 88910061ec6). Make this support tunable, since the choice of which > event channel ABI has implications for hibernation. Consider resuming a > pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO > ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen > and the resumed kernel talking past each other (due to different protocols > FIFO vs 2L). I'm afraid I don't follow. We have a Linux kernel which knows about FIFO, which was first booted on Xen < 4.4, so configured 2L mode. It is then suspended, and resumed on a newer Xen >= 4.4. The guest now has a choice between 2L mode, and FIFO mode. What is the problem? When resuming, the guest in question should continue to use 2L mode, because that is what it was using previously. ~Andrew
On Wed, 2019-08-07 at 12:40 +0100, Andrew Cooper wrote: > On 07/08/2019 12:20, Eslam Elnikety wrote: > > Adding support for FIFO event channel ABI was first introduced in Xen 4.4 > > (see 88910061ec6). Make this support tunable, since the choice of which > > event channel ABI has implications for hibernation. Consider resuming a > > pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO > > ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen > > and the resumed kernel talking past each other (due to different protocols > > FIFO vs 2L). > > I'm afraid I don't follow. > > We have a Linux kernel which knows about FIFO, which was first booted on > Xen < 4.4, so configured 2L mode. > > It is then suspended, and resumed on a newer Xen >= 4.4. The guest now > has a choice between 2L mode, and FIFO mode. > > What is the problem? > > When resuming, the guest in question should continue to use 2L mode, > because that is what it was using previously. On resume, the guest first boots a Linux kernel (the 'boot' kernel). That kernel then loads the previous state (the 'resumed' kernel) from disk and then transfers control to it. I believe the problem occurs when the boot kernel sees and enables FIFO mode, then transfers control to the resumed kernel which is expecting 2L. I wonder if treating it more like a kexec and doing SHUTDOWN_soft_reset in the handover might be a viable long-term approach (and deal with other KASLR-related problems). Not that soft reset currently resets this AFAICT at a quick glance, but maybe it should? In the meantime though, hiding 2L mode for guests which were first booted without it is a simple option which doesn't force guests to upgrade. Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.
> On 7. Aug 2019, at 13:40, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > On 07/08/2019 12:20, Eslam Elnikety wrote: >> Adding support for FIFO event channel ABI was first introduced in Xen 4.4 >> (see 88910061ec6). Make this support tunable, since the choice of which >> event channel ABI has implications for hibernation. Consider resuming a >> pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO >> ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen >> and the resumed kernel talking past each other (due to different protocols >> FIFO vs 2L). > > I'm afraid I don't follow. > > We have a Linux kernel which knows about FIFO, which was first booted on > Xen < 4.4, so configured 2L mode. > > It is then suspended, and resumed on a newer Xen >= 4.4. The guest now > has a choice between 2L mode, and FIFO mode. > > What is the problem? > > When resuming, the guest in question should continue to use 2L mode, > because that is what it was using previously. > > ~Andrew After resuming (i.e., Linux's software_resume), the guest will indeed continue to use 2L. However, Xen has already done evtchn_fifo_init_control as part of the boot kernel init (before the guest's software_resume). Then, we reach the point where guest assumes 2L and Xen assumes FIFO.
On 07.08.2019 14:04, Woodhouse, David wrote: > On Wed, 2019-08-07 at 12:40 +0100, Andrew Cooper wrote: >> On 07/08/2019 12:20, Eslam Elnikety wrote: >>> Adding support for FIFO event channel ABI was first introduced in Xen 4.4 >>> (see 88910061ec6). Make this support tunable, since the choice of which >>> event channel ABI has implications for hibernation. Consider resuming a >>> pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO >>> ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen >>> and the resumed kernel talking past each other (due to different protocols >>> FIFO vs 2L). >> >> I'm afraid I don't follow. >> >> We have a Linux kernel which knows about FIFO, which was first booted on >> Xen < 4.4, so configured 2L mode. >> >> It is then suspended, and resumed on a newer Xen >= 4.4. The guest now >> has a choice between 2L mode, and FIFO mode. >> >> What is the problem? >> >> When resuming, the guest in question should continue to use 2L mode, >> because that is what it was using previously. > > On resume, the guest first boots a Linux kernel (the 'boot' kernel). > That kernel then loads the previous state (the 'resumed' kernel) from > disk and then transfers control to it. > > I believe the problem occurs when the boot kernel sees and enables FIFO > mode, then transfers control to the resumed kernel which is expecting > 2L. > > I wonder if treating it more like a kexec and doing SHUTDOWN_soft_reset > in the handover might be a viable long-term approach (and deal with > other KASLR-related problems). Not that soft reset currently resets > this AFAICT at a quick glance, but maybe it should? Oh, definitely, if it doesn't already. Soft-reset should really do what its name says and put state back to what it was initially. Jan
On 07.08.2019 14:07, Elnikety, Eslam wrote: >> On 7. Aug 2019, at 13:40, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 07/08/2019 12:20, Eslam Elnikety wrote: >>> Adding support for FIFO event channel ABI was first introduced in Xen 4.4 >>> (see 88910061ec6). Make this support tunable, since the choice of which >>> event channel ABI has implications for hibernation. Consider resuming a >>> pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO >>> ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen >>> and the resumed kernel talking past each other (due to different protocols >>> FIFO vs 2L). >> >> I'm afraid I don't follow. >> >> We have a Linux kernel which knows about FIFO, which was first booted on >> Xen < 4.4, so configured 2L mode. >> >> It is then suspended, and resumed on a newer Xen >= 4.4. The guest now >> has a choice between 2L mode, and FIFO mode. >> >> What is the problem? >> >> When resuming, the guest in question should continue to use 2L mode, >> because that is what it was using previously. > > After resuming (i.e., Linux's software_resume), the guest will indeed continue > to use 2L. However, Xen has already done evtchn_fifo_init_control as part of > the boot kernel init (before the guest's software_resume). Then, we reach the > point where guest assumes 2L and Xen assumes FIFO. This involvement of two distinct kernels wasn't obvious at all from the initial posting, despite the use of the terms "guest boot kernel" and "resumed kernel". In any event - isn't this an issue to be solved between the two kernels, without (as far as possible) Xen's involvement, and without restricting guest capabilities? Jan
On 07/08/2019 13:04, Woodhouse, David wrote: > On Wed, 2019-08-07 at 12:40 +0100, Andrew Cooper wrote: >> On 07/08/2019 12:20, Eslam Elnikety wrote: >>> Adding support for FIFO event channel ABI was first introduced in Xen 4.4 >>> (see 88910061ec6). Make this support tunable, since the choice of which >>> event channel ABI has implications for hibernation. Consider resuming a >>> pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO >>> ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen >>> and the resumed kernel talking past each other (due to different protocols >>> FIFO vs 2L). >> I'm afraid I don't follow. >> >> We have a Linux kernel which knows about FIFO, which was first booted on >> Xen < 4.4, so configured 2L mode. >> >> It is then suspended, and resumed on a newer Xen >= 4.4. The guest now >> has a choice between 2L mode, and FIFO mode. >> >> What is the problem? >> >> When resuming, the guest in question should continue to use 2L mode, >> because that is what it was using previously. > On resume, the guest first boots a Linux kernel (the 'boot' kernel). > That kernel then loads the previous state (the 'resumed' kernel) from > disk and then transfers control to it. Right, so the 'boot' kernel is setting up a mode, which the on-disk kernel/state can't cope with. > I believe the problem occurs when the boot kernel sees and enables FIFO > mode, then transfers control to the resumed kernel which is expecting > 2L. And I presume the underlying mess here is because virtual subsystems aren't saved/restored like most other devices? Ultimately, this looks like a Linux bug, but in the principle of "the virtual environment a guest sees should be constant for its logical lifetime", restricting the visible ABI probably is right thing to do. ~Andrew
On 07/08/2019 13:07, Elnikety, Eslam wrote: >> On 7. Aug 2019, at 13:40, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> >> On 07/08/2019 12:20, Eslam Elnikety wrote: >>> Adding support for FIFO event channel ABI was first introduced in Xen 4.4 >>> (see 88910061ec6). Make this support tunable, since the choice of which >>> event channel ABI has implications for hibernation. Consider resuming a >>> pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO >>> ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen >>> and the resumed kernel talking past each other (due to different protocols >>> FIFO vs 2L). >> I'm afraid I don't follow. >> >> We have a Linux kernel which knows about FIFO, which was first booted on >> Xen < 4.4, so configured 2L mode. >> >> It is then suspended, and resumed on a newer Xen >= 4.4. The guest now >> has a choice between 2L mode, and FIFO mode. >> >> What is the problem? >> >> When resuming, the guest in question should continue to use 2L mode, >> because that is what it was using previously. >> >> ~Andrew > > After resuming (i.e., Linux's software_resume), the guest will indeed continue to use 2L. However, Xen has already done evtchn_fifo_init_control as part of the boot kernel init (before the guest's software_resume). Then, we reach the point where guest assumes 2L and Xen assumes FIFO. With Davids explanation, I now understand the problem. However for clarity, it is probably worth making a correction here. It isn't Xen which does evtchn_fifo_init_control(). It is the "boot" kernel issuing EVTCHNOP_init_control hypercall which switches the domain from 2L mode into FIFO mode. Xen is doing exactly as it was instructed. The underlying bug is a mismatch in behaviour between the "boot" kernel and the on-disk kernel/state. ~Andrew
> On 7. Aug 2019, at 14:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > On 07/08/2019 13:07, Elnikety, Eslam wrote: >>> On 7. Aug 2019, at 13:40, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> >>> On 07/08/2019 12:20, Eslam Elnikety wrote: >>>> Adding support for FIFO event channel ABI was first introduced in Xen 4.4 >>>> (see 88910061ec6). Make this support tunable, since the choice of which >>>> event channel ABI has implications for hibernation. Consider resuming a >>>> pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO >>>> ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen >>>> and the resumed kernel talking past each other (due to different protocols >>>> FIFO vs 2L). >>> I'm afraid I don't follow. >>> >>> We have a Linux kernel which knows about FIFO, which was first booted on >>> Xen < 4.4, so configured 2L mode. >>> >>> It is then suspended, and resumed on a newer Xen >= 4.4. The guest now >>> has a choice between 2L mode, and FIFO mode. >>> >>> What is the problem? >>> >>> When resuming, the guest in question should continue to use 2L mode, >>> because that is what it was using previously. >>> >>> ~Andrew >> >> After resuming (i.e., Linux's software_resume), the guest will indeed continue to use 2L. However, Xen has already done evtchn_fifo_init_control as part of the boot kernel init (before the guest's software_resume). Then, we reach the point where guest assumes 2L and Xen assumes FIFO. > > With Davids explanation, I now understand the problem. However for > clarity, it is probably worth making a correction here. > > It isn't Xen which does evtchn_fifo_init_control(). It is the "boot" > kernel issuing EVTCHNOP_init_control hypercall which switches the domain > from 2L mode into FIFO mode. > > Xen is doing exactly as it was instructed. The underlying bug is a > mismatch in behaviour between the "boot" kernel and the on-disk > kernel/state. Thanks a lot for the prompt responses, everyone. Yes, Xen is doing the right thing. But, I would not call it a Linux bug either. Hibernate/resume (rightfully) assumes that the underlying HW and the virtual subsystem and its capabilities have not changed during the guest's logical lifetime. The knob introduced in this patch goes along the same lines: allow an administrator to control which event channel ABI support level to expose in order to maintain a particular guest's view of the underlying HW and virtual subsystem. > > ~Andrew
On 7. Aug 2019, at 14:20, Jan Beulich <jbeulich@suse.com<mailto:jbeulich@suse.com>> wrote: On 07.08.2019 14:07, Elnikety, Eslam wrote: On 7. Aug 2019, at 13:40, Andrew Cooper <andrew.cooper3@citrix.com<mailto:andrew.cooper3@citrix.com>> wrote: On 07/08/2019 12:20, Eslam Elnikety wrote: Adding support for FIFO event channel ABI was first introduced in Xen 4.4 (see 88910061ec6). Make this support tunable, since the choice of which event channel ABI has implications for hibernation. Consider resuming a pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen and the resumed kernel talking past each other (due to different protocols FIFO vs 2L). I'm afraid I don't follow. We have a Linux kernel which knows about FIFO, which was first booted on Xen < 4.4, so configured 2L mode. It is then suspended, and resumed on a newer Xen >= 4.4. The guest now has a choice between 2L mode, and FIFO mode. What is the problem? When resuming, the guest in question should continue to use 2L mode, because that is what it was using previously. After resuming (i.e., Linux's software_resume), the guest will indeed continue to use 2L. However, Xen has already done evtchn_fifo_init_control as part of the boot kernel init (before the guest's software_resume). Then, we reach the point where guest assumes 2L and Xen assumes FIFO. This involvement of two distinct kernels wasn't obvious at all from the initial posting, despite the use of the terms "guest boot kernel" and "resumed kernel". In any event - isn't this an issue to be solved between the two kernels, without (as far as possible) Xen's involvement, and without restricting guest capabilities? Jan I think a re-write for the commit message is in order, given that the distinction between boot and resume kernels was not clear. I will do that, along with other changes if needed, subject to the maintainers being happy with the patch at a high level. In principle, we can instruct the boot kernel to not use FIFO. Yet, this will be needed when resuming on Xen >= 4.4, but not needed when resuming on Xen < 4.4. I think this is grounds to introduce the knob. Thanks, Eslam
On 07/08/2019 12:20, Eslam Elnikety wrote: > diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h > index 19486d5e32..654b4fdd22 100644 > --- a/xen/include/public/domctl.h > +++ b/xen/include/public/domctl.h > @@ -64,6 +64,9 @@ struct xen_domctl_createdomain { > /* Is this a xenstore domain? */ > #define _XEN_DOMCTL_CDF_xs_domain 4 > #define XEN_DOMCTL_CDF_xs_domain (1U<<_XEN_DOMCTL_CDF_xs_domain) > + /* Disable FIFO event channels? */ > +#define _XEN_DOMCTL_CDF_disable_fifo 5 > +#define XEN_DOMCTL_CDF_disable_fifo (1U<<_XEN_DOMCTL_CDF_disable_fifo) > uint32_t flags; On the subject of the the patch itself, I think this is broadly the right principle, but wants to be expressed differently. First, you'll want to rebase onto a very recent master, and specifically over c/s d8f2490561eb which has changed how this field is handled in Xen. Furthermore, if there is this problem for event channels, then there is almost certainly a related problem for grant tables. The control in Xen should be expressed in a positive form, or the logic will become a tangle. It should be a bit permitting the use of the FIFO ABI, rather than a bit saying "oh actually, you can't use that". That said, it might be easier to declare FIFO to be "event channel v2", and specify max_{grant,evntchn}_ver instead. I'm open to other suggestions as well. ~Andrew
On 07.08.2019 15:41, Andrew Cooper wrote: > On 07/08/2019 12:20, Eslam Elnikety wrote: >> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h >> index 19486d5e32..654b4fdd22 100644 >> --- a/xen/include/public/domctl.h >> +++ b/xen/include/public/domctl.h >> @@ -64,6 +64,9 @@ struct xen_domctl_createdomain { >> /* Is this a xenstore domain? */ >> #define _XEN_DOMCTL_CDF_xs_domain 4 >> #define XEN_DOMCTL_CDF_xs_domain (1U<<_XEN_DOMCTL_CDF_xs_domain) >> + /* Disable FIFO event channels? */ >> +#define _XEN_DOMCTL_CDF_disable_fifo 5 >> +#define XEN_DOMCTL_CDF_disable_fifo (1U<<_XEN_DOMCTL_CDF_disable_fifo) >> uint32_t flags; > > On the subject of the the patch itself, I think this is broadly the > right principle, but wants to be expressed differently. > > First, you'll want to rebase onto a very recent master, and specifically > over c/s d8f2490561eb which has changed how this field is handled in Xen. > > Furthermore, if there is this problem for event channels, then there is > almost certainly a related problem for grant tables. > > The control in Xen should be expressed in a positive form, or the logic > will become a tangle. It should be a bit permitting the use of the FIFO > ABI, rather than a bit saying "oh actually, you can't use that". > > That said, it might be easier to declare FIFO to be "event channel v2", > and specify max_{grant,evntchn}_ver instead. I'm not sure assuming linear (or actually any) ordering between variants is a good thing. Yes, right now we only have gnttab v1 and v2 and evtchn 2l and fifo, which could be considered v1 and v2 as you suggest. However, assuming a 3rd variant surfaces, why would it be that one has to expose v2 just to make v3 usable? In particular gnttab v2 has various issues (which is why you introduced a way to disable its use in the first place), yet I'd hope we'd end up with a less quirky v3 if one ever becomes necessary. And in turn I'd hope we could hide v2 from any v3 users. IOW I think a bitmap to permit use of "advanced" versions is more future proof. (As a side note, I don't think we want to introduce a disable for the respective v1 interfaces.) Jan
On 07.08.2019 13:20, Eslam Elnikety wrote: > Adding support for FIFO event channel ABI was first introduced in Xen 4.4 > (see 88910061ec6). Make this support tunable, since the choice of which > event channel ABI has implications for hibernation. Consider resuming a > pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO > ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen > and the resumed kernel talking past each other (due to different protocols > FIFO vs 2L). > > In order to announce to guests that the event channel ABI does not support > FIFO, the hypervisor returns ENOSYS on init_control operation. When this > operation fails, the guest should continue to use the 2L event channel ABI. > For example, in Linux drivers/xen/events/events_base.c: > > if (fifo_events) > ret = xen_evtchn_fifo_init(); > if (ret < 0) > xen_evtchn_2l_init(); > > and xen_evtchn_fifo_init fails when EVTCHNOP_init_control fails. This commit > does not change the current default behaviour: announce FIFO event channels > ABI support for guests unless explicitly stated otherwise at domaincreate. > > Signed-off-by: Eslam Elnikety <elnikety@amazon.com> Seeing that there looks to form agreement that restricting evtchn variants just like gnttab ones is wanted (I'm still not really convinced it is the right approach for the problem at hand, but I do agree having such control may be helpful in general), a few remarks on the patch itself: > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -444,6 +444,9 @@ struct domain *domain_create(domid_t domid, > d->controller_pause_count = 1; > atomic_inc(&d->pause_count); > > + if ( d->options & XEN_DOMCTL_CDF_disable_fifo ) > + d->disable_evtchn_fifo = 1; This wants to be "true". > --- a/xen/common/event_channel.c > +++ b/xen/common/event_channel.c > @@ -1170,6 +1170,11 @@ long do_event_channel_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) > > case EVTCHNOP_init_control: { > struct evtchn_init_control init_control; > + > + /* Fail init_control for domains that must use 2l ABI */ > + if ( current->domain->disable_evtchn_fifo ) > + return -ENOSYS; ENOSYS is not an appropriate error code here. EOPNOTSUPP might be, and I guess there are more options (like EPERM or EACCES). > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -338,6 +338,7 @@ struct domain > struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */ > unsigned int max_evtchns; /* number supported by ABI */ > unsigned int max_evtchn_port; /* max permitted port number */ > + bool disable_evtchn_fifo; /* force 2l ABI */ > unsigned int valid_evtchns; /* number of allocated event channels */ > spinlock_t event_lock; > const struct evtchn_port_ops *evtchn_port_ops; I suppose you can find a better place to put this 1-byte field than between two 32-bit ones, leaving a 3-byte hole. Jan
Hi, On 07/08/2019 15:35, Jan Beulich wrote: > On 07.08.2019 13:20, Eslam Elnikety wrote: >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -338,6 +338,7 @@ struct domain >> struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */ >> unsigned int max_evtchns; /* number supported by ABI */ >> unsigned int max_evtchn_port; /* max permitted port number */ >> + bool disable_evtchn_fifo; /* force 2l ABI */ >> unsigned int valid_evtchns; /* number of allocated event channels */ >> spinlock_t event_lock; >> const struct evtchn_port_ops *evtchn_port_ops; > > I suppose you can find a better place to put this 1-byte field > than between two 32-bit ones, leaving a 3-byte hole. Or just remove the field because the same value is already stored in d->options... Cheers,
On 07/08/2019 15:30, Jan Beulich wrote: > On 07.08.2019 15:41, Andrew Cooper wrote: >> On 07/08/2019 12:20, Eslam Elnikety wrote: >>> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h >>> index 19486d5e32..654b4fdd22 100644 >>> --- a/xen/include/public/domctl.h >>> +++ b/xen/include/public/domctl.h >>> @@ -64,6 +64,9 @@ struct xen_domctl_createdomain { >>> /* Is this a xenstore domain? */ >>> #define _XEN_DOMCTL_CDF_xs_domain 4 >>> #define XEN_DOMCTL_CDF_xs_domain (1U<<_XEN_DOMCTL_CDF_xs_domain) >>> + /* Disable FIFO event channels? */ >>> +#define _XEN_DOMCTL_CDF_disable_fifo 5 >>> +#define XEN_DOMCTL_CDF_disable_fifo >>> (1U<<_XEN_DOMCTL_CDF_disable_fifo) >>> uint32_t flags; >> >> On the subject of the the patch itself, I think this is broadly the >> right principle, but wants to be expressed differently. >> >> First, you'll want to rebase onto a very recent master, and specifically >> over c/s d8f2490561eb which has changed how this field is handled in >> Xen. >> >> Furthermore, if there is this problem for event channels, then there is >> almost certainly a related problem for grant tables. >> >> The control in Xen should be expressed in a positive form, or the logic >> will become a tangle. It should be a bit permitting the use of the FIFO >> ABI, rather than a bit saying "oh actually, you can't use that". >> >> That said, it might be easier to declare FIFO to be "event channel v2", >> and specify max_{grant,evntchn}_ver instead. > > I'm not sure assuming linear (or actually any) ordering between > variants is a good thing. Yes, right now we only have gnttab > v1 and v2 and evtchn 2l and fifo, which could be considered v1 > and v2 as you suggest. However, assuming a 3rd variant surfaces, > why would it be that one has to expose v2 just to make v3 > usable? In particular gnttab v2 has various issues (which is why > you introduced a way to disable its use in the first place), yet > I'd hope we'd end up with a less quirky v3 if one ever becomes > necessary. And in turn I'd hope we could hide v2 from any v3 > users. > > IOW I think a bitmap to permit use of "advanced" versions is > more future proof. (As a side note, I don't think we want to > introduce a disable for the respective v1 interfaces.) We absolutely do want a way to turn everything off. The inability to turn the Xen extensions off for HVM guests (even if only for debugging purposes) is a severely short sighted decision. It is also a feature which has been requested repeatedly by users in the past, and I am very deliberately building a way to do this into the CPUID work. However, it is an unreasonable request to bundle into this bugfix, hence why I didn't suggest it. Now I think about it, things like available ABIs really should be in the Xen hypervisor CPUID leaves, but again, that ship sailed a decade ago. ~Andrew
On 07.08.2019 17:00, Andrew Cooper wrote: > On 07/08/2019 15:30, Jan Beulich wrote: >> On 07.08.2019 15:41, Andrew Cooper wrote: >>> Furthermore, if there is this problem for event channels, then there is >>> almost certainly a related problem for grant tables. >>> >>> The control in Xen should be expressed in a positive form, or the logic >>> will become a tangle. It should be a bit permitting the use of the FIFO >>> ABI, rather than a bit saying "oh actually, you can't use that". >>> >>> That said, it might be easier to declare FIFO to be "event channel v2", >>> and specify max_{grant,evntchn}_ver instead. >> >> I'm not sure assuming linear (or actually any) ordering between >> variants is a good thing. Yes, right now we only have gnttab >> v1 and v2 and evtchn 2l and fifo, which could be considered v1 >> and v2 as you suggest. However, assuming a 3rd variant surfaces, >> why would it be that one has to expose v2 just to make v3 >> usable? In particular gnttab v2 has various issues (which is why >> you introduced a way to disable its use in the first place), yet >> I'd hope we'd end up with a less quirky v3 if one ever becomes >> necessary. And in turn I'd hope we could hide v2 from any v3 >> users. >> >> IOW I think a bitmap to permit use of "advanced" versions is >> more future proof. (As a side note, I don't think we want to >> introduce a disable for the respective v1 interfaces.) > > We absolutely do want a way to turn everything off. > > The inability to turn the Xen extensions off for HVM guests (even if > only for debugging purposes) is a severely short sighted decision. For HVM perhaps, but not for PV. > It is also a feature which has been requested repeatedly by users in the > past, and I am very deliberately building a way to do this into the > CPUID work. > > However, it is an unreasonable request to bundle into this bugfix, hence > why I didn't suggest it. There's no bug fix here, as there's no bug (in Xen). > Now I think about it, things like available ABIs really should be in the > Xen hypervisor CPUID leaves, but again, that ship sailed a decade ago. For comparison, do you know of any CPU architecture making _all_ of its basic insns and other functionality available conditionally only? Just like there, I think it is reasonable to have a basic set available in all cases. Nevertheless, as said above, HVM might have benefited from making even some basic hypercalls conditional, because there they're strictly extensions, not base functionality. Jan
> On 7. Aug 2019, at 15:41, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > On 07/08/2019 12:20, Eslam Elnikety wrote: >> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h >> index 19486d5e32..654b4fdd22 100644 >> --- a/xen/include/public/domctl.h >> +++ b/xen/include/public/domctl.h >> @@ -64,6 +64,9 @@ struct xen_domctl_createdomain { >> /* Is this a xenstore domain? */ >> #define _XEN_DOMCTL_CDF_xs_domain 4 >> #define XEN_DOMCTL_CDF_xs_domain (1U<<_XEN_DOMCTL_CDF_xs_domain) >> + /* Disable FIFO event channels? */ >> +#define _XEN_DOMCTL_CDF_disable_fifo 5 >> +#define XEN_DOMCTL_CDF_disable_fifo (1U<<_XEN_DOMCTL_CDF_disable_fifo) >> uint32_t flags; > > On the subject of the the patch itself, I think this is broadly the > right principle, but wants to be expressed differently. > > First, you'll want to rebase onto a very recent master, and specifically > over c/s d8f2490561eb which has changed how this field is handled in Xen. The patch was already on master c/s 0a6ad045c5fe. That said, I was not using the new d->options. Thanks for the hint. v2 will. > > Furthermore, if there is this problem for event channels, then there is > almost certainly a related problem for grant tables. I would concur. For grant tables, there is at least the opt_gnttab_max_version controlled by commandline. It would be nice to have it per-domain. > > The control in Xen should be expressed in a positive form, or the logic > will become a tangle. It should be a bit permitting the use of the FIFO > ABI, rather than a bit saying "oh actually, you can't use that”. I chose the negative form since otherwise the patch will have to enable that bit in many places to retain the current default behaviour of allowing FIFO unless stated otherwise. (The form, as is, is similar to other knobs: feature-disable-keyboard and feature-disable-pointer.) > > That said, it might be easier to declare FIFO to be "event channel v2", > and specify max_{grant,evntchn}_ver instead. > > I'm open to other suggestions as well. > > ~Andrew Andrew, Jan, and Julien, thanks for reviewing and the feedback. I will revise and send v2 over the next couple of days. Thanks, Eslam
On 07/08/2019 16:08, Jan Beulich wrote: > On 07.08.2019 17:00, Andrew Cooper wrote: >> On 07/08/2019 15:30, Jan Beulich wrote: >>> On 07.08.2019 15:41, Andrew Cooper wrote: >>>> Furthermore, if there is this problem for event channels, then >>>> there is >>>> almost certainly a related problem for grant tables. >>>> >>>> The control in Xen should be expressed in a positive form, or the >>>> logic >>>> will become a tangle. It should be a bit permitting the use of the >>>> FIFO >>>> ABI, rather than a bit saying "oh actually, you can't use that". >>>> >>>> That said, it might be easier to declare FIFO to be "event channel >>>> v2", >>>> and specify max_{grant,evntchn}_ver instead. >>> >>> I'm not sure assuming linear (or actually any) ordering between >>> variants is a good thing. Yes, right now we only have gnttab >>> v1 and v2 and evtchn 2l and fifo, which could be considered v1 >>> and v2 as you suggest. However, assuming a 3rd variant surfaces, >>> why would it be that one has to expose v2 just to make v3 >>> usable? In particular gnttab v2 has various issues (which is why >>> you introduced a way to disable its use in the first place), yet >>> I'd hope we'd end up with a less quirky v3 if one ever becomes >>> necessary. And in turn I'd hope we could hide v2 from any v3 >>> users. >>> >>> IOW I think a bitmap to permit use of "advanced" versions is >>> more future proof. (As a side note, I don't think we want to >>> introduce a disable for the respective v1 interfaces.) >> >> We absolutely do want a way to turn everything off. >> >> The inability to turn the Xen extensions off for HVM guests (even if >> only for debugging purposes) is a severely short sighted decision. > > For HVM perhaps, but not for PV. Right... I'm confused as to what in my sentence is in any way unclear. > >> It is also a feature which has been requested repeatedly by users in the >> past, and I am very deliberately building a way to do this into the >> CPUID work. >> >> However, it is an unreasonable request to bundle into this bugfix, hence >> why I didn't suggest it. > > There's no bug fix here, as there's no bug (in Xen). ? I didn't say it was a bug in Xen, but the change is specifically to fix a bug. > >> Now I think about it, things like available ABIs really should be in the >> Xen hypervisor CPUID leaves, but again, that ship sailed a decade ago. > > For comparison, do you know of any CPU architecture making _all_ > of its basic insns and other functionality available conditionally > only? No, but that's also not what I'm suggesting. For an HVM guest, literally everything Xen-specific, from hypercalls to CPUID leaves to xenstored and PV drivers, is an extension on a base IBM-compatible PC. There should always have been a way of of running an HVM guest without any of this, and we (the Xen community) really do have users which want to be able to do this. ~Andrew
On 07.08.2019 17:57, Andrew Cooper wrote: > On 07/08/2019 16:08, Jan Beulich wrote: >> On 07.08.2019 17:00, Andrew Cooper wrote: >>> On 07/08/2019 15:30, Jan Beulich wrote: >>>> On 07.08.2019 15:41, Andrew Cooper wrote: >>>>> Furthermore, if there is this problem for event channels, then >>>>> there is >>>>> almost certainly a related problem for grant tables. >>>>> >>>>> The control in Xen should be expressed in a positive form, or the >>>>> logic >>>>> will become a tangle. It should be a bit permitting the use of the >>>>> FIFO >>>>> ABI, rather than a bit saying "oh actually, you can't use that". >>>>> >>>>> That said, it might be easier to declare FIFO to be "event channel >>>>> v2", >>>>> and specify max_{grant,evntchn}_ver instead. >>>> >>>> I'm not sure assuming linear (or actually any) ordering between >>>> variants is a good thing. Yes, right now we only have gnttab >>>> v1 and v2 and evtchn 2l and fifo, which could be considered v1 >>>> and v2 as you suggest. However, assuming a 3rd variant surfaces, >>>> why would it be that one has to expose v2 just to make v3 >>>> usable? In particular gnttab v2 has various issues (which is why >>>> you introduced a way to disable its use in the first place), yet >>>> I'd hope we'd end up with a less quirky v3 if one ever becomes >>>> necessary. And in turn I'd hope we could hide v2 from any v3 >>>> users. >>>> >>>> IOW I think a bitmap to permit use of "advanced" versions is >>>> more future proof. (As a side note, I don't think we want to >>>> introduce a disable for the respective v1 interfaces.) >>> >>> We absolutely do want a way to turn everything off. >>> >>> The inability to turn the Xen extensions off for HVM guests (even if >>> only for debugging purposes) is a severely short sighted decision. >> >> For HVM perhaps, but not for PV. > > Right... > > I'm confused as to what in my sentence is in any way unclear. I'm sorry, I must have been completely blind to the "HVM" in what you've said. >>> It is also a feature which has been requested repeatedly by users in the >>> past, and I am very deliberately building a way to do this into the >>> CPUID work. >>> >>> However, it is an unreasonable request to bundle into this bugfix, hence >>> why I didn't suggest it. >> >> There's no bug fix here, as there's no bug (in Xen). > > ? I didn't say it was a bug in Xen, but the change is specifically to > fix a bug. Whatever we do in Xen, it'll only allow to work around that issue. An actual fix belongs in the kernel(s). For this reason I suppose what we're talking about here is a feature (from Xen's pov), not a bug fix. And it being a feature, it should preferably be coded in a way that's usable also going forward. Jan
On 8/7/19 5:03 PM, Jan Beulich wrote: > Whatever we do in Xen, it'll only allow to work around that issue. > An actual fix belongs in the kernel(s). For this reason I suppose > what we're talking about here is a feature (from Xen's pov), not a > bug fix. And it being a feature, it should preferably be coded in > a way that's usable also going forward. FWIW, I agree with what I understand Jan to be saying: 1. It would be good to be able to disable FIFO event channels, but 2. Disabling them in Xen isn't really the right way to fix Amazon's issue. Rather, probably the soft reboot should reset the event channel state. -George
On 14/08/2019 13:51, George Dunlap wrote: > On 8/7/19 5:03 PM, Jan Beulich wrote: >> Whatever we do in Xen, it'll only allow to work around that issue. >> An actual fix belongs in the kernel(s). For this reason I suppose >> what we're talking about here is a feature (from Xen's pov), not a >> bug fix. And it being a feature, it should preferably be coded in >> a way that's usable also going forward. > FWIW, I agree with what I understand Jan to be saying: > > 1. It would be good to be able to disable FIFO event channels, but > > 2. Disabling them in Xen isn't really the right way to fix Amazon's > issue. Rather, probably the soft reboot should reset the event channel > state. Depends what you mean about "fix the issue". Amazon have real customer VMs in this situation which will break on a Xen old => new upgrade. Modifying Xen is the only rational option. They are also doing this in an upstream compatible manner (which is great). One way or another, Xen needs to gain this ability to work around broken-linux which is already in the field. As for the ideal way to fix this, this bug has existed in Linux longer than soft-reset has been a thing, and frankly, its still a gruesome hack. We need some interfaces which don't suck so terribly. ~Andrew
On 14.08.19 15:02, Andrew Cooper wrote: > On 14/08/2019 13:51, George Dunlap wrote: >> On 8/7/19 5:03 PM, Jan Beulich wrote: >>> Whatever we do in Xen, it'll only allow to work around that issue. >>> An actual fix belongs in the kernel(s). For this reason I suppose >>> what we're talking about here is a feature (from Xen's pov), not a >>> bug fix. And it being a feature, it should preferably be coded in >>> a way that's usable also going forward. >> FWIW, I agree with what I understand Jan to be saying: >> >> 1. It would be good to be able to disable FIFO event channels, but >> >> 2. Disabling them in Xen isn't really the right way to fix Amazon's >> issue. Rather, probably the soft reboot should reset the event channel >> state. > > Depends what you mean about "fix the issue". > > Amazon have real customer VMs in this situation which will break on a > Xen old => new upgrade. Modifying Xen is the only rational option. > > They are also doing this in an upstream compatible manner (which is > great). One way or another, Xen needs to gain this ability to work > around broken-linux which is already in the field. > > As for the ideal way to fix this, this bug has existed in Linux longer > than soft-reset has been a thing, and frankly, its still a gruesome > hack. We need some interfaces which don't suck so terribly. > > ~Andrew > Thanks, Andrew. I second those points. I have just sent v3 of this patch, mostly addressing comments from Jan. Have a look, and let me know if there are further tweaks you would rather see. Thanks. Cheers, Eslam
diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in index c99d40307e..f204d8b4f0 100644 --- a/docs/man/xl.cfg.5.pod.in +++ b/docs/man/xl.cfg.5.pod.in @@ -1262,6 +1262,11 @@ FIFO-based event channel ABI support up to 131,071 event channels. Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit x86). +=item B<disable_evtchn_fifo=BOOLEAN> + +Indicates if support for FIFO event channel ABI is disabled. The default +is false (0). + =item B<vdispl=[ "VDISPL_SPEC_STRING", "VDISPL_SPEC_STRING", ...]> Specifies the virtual display devices to be provided to the guest. diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index 9bacfb97f0..75b2ee3d1b 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -169,6 +169,14 @@ */ #define LIBXL_HAVE_BUILDINFO_EVENT_CHANNELS 1 +/* + * LIBXL_HAVE_BUILDINFO_DISABLE_EVTCHN_FIFO indicates that the + * libxl_domain_build_info structure contains a boolean + * disable_evtchn_fifo which instructs libxl to enable/disable + * support for FIFO event channel ABI at create time. + */ +#define LIBXL_HAVE_BUILDINFO_DISABLE_EVTCHN_FIFO 1 + /* * libxl_domain_build_info has the u.hvm.ms_vm_genid field. */ diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 03ce166f4f..aa87e45643 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -217,6 +217,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, if (!b_info->event_channels) b_info->event_channels = 1023; + libxl_defbool_setdefault(&b_info->disable_evtchn_fifo, false); + libxl__arch_domain_build_info_setdefault(gc, b_info); libxl_defbool_setdefault(&b_info->dm_restrict, false); @@ -564,6 +566,9 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config, libxl_defbool_val(info->oos) ? 0 : XEN_DOMCTL_CDF_oos_off; } + if (libxl_defbool_val(b_info->disable_evtchn_fifo)) + create.flags |= XEN_DOMCTL_CDF_disable_fifo; + /* Ultimately, handle is an array of 16 uint8_t, same as uuid */ libxl_uuid_copy(ctx, (libxl_uuid *)&create.handle, &info->uuid); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index b61399ce36..5f30570443 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -521,6 +521,7 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("iomem", Array(libxl_iomem_range, "num_iomem")), ("claim_mode", libxl_defbool), ("event_channels", uint32), + ("disable_evtchn_fifo", libxl_defbool), ("kernel", string), ("cmdline", string), ("ramdisk", string), diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c index e105bda2bb..bcf16c31d4 100644 --- a/tools/xl/xl_parse.c +++ b/tools/xl/xl_parse.c @@ -1496,6 +1496,8 @@ void parse_config_data(const char *config_source, if (!xlu_cfg_get_long(config, "max_event_channels", &l, 0)) b_info->event_channels = l; + xlu_cfg_get_defbool(config, "disable_evtchn_fifo", + &b_info->disable_evtchn_fifo, 0); xlu_cfg_replace_string (config, "kernel", &b_info->kernel, 0); xlu_cfg_replace_string (config, "ramdisk", &b_info->ramdisk, 0); xlu_cfg_replace_string (config, "device_tree", &b_info->device_tree, 0); diff --git a/tools/xl/xl_sxp.c b/tools/xl/xl_sxp.c index 359a001570..52e98c6c61 100644 --- a/tools/xl/xl_sxp.c +++ b/tools/xl/xl_sxp.c @@ -71,6 +71,8 @@ void printf_info_sexp(int domid, libxl_domain_config *d_config, FILE *fh) fprintf(fh, "\t(target_memkb %"PRId64")\n", b_info->target_memkb); fprintf(fh, "\t(nomigrate %s)\n", libxl_defbool_to_string(b_info->disable_migrate)); + fprintf(fh, "\t(disable_evtchn_fifo %s)\n", + libxl_defbool_to_string(b_info->disable_evtchn_fifo)); if (c_info->type == LIBXL_DOMAIN_TYPE_PV && b_info->bootloader) { fprintf(fh, "\t(bootloader %s)\n", b_info->bootloader); diff --git a/xen/common/domain.c b/xen/common/domain.c index 744b572195..d54674e28c 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -444,6 +444,9 @@ struct domain *domain_create(domid_t domid, d->controller_pause_count = 1; atomic_inc(&d->pause_count); + if ( d->options & XEN_DOMCTL_CDF_disable_fifo ) + d->disable_evtchn_fifo = 1; + if ( (err = evtchn_init(d, config->max_evtchn_port)) != 0 ) goto fail; init_status |= INIT_evtchn; diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c index e86e2bfab0..ce3dbb90ab 100644 --- a/xen/common/event_channel.c +++ b/xen/common/event_channel.c @@ -1170,6 +1170,11 @@ long do_event_channel_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case EVTCHNOP_init_control: { struct evtchn_init_control init_control; + + /* Fail init_control for domains that must use 2l ABI */ + if ( current->domain->disable_evtchn_fifo ) + return -ENOSYS; + if ( copy_from_guest(&init_control, arg, 1) != 0 ) return -EFAULT; rc = evtchn_fifo_init_control(&init_control); diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 19486d5e32..654b4fdd22 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -64,6 +64,9 @@ struct xen_domctl_createdomain { /* Is this a xenstore domain? */ #define _XEN_DOMCTL_CDF_xs_domain 4 #define XEN_DOMCTL_CDF_xs_domain (1U<<_XEN_DOMCTL_CDF_xs_domain) + /* Disable FIFO event channels? */ +#define _XEN_DOMCTL_CDF_disable_fifo 5 +#define XEN_DOMCTL_CDF_disable_fifo (1U<<_XEN_DOMCTL_CDF_disable_fifo) uint32_t flags; /* diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 2e6e0d3488..fcdd802665 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -338,6 +338,7 @@ struct domain struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */ unsigned int max_evtchns; /* number supported by ABI */ unsigned int max_evtchn_port; /* max permitted port number */ + bool disable_evtchn_fifo; /* force 2l ABI */ unsigned int valid_evtchns; /* number of allocated event channels */ spinlock_t event_lock; const struct evtchn_port_ops *evtchn_port_ops;
Adding support for FIFO event channel ABI was first introduced in Xen 4.4 (see 88910061ec6). Make this support tunable, since the choice of which event channel ABI has implications for hibernation. Consider resuming a pre Xen 4.4 hibernated Linux guest. The guest boot kernel defaults to FIFO ABI, whereas the resume kernel assumes 2L. This, in turn, results in Xen and the resumed kernel talking past each other (due to different protocols FIFO vs 2L). In order to announce to guests that the event channel ABI does not support FIFO, the hypervisor returns ENOSYS on init_control operation. When this operation fails, the guest should continue to use the 2L event channel ABI. For example, in Linux drivers/xen/events/events_base.c: if (fifo_events) ret = xen_evtchn_fifo_init(); if (ret < 0) xen_evtchn_2l_init(); and xen_evtchn_fifo_init fails when EVTCHNOP_init_control fails. This commit does not change the current default behaviour: announce FIFO event channels ABI support for guests unless explicitly stated otherwise at domaincreate. Signed-off-by: Eslam Elnikety <elnikety@amazon.com> --- docs/man/xl.cfg.5.pod.in | 5 +++++ tools/libxl/libxl.h | 8 ++++++++ tools/libxl/libxl_create.c | 5 +++++ tools/libxl/libxl_types.idl | 1 + tools/xl/xl_parse.c | 2 ++ tools/xl/xl_sxp.c | 2 ++ xen/common/domain.c | 3 +++ xen/common/event_channel.c | 5 +++++ xen/include/public/domctl.h | 3 +++ xen/include/xen/sched.h | 1 + 10 files changed, 35 insertions(+)