
[v6 for Xen 4.7 1/4] xen: enable per-VCPU parameter settings for RTDS scheduler

Message ID 1457433240.3102.161.camel@citrix.com (mailing list archive)
State New, archived

Commit Message

Dario Faggioli March 8, 2016, 10:34 a.m. UTC
On Tue, 2016-03-08 at 02:10 -0700, Jan Beulich wrote:
> > > > On 07.03.16 at 18:53, <dario.faggioli@citrix.com> wrote:
> > On Mon, 2016-03-07 at 09:40 -0700, Jan Beulich wrote:
> > > 
> > IIRC, I was looking at how XEN_SYSCTL_pcitopoinfo is handled, for
> > reference, and that has some guest_handle_is_null()==>EINVAL sanity
> > checking (in xen/common/sysctl.c), which, when I thought about it,
> > made sense to me.
> > 
> > My reasoning was, sort of:
> >  1. if the handle is NULL, no point getting into the somewhat
> >     complicated logic of the while loop,
> >  2. more accurate error reporting: being passed a NULL handle
> >     looked like something we could identify and call invalid,
> >     rather than waiting for the copy to fault.
> I think the XEN_SYSCTL_pcitopoinfo was misguided in this respect,
> cloning non applicable logic here which returns the number of needed
> (array) elements in such a case for a few other operations.
> 
Sorry, I'm not sure I follow: are you saying that, for _these_
domctls, we should consider a NULL handle as a way for the caller to
ask for the size of the array?

*If* yes, well, that is "just" the number of vcpus of the guest but,
FWIW, that looks fine to me.
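To make that concrete, here is a minimal, purely illustrative sketch of
what rt_dom_cntl() could do in such a case (field names taken from this
series; guest_handle_is_null() is the usual accessor):

    /* Illustrative only: a NULL handle means "tell me how many
     * elements the array needs to have". */
    if ( guest_handle_is_null(op->u.v.vcpus) )
    {
        op->u.v.nr_vcpus = d->max_vcpus;
        rc = 0;
        break;
    }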

> > About the structure of the code, as said above, I do like
> > how XEN_SYSCTL_pcitopoinfo ended up being handled, I think it is a
> > great fit for this specific case and, comparing both this and the
> > previous version, I do think this one is (bugs aside) looking
> > better.
> > 
> > I'm sure I said this --long ago-- when discussing v4 (and maybe
> > even
> > previous versions), as well as more recently, when reviewing v5,
> > and
> > that's why Chong (finally! :-D) did it.
> > 
> > So, with the comment in place (and with bugs fixed :-)), are you
> > (Jan)
> > ok with this being done this way?
>
> Well, this _might_ be acceptable for "get" (since the caller
> abandoning the sequence of calls prematurely is no problem),
> but for "set" it looks less suitable, as similar abandoning would
> leave the guest in some inconsistent / unintended state.
>
Are you referring to the fact that, with this interface, the caller
can bail out intentionally part way through, or to the possibility
that not all vcpus are updated because of some bug (still in the
caller)?

Well, if it's intentional, or even if the caller is buggy in the sense
that its code misses updating some vcpus (and if the interface and the
behavior are well documented, as you request), then one gets what one
"wants" (and, in the latter case, it wouldn't be too hard to debug and
figure out the issue, I think).

If it's bugs (still in the caller) like copy_from_guest_offset()
faulting because the array is too small, that can happen when using
continuations too, can't it? And it would still leave the guest in a
similarly inconsistent or unintended state, IMO...

One last point. Of course, since we are talking about bugs, the final
status is not the one desired, but it's not inconsistent in the sense
that the guest can't continue running, or crashes, or anything like
that. It's something like:
 - you want all 32 vcpus of guest A to have these new parameters
 - due to a bug, you're (for instance) passing me an array with
   parameters for only 16 vcpus
 - result: only 16 vcpus will have the new parameters.

>  The
> issue with XEN_SYSCTL_pcitopoinfo was, iirc, the lack of a
> good way of encoding the continuation information, and while
> that would seem applicable here too I'm not sure now whether
> doing it the way it was done was the best choice. 
>
As far as I can remember and see, it was being done by means of an
additional dedicated field in the interface structure (ti->first_dev
in that case). Then, at some point, you said:

http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02623.html
"Considering this is a tools only interface, enforcing a not too high
 limit on num_devs would seem better than this not really clean
 continuation mechanism. The (tool stack) caller(s) can be made
 iterate."

I agreed with that (and I still do :-)), and I also agree that we are
basically in the same situation here.
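
Just to illustrate what "making the caller iterate" means in practice,
here is a sketch of a toolstack-side loop (xc_sched_rtds_vcpu_set() and
the parameter struct name are hypothetical placeholders, not code from
this series):

    /* Hypothetical libxc-level helper: push the per-vcpu parameters
     * down in fixed-size chunks, one domctl per chunk. */
    #define CHUNK 16

    static int set_rtds_vcpu_params(xc_interface *xch, uint32_t domid,
                                    struct xen_domctl_schedparam_vcpu *params,
                                    uint32_t nr_vcpus)
    {
        uint32_t done = 0;

        while ( done < nr_vcpus )
        {
            uint32_t n = nr_vcpus - done > CHUNK ? CHUNK : nr_vcpus - done;
            int rc = xc_sched_rtds_vcpu_set(xch, domid, params + done, n);

            if ( rc )
                return rc;
            done += n;
        }

        return 0;
    }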

Chong tried doing things with continuations for a few rounds, including
v5, which is here:
http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00817.html

and he also used an additional field (vcpu_index).

So, all this being said, my preference stays with the way the code
looks in this version (with all the due commenting added). Of course,
it's your preference that really matters here, me not being the
maintainer of this code. :-)

So, how would you prefer Chong to proceed with this?

> Clearly
> stating (in the public interface header) that certain normally
> input-only fields are volatile would allow the continuation to
> be handled without tool stack assistance afaict.
>
Which (sorry, I'm not following again) I guess is something different
from/more than what was done in v5 (the relevant hunks of which I'm
pasting at the bottom of this email, for convenience)?

> > BTW, Chong, I'm not sure this has to do with what Jan is saying,
> > but
> > looking again at XEN_SYSCTL_pcitopoinfo, it looks to me like you're
> > missing copying nr_vcpus back up to the guest (which is actually
> > what lets libxc know whether all vcpus have been processed or not).
> Indeed that is why the conditional makes sense there, but not here.
> And the copying back is already being taken care of by the caller of
> sched_adjust().
> 
Indeed. So your original point was: we're always copying back, so it's
fine to always update the field, except when an error has occurred. I
do get it now.

Regards,
Dario

---

Comments

Jan Beulich March 8, 2016, 11:47 a.m. UTC | #1
>>> On 08.03.16 at 11:34, <dario.faggioli@citrix.com> wrote:
> On Tue, 2016-03-08 at 02:10 -0700, Jan Beulich wrote:
>> > > > On 07.03.16 at 18:53, <dario.faggioli@citrix.com> wrote:
>> > On Mon, 2016-03-07 at 09:40 -0700, Jan Beulich wrote:
>> > > 
>> > IIRC, I was looking at how XEN_SYSCTL_pcitopoinfo is handled, for
>> > reference, and that has some guest_handle_is_null()==>EINVAL sanity
>> > checking (in xen/common/sysctl.c), which, when I thought about it,
>> > made sense to me.
>> > 
>> > My reasoning was, sort of:
>> >  1. if the handle is NULL, no point getting into the somewhat
>> >     complicated logic of the while loop,
>> >  2. more accurate error reporting: being passed a NULL handle
>> >     looked like something we could identify and call invalid,
>> >     rather than waiting for the copy to fault.
>> I think the XEN_SYSCTL_pcitopoinfo was misguided in this respect,
>> cloning non applicable logic here which returns the number of needed
>> (array) elements in such a case for a few other operations.
>> 
> Sorry, I'm not sure I follow: are you saying that, for _these_
> domctls, we should consider a NULL handle as a way for the caller to
> ask for the size of the array?

No, I've tried to point out that _when_ such behavior is intended,
the special casing of a null handle is warranted. But not (normally)
in other cases.

>> > About the structure of the code, as said above, I do like
>> > how XEN_SYSCTL_pcitopoinfo ended up being handled, I think it is a
>> > great fit for this specific case and, comparing both this and the
>> > previous version, I do think this one is (bugs aside) looking
>> > better.
>> > 
>> > I'm sure I said this --long ago-- when discussing v4 (and maybe
>> > even
>> > previous versions), as well as more recently, when reviewing v5,
>> > and
>> > that's why Chong (finally! :-D) did it.
>> > 
>> > So, with the comment in place (and with bugs fixed :-)), are you
>> > (Jan)
>> > ok with this being done this way?
>>
>> Well, this _might_ be acceptable for "get" (since the caller
>> abandoning the sequence of calls prematurely is no problem),
>> but for "set" it looks less suitable, as similar abandoning would
>> leave the guest in some inconsistent / unintended state.
>>
> Are you referring to the fact that, with this interface, the caller
> can bail out intentionally part way through, or to the possibility
> that not all vcpus are updated because of some bug (still in the
> caller)?
> 
> Well, if it's intentional, or even if the caller is buggy in the sense
> that its code misses updating some vcpus (and if the interface and the
> behavior are well documented, as you request), then one gets what one
> "wants" (and, in the latter case, it wouldn't be too hard to debug and
> figure out the issue, I think).

Intentional or buggy doesn't matter much - if we can avoid making
ourselves dependent upon user mode code behaving well, I think
we should.

> If it's bugs (still in the caller) like copy_from_guest_offset()
> faulting because the array is too small, that can happen when using
> continuations too, can't it? And it would still leave the guest in a
> similarly inconsistent or unintended state, IMO...

True, albeit that's what error indications are for.

> One last point. Of course, since we are talking about bugs, the final
> status is not the one desired, but it's not inconsistent in the sense
> that the guest can't continue running, or crashes, or anything like
> that. It's something like:
>  - you want all 32 vcpus of guest A to have these new parameters
>  - due to a bug, you're (for instance) passing me an array with
>    parameters for only 16 vcpus
>  - result: only 16 vcpus will have the new parameters.

That was my understanding, yes.

>>  The
>> issue with XEN_SYSCTL_pcitopoinfo was, iirc, the lack of a
>> good way of encoding the continuation information, and while
>> that would seem applicable here too I'm not sure now whether
>> doing it the way it was done was the best choice. 
>>
> As far as I can remember and see, it was being done by means of an
> additional dedicated field in the interface structure (ti->first_dev
> in that case). Then, at some point, you said:
> 
> http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02623.html 
> "Considering this is a tools only interface, enforcing a not too high
>  limit on num_devs would seem better than this not really clean
>  continuation mechanism. The (tool stack) caller(s) can be made
>  iterate."
> 
> I agreed with that (and I still do :-)), and I also agree that we are
> basically in the same situation here.
> 
> Chong tried doing things with continuations for a few rounds, including
> v5, which is here:
> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00817.html 
> 
> and he also used an additional field (vcpu_index).
> 
> So, all this being said, my preference stays with the way the code
> looks in this version (with all the due commenting added). Of course,
> it's your preference that really matters here, me not being the
> maintainer of this code. :-)
> 
> So, how would you prefer Chong to proceed with this?

Well, modeling this after the other sysctl makes it something I
can't reasonably reject. Hence what I've been saying so far were
merely suggestions, as - unlike you - I'm no longer convinced the
model used there was a good one. All the more so since here -
afaict - no extra fields would be needed: the continuation
encoding would simply consist of op->u.v.nr_vcpus getting
suitably decreased and op->u.v.vcpus getting suitably bumped.
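
I.e. something along these lines (only a sketch; error handling and
the copy-back of op done by do_domctl() are left implicit):

            if ( hypercall_preempt_check() )
            {
                /*
                 * No extra field: record progress in the existing ones,
                 * so the loop can always (re)start from index 0.
                 */
                op->u.v.nr_vcpus -= index;
                guest_handle_add_offset(op->u.v.vcpus, index);
                rc = -ERESTART;
                break;
            }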

>> Clearly
>> stating (in the public interface header) that certain normally
>> input-only fields are volatile would allow the continuation to
>> be handled without tool stack assistance afaict.
>>
> Which (sorry, I'm not following again) I guess is something different
> from/more than what was done in v5 (the relevant hunks of which I'm
> pasting at the bottom of this email, for convenience)?

The hunks you had included didn't cover the public interface
header changes. My point here is: Generally we demand that
fields in public interface structures not documented as OUT
would not get modified by a hypercall. Since I'm suggesting to
use both fields to encode continuation information, that rule
would get violated, which therefore needs spelling out in a
comment in the header.
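
Something along these lines (field layout, wording and type names here
are just a sketch of what would need adding to
xen/include/public/domctl.h):

        struct {
            /*
             * IN/OUT: normally input-only, but the hypervisor may
             * update both fields to encode continuation information,
             * so callers must not expect them to be preserved.
             */
            XEN_GUEST_HANDLE_64(xen_domctl_schedparam_vcpu_t) vcpus;
            uint32_t nr_vcpus;
            uint32_t padding;
        } v;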

Jan

Patch

diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 46b967e..b294221 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -847,9 +847,14 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
     }
 
     case XEN_DOMCTL_scheduler_op:
+    {
         ret = sched_adjust(d, &op->u.scheduler_op);
+        if ( ret == -ERESTART )
+            ret = hypercall_create_continuation(
+                __HYPERVISOR_domctl, "h", u_domctl);
         copyback = 1;
         break;
+    }
 
     case XEN_DOMCTL_getdomaininfo:
     {

diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 3f1d047..34ae48d 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -1163,6 +1168,94 @@  rt_dom_cntl(
         }
         spin_unlock_irqrestore(&prv->lock, flags);
         break;
+    case XEN_DOMCTL_SCHEDOP_getvcpuinfo: 
+        for ( index = op->u.v.vcpu_index; index < op->u.v.nr_vcpus; index++ )
+        {
+            spin_lock_irqsave(&prv->lock, flags);
+            if ( copy_from_guest_offset(&local_sched,
+                          op->u.v.vcpus, index, 1) )
+            {
+                rc = -EFAULT;
+                spin_unlock_irqrestore(&prv->lock, flags);
+                break;
+            }
+            if ( local_sched.vcpuid >= d->max_vcpus ||
+                          d->vcpu[local_sched.vcpuid] == NULL )
+            {
+                rc = -EINVAL;
+                spin_unlock_irqrestore(&prv->lock, flags);
+                break;
+            }
+            svc = rt_vcpu(d->vcpu[local_sched.vcpuid]);
+
+            local_sched.s.rtds.budget = svc->budget / MICROSECS(1);
+            local_sched.s.rtds.period = svc->period / MICROSECS(1);
+
+            if ( __copy_to_guest_offset(op->u.v.vcpus, index,
+                    &local_sched, 1) )
+            {
+                rc = -EFAULT;
+                spin_unlock_irqrestore(&prv->lock, flags);
+                break;
+            }
+            spin_unlock_irqrestore(&prv->lock, flags);
+            if ( hypercall_preempt_check() )
+            {
+                op->u.v.vcpu_index = index + 1;
+                /* hypercall (after preemption) will continue at vcpu_index */
+                rc = -ERESTART;
+                break;
+            }
+        }
+        break;
+    case XEN_DOMCTL_SCHEDOP_putvcpuinfo:
+        for ( index = op->u.v.vcpu_index; index < op->u.v.nr_vcpus; index++ )
+        {
+            spin_lock_irqsave(&prv->lock, flags);
+            if ( copy_from_guest_offset(&local_sched,
+                          op->u.v.vcpus, index, 1) )
+            {
+                rc = -EFAULT;
+                spin_unlock_irqrestore(&prv->lock, flags);
+                break;
+            }
+            if ( local_sched.vcpuid >= d->max_vcpus ||
+                          d->vcpu[local_sched.vcpuid] == NULL )
+            {
+                rc = -EINVAL;
+                spin_unlock_irqrestore(&prv->lock, flags);
+                break;
+            }
+            svc = rt_vcpu(d->vcpu[local_sched.vcpuid]);
+            period = MICROSECS(local_sched.s.rtds.period);
+            budget = MICROSECS(local_sched.s.rtds.budget);
+            if ( period > RTDS_MAX_PERIOD || budget < RTDS_MIN_BUDGET ||
+                          budget > period )
+            {
+                rc = -EINVAL;
+                spin_unlock_irqrestore(&prv->lock, flags);
+                break;
+            }
+
+            /* 
+             * We accept period/budget less than 100 us, but will warn users about
+             * the large scheduling overhead due to it
+             */
+            if ( period < MICROSECS(100) || budget < MICROSECS(100) )
+                printk("Warning: period/budget less than 100 micro-secs "
+                       "results in large scheduling overhead.\n");
+
+            svc->period = period;
+            svc->budget = budget;
+            spin_unlock_irqrestore(&prv->lock, flags);
+            if ( hypercall_preempt_check() )
+            {
+                op->u.v.vcpu_index = index + 1;
+                rc = -ERESTART;
+                break;
+            }
+        }
+        break;
     }
 
     return rc;