Message ID | 1473312603-28581-1-git-send-email-dongli.zhang@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, 2016-09-08 at 13:30 +0800, Dongli Zhang wrote: > diff --git a/xen/common/memory.c b/xen/common/memory.c > index f34dd56..3641469 100644 > @@ -150,6 +152,12 @@ static void populate_physmap(struct memop_args > *a) > max_order(curr_d)) ) > return; > > + /* MEMF_no_tlbflush can be set only during vm creation phase > when > + * already_scheduled is still 0 before this domain gets > scheduled for > + * the first time. */ > /* * Comment style for multi line comments in Xen * includes the 'wings'. :-) */ Yes, I know there's some inconsistency in this file (and in many others :-/), but still. > + if ( d->already_scheduled == 0 ) > unlikely() maybe? > + a->memflags |= MEMF_no_tlbflush; > + > for ( i = a->nr_done; i < a->nr_extents; i++ ) > { > if ( i != a->nr_done && hypercall_preempt_check() ) > @@ -214,6 +222,21 @@ static void populate_physmap(struct memop_args > *a) > goto out; > } > > + if ( d->already_scheduled == 0 ) > + { > + for ( j = 0; j < (1U << a->extent_order); j++ ) > + { > + if ( page[j].u.free.need_tlbflush && > + (page[j].tlbflush_timestamp <= > tlbflush_current_time()) && > + (!need_tlbflush || > + (page[j].tlbflush_timestamp > > tlbflush_timestamp)) ) > This check is long, complicated to read (at least to a non TLBflush guru), and also appear twice.. can it be put in an inline function with a talking name? Oh, and I think you don't need the parenthesis around these twos: (page[j].tlbflush_timestamp <= tlbflush_current_time()) (page[j].tlbflush_timestamp > tlbflush_timestamp) > + { > + need_tlbflush = 1; > + tlbflush_timestamp = > page[j].tlbflush_timestamp; > + } > + } > + } > + > mfn = page_to_mfn(page); > } > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > index 32a300f..593541a 100644 > @@ -1376,6 +1376,11 @@ static void schedule(void) > > next = next_slice.task; > > + /* Set already_scheduled to 1 when this domain gets scheduled > for the > + * first time */ > Wings again. And, about the content, it's already clear from the code that this gets set when a vcpu of a domain is scheduled. What we want here is a _quick_ explanation of why we need the scheduler to record this information. > + if ( next->domain->already_scheduled == 0 ) > unlikely() (and here I'm sure :-)). > + next->domain->already_scheduled = 1; > + > And, finally, I'd move this toward the bottom of the function, outside of the pcpu_schedule_lock() critical section, e.g., around the call to vcpu_periodic_timer_work(next); > sd->curr = next; > > if ( next_slice.time >= 0 ) /* -ve means no limit */ Regards, Dario
On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote: > This patch implemented parts of TODO left in commit id > a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moved TLB-flush filtering out > into populate_physmap. Because of TLB-flush in alloc_heap_pages, it's very > slow to create a guest with memory size of more than 100GB on host with > 100+ cpus. > > This patch introduced a "MEMF_no_tlbflush" bit to memflags to indicate > whether TLB-flush should be done in alloc_heap_pages or its caller > populate_physmap. Once this bit is set in memflags, alloc_heap_pages will > ignore TLB-flush. To use this bit after vm is created might lead to > security issue, that is, this would make pages accessible to the guest B, > when guest A may still have a cached mapping to them. > > Therefore, this patch also introduced a "already_scheduled" field to struct > domain to indicate whether this domain has ever got scheduled by > hypervisor. MEMF_no_tlbflush can be set only during vm creation phase when > already_scheduled is still 0 before this domain gets scheduled for the > first time. > > TODO: ballooning very huge amount of memory cannot benefit from this patch > and might still be slow. > > Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> > > --- > Changed since v2: > * Limit this optimization to domain creation time. > > --- > xen/common/domain.c | 2 ++ > xen/common/memory.c | 33 +++++++++++++++++++++++++++++++++ > xen/common/page_alloc.c | 3 ++- > xen/common/schedule.c | 5 +++++ > xen/include/xen/mm.h | 2 ++ > xen/include/xen/sched.h | 3 +++ > 6 files changed, 47 insertions(+), 1 deletion(-) > > diff --git a/xen/common/domain.c b/xen/common/domain.c > index a8804e4..611a471 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -303,6 +303,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags, > if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) ) > goto fail; > > + d->already_scheduled = 0; > + Use false please -- this is a bool_t. [...] > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > index 32a300f..593541a 100644 > --- a/xen/common/schedule.c > +++ b/xen/common/schedule.c > @@ -1376,6 +1376,11 @@ static void schedule(void) > > next = next_slice.task; > > + /* Set already_scheduled to 1 when this domain gets scheduled for the > + * first time */ > + if ( next->domain->already_scheduled == 0 ) > + next->domain->already_scheduled = 1; > + Can be simplified by omitting the "if" altogether. And use "true" here. Wei.
On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote: > On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote: > > > > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > > index 32a300f..593541a 100644 > > --- a/xen/common/schedule.c > > +++ b/xen/common/schedule.c > > @@ -1376,6 +1376,11 @@ static void schedule(void) > > > > next = next_slice.task; > > > > + /* Set already_scheduled to 1 when this domain gets scheduled > > for the > > + * first time */ > > + if ( next->domain->already_scheduled == 0 ) > > + next->domain->already_scheduled = 1; > > + > Can be simplified by omitting the "if" altogether. > Are you sure? I mean looking at the cases when the flag is already true (which means, during the life of a domain, basically **always** except a handful of instances after creation), what costs less, a check that is always false, or a write that is always updating a value with its current value? And I'm not being ironic or anything, I honestly am not sure and this is a genuine question. > And use "true" here. > Yeah, or just: if ( unlikely(!next->domain->already_scheduled) ) ... > Wei.
On Thu, Sep 08, 2016 at 01:01:40PM +0200, Dario Faggioli wrote: > On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote: > > On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote: > > > > > > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > > > index 32a300f..593541a 100644 > > > --- a/xen/common/schedule.c > > > +++ b/xen/common/schedule.c > > > @@ -1376,6 +1376,11 @@ static void schedule(void) > > > > > > next = next_slice.task; > > > > > > + /* Set already_scheduled to 1 when this domain gets scheduled > > > for the > > > + * first time */ > > > + if ( next->domain->already_scheduled == 0 ) > > > + next->domain->already_scheduled = 1; > > > + > > Can be simplified by omitting the "if" altogether. > > > Are you sure? I mean looking at the cases when the flag is already true > (which means, during the life of a domain, basically **always** except > a handful of instances after creation), what costs less, a check that > is always false, or a write that is always updating a value with its > current value? Omitting the check certain results in less instructions. And it would probably eliminate misses in instruction cache and branch prediction logic in the processor. In the grand scheme of things, this is a rather minor optimisation, so I wouldn't argue strongly for this. Wei. > > And I'm not being ironic or anything, I honestly am not sure and this > is a genuine question. > > > And use "true" here. > > > Yeah, or just: > > if ( unlikely(!next->domain->already_scheduled) ) > ... > > > Wei. > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >
On 08/09/16 13:11, Wei Liu wrote: > On Thu, Sep 08, 2016 at 01:01:40PM +0200, Dario Faggioli wrote: >> On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote: >>> On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote: >>>> >>>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c >>>> index 32a300f..593541a 100644 >>>> --- a/xen/common/schedule.c >>>> +++ b/xen/common/schedule.c >>>> @@ -1376,6 +1376,11 @@ static void schedule(void) >>>> >>>> next = next_slice.task; >>>> >>>> + /* Set already_scheduled to 1 when this domain gets scheduled >>>> for the >>>> + * first time */ >>>> + if ( next->domain->already_scheduled == 0 ) >>>> + next->domain->already_scheduled = 1; >>>> + >>> Can be simplified by omitting the "if" altogether. >>> >> Are you sure? I mean looking at the cases when the flag is already true >> (which means, during the life of a domain, basically **always** except >> a handful of instances after creation), what costs less, a check that >> is always false, or a write that is always updating a value with its >> current value? > > Omitting the check certain results in less instructions. And it would > probably eliminate misses in instruction cache and branch prediction > logic in the processor. The first scheduling is done via unpausing the domain. Why not setting the flag to true in that path? Juergen
Wei Liu writes ("Re: [PATCH v3 1/1] xen: move TLB-flush filtering out into populate_physmap during vm creation"): > On Thu, Sep 08, 2016 at 01:01:40PM +0200, Dario Faggioli wrote: > > On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote: > > > On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote: > > > > + if ( next->domain->already_scheduled == 0 ) > > > > + next->domain->already_scheduled = 1; > > > > + > > > Can be simplified by omitting the "if" altogether. > > > > > Are you sure? I mean looking at the cases when the flag is already true > > (which means, during the life of a domain, basically **always** except > > a handful of instances after creation), what costs less, a check that > > is always false, or a write that is always updating a value with its > > current value? > > Omitting the check certain results in less instructions. And it would > probably eliminate misses in instruction cache and branch prediction > logic in the processor. > > In the grand scheme of things, this is a rather minor optimisation, so I > wouldn't argue strongly for this. Are we sure we ought to be discussing this in terms of optimisation ? I doubt it makes any significant difference either way. But there is a difference in clarity. I would not normally expect to see this: bool x; ... if (!x) x = 1; If I saw that I would wonder if the programmer was confused, or whether I was missing something. Looking at it without the benefit of the definition of x, it looks more like x might be a non-boolean type. Thanks, Ian.
On Thu, 2016-09-08 at 13:19 +0200, Juergen Gross wrote: > The first scheduling is done via unpausing the domain. Why not > setting > the flag to true in that path? > That could be a good idea. And in general, I'm all for finding a place and/or a state that better represents the condition of "setting to run this vcpu for the first time" and set the flag there, rather than re-assigning 1 to an "already_scheduled" flag each an every time we go through schedule() (and not for performance and optimization reason). Which domain_unpause() (or whatever) do you have in mind precisely? Thanks and Regards, Dario
On 08/09/16 15:49, Dario Faggioli wrote: > On Thu, 2016-09-08 at 13:19 +0200, Juergen Gross wrote: >> The first scheduling is done via unpausing the domain. Why not >> setting >> the flag to true in that path? >> > That could be a good idea. > > And in general, I'm all for finding a place and/or a state that better > represents the condition of "setting to run this vcpu for the first > time" and set the flag there, rather than re-assigning 1 to an > "already_scheduled" flag each an every time we go through schedule() > (and not for performance and optimization reason). > > Which domain_unpause() (or whatever) do you have in mind precisely? Domains are created with a single systemcontroller pause refcount, which must be undone by the use of XEN_DOMCTL_unpausedomain when construction is complete. That would be the appropriate place to set such a variable. ~Andrew
>>> On 08.09.16 at 07:30, <dongli.zhang@oracle.com> wrote: > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -474,6 +474,9 @@ struct domain > unsigned int guest_request_enabled : 1; > unsigned int guest_request_sync : 1; > } monitor; > + > + /* set to 1 the first time this domain gets scheduled. */ > + bool_t already_scheduled; Did you go through and check that there is nothing this information can already get derived from? I can't immediately point you at anything, but it feels like there should. And if indeed there isn't, then - to extend on someone else's comments (I think it was Dario) - please use plain bool in new additions. Jan
On Thu, 2016-09-08 at 09:36 -0600, Jan Beulich wrote: > > > > > > > > > > > > > On 08.09.16 at 07:30, <dongli.zhang@oracle.com> wrote: > > --- a/xen/include/xen/sched.h > > +++ b/xen/include/xen/sched.h > > @@ -474,6 +474,9 @@ struct domain > > unsigned int guest_request_enabled : 1; > > unsigned int guest_request_sync : 1; > > } monitor; > > + > > + /* set to 1 the first time this domain gets scheduled. */ > > + bool_t already_scheduled; > Did you go through and check that there is nothing this information > can already get derived from? I can't immediately point you at > anything, but it feels like there should. > Indeed. And if there isn't and we need to do add our own flagging, isn't there a better way and place where to put it (e.g., what Juergen and Andrew are hinting at)? > And if indeed there isn't, > then - to extend on someone else's comments (I think it was Dario) > - please use plain bool in new additions. > It's Wei that commented about bool-s use in the patch. :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
diff --git a/xen/common/domain.c b/xen/common/domain.c index a8804e4..611a471 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -303,6 +303,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags, if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) ) goto fail; + d->already_scheduled = 0; + if ( domcr_flags & DOMCRF_hvm ) d->guest_type = guest_type_hvm; else if ( domcr_flags & DOMCRF_pvh ) diff --git a/xen/common/memory.c b/xen/common/memory.c index f34dd56..3641469 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a) unsigned int i, j; xen_pfn_t gpfn, mfn; struct domain *d = a->domain, *curr_d = current->domain; + bool_t need_tlbflush = 0; + uint32_t tlbflush_timestamp = 0; if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done, a->nr_extents-1) ) @@ -150,6 +152,12 @@ static void populate_physmap(struct memop_args *a) max_order(curr_d)) ) return; + /* MEMF_no_tlbflush can be set only during vm creation phase when + * already_scheduled is still 0 before this domain gets scheduled for + * the first time. */ + if ( d->already_scheduled == 0 ) + a->memflags |= MEMF_no_tlbflush; + for ( i = a->nr_done; i < a->nr_extents; i++ ) { if ( i != a->nr_done && hypercall_preempt_check() ) @@ -214,6 +222,21 @@ static void populate_physmap(struct memop_args *a) goto out; } + if ( d->already_scheduled == 0 ) + { + for ( j = 0; j < (1U << a->extent_order); j++ ) + { + if ( page[j].u.free.need_tlbflush && + (page[j].tlbflush_timestamp <= tlbflush_current_time()) && + (!need_tlbflush || + (page[j].tlbflush_timestamp > tlbflush_timestamp)) ) + { + need_tlbflush = 1; + tlbflush_timestamp = page[j].tlbflush_timestamp; + } + } + } + mfn = page_to_mfn(page); } @@ -232,6 +255,16 @@ static void populate_physmap(struct memop_args *a) } out: + if ( need_tlbflush ) + { + cpumask_t mask = cpu_online_map; + tlbflush_filter(mask, tlbflush_timestamp); + if ( !cpumask_empty(&mask) ) + { + perfc_incr(need_flush_tlb_flush); + flush_tlb_mask(&mask); + } + } a->nr_done = i; } diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 18ff6cf..e0283fc 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages( BUG_ON(pg[i].count_info != PGC_state_free); pg[i].count_info = PGC_state_inuse; - if ( pg[i].u.free.need_tlbflush && + if ( !(memflags & MEMF_no_tlbflush) && + pg[i].u.free.need_tlbflush && (pg[i].tlbflush_timestamp <= tlbflush_current_time()) && (!need_tlbflush || (pg[i].tlbflush_timestamp > tlbflush_timestamp)) ) diff --git a/xen/common/schedule.c b/xen/common/schedule.c index 32a300f..593541a 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -1376,6 +1376,11 @@ static void schedule(void) next = next_slice.task; + /* Set already_scheduled to 1 when this domain gets scheduled for the + * first time */ + if ( next->domain->already_scheduled == 0 ) + next->domain->already_scheduled = 1; + sd->curr = next; if ( next_slice.time >= 0 ) /* -ve means no limit */ diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h index 58bc0b8..880ca88 100644 --- a/xen/include/xen/mm.h +++ b/xen/include/xen/mm.h @@ -221,6 +221,8 @@ struct npfec { #define MEMF_exact_node (1U<<_MEMF_exact_node) #define _MEMF_no_owner 5 #define MEMF_no_owner (1U<<_MEMF_no_owner) +#define _MEMF_no_tlbflush 6 +#define MEMF_no_tlbflush (1U<<_MEMF_no_tlbflush) #define _MEMF_node 8 #define MEMF_node_mask ((1U << (8 * sizeof(nodeid_t))) - 1) #define MEMF_node(n) ((((n) + 1) & MEMF_node_mask) << _MEMF_node) diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 2f9c15f..cbd8329 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -474,6 +474,9 @@ struct domain unsigned int guest_request_enabled : 1; unsigned int guest_request_sync : 1; } monitor; + + /* set to 1 the first time this domain gets scheduled. */ + bool_t already_scheduled; }; /* Protect updates/reads (resp.) of domain_list and domain_hash. */
This patch implemented parts of TODO left in commit id a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moved TLB-flush filtering out into populate_physmap. Because of TLB-flush in alloc_heap_pages, it's very slow to create a guest with memory size of more than 100GB on host with 100+ cpus. This patch introduced a "MEMF_no_tlbflush" bit to memflags to indicate whether TLB-flush should be done in alloc_heap_pages or its caller populate_physmap. Once this bit is set in memflags, alloc_heap_pages will ignore TLB-flush. To use this bit after vm is created might lead to security issue, that is, this would make pages accessible to the guest B, when guest A may still have a cached mapping to them. Therefore, this patch also introduced a "already_scheduled" field to struct domain to indicate whether this domain has ever got scheduled by hypervisor. MEMF_no_tlbflush can be set only during vm creation phase when already_scheduled is still 0 before this domain gets scheduled for the first time. TODO: ballooning very huge amount of memory cannot benefit from this patch and might still be slow. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> --- Changed since v2: * Limit this optimization to domain creation time. --- xen/common/domain.c | 2 ++ xen/common/memory.c | 33 +++++++++++++++++++++++++++++++++ xen/common/page_alloc.c | 3 ++- xen/common/schedule.c | 5 +++++ xen/include/xen/mm.h | 2 ++ xen/include/xen/sched.h | 3 +++ 6 files changed, 47 insertions(+), 1 deletion(-)