Message ID | 1473151360-4758-1-git-send-email-dongli.zhang@oracle.com (mailing list archive) |
---|---|
State | New, archived |
On 06/09/16 09:42, Dongli Zhang wrote:
> This patch implements part of the TODO left in commit
> a902c12ee45fc9389eb8fe54eeddaf267a555c58: it moves TLB-flush filtering
> out into populate_physmap.
>
> Because of the TLB flush in alloc_heap_pages, it is very slow to create
> a guest with more than 100GB of memory on a host with 100+ CPUs.
>
> This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
> whether the TLB flush should be done in alloc_heap_pages or by its
> caller populate_physmap. Once this bit is set in memflags,
> alloc_heap_pages skips the TLB flush.

This makes pages accessible to guest B while guest A may still have a
cached mapping to them.

I think it is only safe to do this when guest B is being constructed.

David

> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  xen/common/memory.c     | 26 ++++++++++++++++++++++++++
>  xen/common/page_alloc.c |  3 ++-
>  xen/include/xen/mm.h    |  2 ++
>  3 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index f34dd56..50c1a07 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
>      unsigned int i, j;
>      xen_pfn_t gpfn, mfn;
>      struct domain *d = a->domain, *curr_d = current->domain;
> +    bool_t need_tlbflush = 0;
> +    uint32_t tlbflush_timestamp = 0;
>
>      if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
>                                       a->nr_extents-1) )
> @@ -150,6 +152,8 @@ static void populate_physmap(struct memop_args *a)
>                              max_order(curr_d)) )
>          return;
>
> +    a->memflags |= MEMF_no_tlbflush;
> +
>      for ( i = a->nr_done; i < a->nr_extents; i++ )
>      {
>          if ( i != a->nr_done && hypercall_preempt_check() )
> @@ -214,6 +218,18 @@ static void populate_physmap(struct memop_args *a)
>                  goto out;
>              }
>
> +            for ( j = 0; j < (1U << a->extent_order); j++ )
> +            {
> +                if ( page[j].u.free.need_tlbflush &&
> +                     (page[j].tlbflush_timestamp <= tlbflush_current_time()) &&
> +                     (!need_tlbflush ||
> +                      (page[j].tlbflush_timestamp > tlbflush_timestamp)) )
> +                {
> +                    need_tlbflush = 1;
> +                    tlbflush_timestamp = page[j].tlbflush_timestamp;
> +                }
> +            }
> +
>              mfn = page_to_mfn(page);
>          }
>
> @@ -232,6 +248,16 @@ static void populate_physmap(struct memop_args *a)
>      }
>
>  out:
> +    if ( need_tlbflush )
> +    {
> +        cpumask_t mask = cpu_online_map;
> +        tlbflush_filter(mask, tlbflush_timestamp);
> +        if ( !cpumask_empty(&mask) )
> +        {
> +            perfc_incr(need_flush_tlb_flush);
> +            flush_tlb_mask(&mask);
> +        }
> +    }
>      a->nr_done = i;
>  }
>
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 18ff6cf..e0283fc 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages(
>          BUG_ON(pg[i].count_info != PGC_state_free);
>          pg[i].count_info = PGC_state_inuse;
>
> -        if ( pg[i].u.free.need_tlbflush &&
> +        if ( !(memflags & MEMF_no_tlbflush) &&
> +             pg[i].u.free.need_tlbflush &&
>               (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
>               (!need_tlbflush ||
>                (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index 58bc0b8..880ca88 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -221,6 +221,8 @@ struct npfec {
>  #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
>  #define _MEMF_no_owner    5
>  #define  MEMF_no_owner    (1U<<_MEMF_no_owner)
> +#define _MEMF_no_tlbflush 6
> +#define  MEMF_no_tlbflush (1U<<_MEMF_no_tlbflush)
>  #define _MEMF_node        8
>  #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
>  #define  MEMF_node(n)     ((((n) + 1) & MEMF_node_mask) << _MEMF_node)
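The contract implied by the new flag is worth spelling out: whichever caller sets MEMF_no_tlbflush takes over responsibility for the flush, and must perform it before the pages can be reached by their new owner. Below is a minimal sketch of such a caller; alloc_deferred_flush() is a hypothetical helper written for illustration (it is not part of the patch) and simply mirrors the accumulate-then-flush steps populate_physmap now performs.

```c
/*
 * Hypothetical illustration of a caller honouring the MEMF_no_tlbflush
 * contract; not code from the patch.
 */
static struct page_info *alloc_deferred_flush(struct domain *d,
                                              unsigned int order)
{
    struct page_info *pg = alloc_domheap_pages(d, order, MEMF_no_tlbflush);
    bool_t need_tlbflush = 0;
    uint32_t tlbflush_timestamp = 0;
    unsigned int i;

    if ( pg == NULL )
        return NULL;

    /* Track the newest "freed at" timestamp among the allocated pages. */
    for ( i = 0; i < (1U << order); i++ )
    {
        if ( pg[i].u.free.need_tlbflush &&
             (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
             (!need_tlbflush ||
              (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
        {
            need_tlbflush = 1;
            tlbflush_timestamp = pg[i].tlbflush_timestamp;
        }
    }

    /*
     * Flush only the CPUs that have not flushed since that timestamp.
     * This must happen before the pages become visible to the guest.
     */
    if ( need_tlbflush )
    {
        cpumask_t mask = cpu_online_map;

        tlbflush_filter(mask, tlbflush_timestamp);
        if ( !cpumask_empty(&mask) )
            flush_tlb_mask(&mask);
    }

    return pg;
}
```

Accumulating the newest timestamp rather than flushing per page is what lets tlbflush_filter() drop every CPU that has already flushed since the pages were freed, which is the whole point of deferring the flush.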
On 06/09/16 10:39, David Vrabel wrote:
> On 06/09/16 09:42, Dongli Zhang wrote:
>> This patch implements part of the TODO left in commit
>> a902c12ee45fc9389eb8fe54eeddaf267a555c58: it moves TLB-flush filtering
>> out into populate_physmap.
>>
>> Because of the TLB flush in alloc_heap_pages, it is very slow to create
>> a guest with more than 100GB of memory on a host with 100+ CPUs.
>>
>> This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
>> whether the TLB flush should be done in alloc_heap_pages or by its
>> caller populate_physmap. Once this bit is set in memflags,
>> alloc_heap_pages skips the TLB flush.
>
> This makes pages accessible to guest B while guest A may still have a
> cached mapping to them.
>
> I think it is only safe to do this when guest B is being constructed.

Only before the populate_physmap hypercall returns, right? In order to
run afoul of this, a second vcpu on the guest would have to map and then
put something secret / sensitive in the memory (such as a private key
which guest A could then read, or code it intended to execute that guest
A could then modify) before the hypercall finished. That seems like
something we should be able to document as undefined / unsafe behavior
if we want.

OTOH, unless we care about the ability to balloon up and down by
hundreds of gigabytes at a time, it probably wouldn't hurt to limit the
optimization to domain build time only.

-George
On 06/09/16 10:52, George Dunlap wrote:
> On 06/09/16 10:39, David Vrabel wrote:
>> On 06/09/16 09:42, Dongli Zhang wrote:
>>> This patch implements part of the TODO left in commit
>>> a902c12ee45fc9389eb8fe54eeddaf267a555c58: it moves TLB-flush filtering
>>> out into populate_physmap.
>>>
>>> Because of the TLB flush in alloc_heap_pages, it is very slow to create
>>> a guest with more than 100GB of memory on a host with 100+ CPUs.
>>>
>>> This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
>>> whether the TLB flush should be done in alloc_heap_pages or by its
>>> caller populate_physmap. Once this bit is set in memflags,
>>> alloc_heap_pages skips the TLB flush.
>>
>> This makes pages accessible to guest B while guest A may still have a
>> cached mapping to them.
>>
>> I think it is only safe to do this when guest B is being constructed.
>
> Only before the populate_physmap hypercall returns, right? In order to
> run afoul of this, a second vcpu on the guest would have to map and then
> put something secret / sensitive in the memory (such as a private key
> which guest A could then read, or code it intended to execute that guest
> A could then modify) before the hypercall finished. That seems like
> something we should be able to document as undefined / unsafe behavior
> if we want.

I think populate_physmap can replace existing p2m entries, and thus
guest B may already have mappings.

David
On 06/09/16 10:55, David Vrabel wrote:
> On 06/09/16 10:52, George Dunlap wrote:
>> On 06/09/16 10:39, David Vrabel wrote:
>>> On 06/09/16 09:42, Dongli Zhang wrote:
>>>> This patch implements part of the TODO left in commit
>>>> a902c12ee45fc9389eb8fe54eeddaf267a555c58: it moves TLB-flush filtering
>>>> out into populate_physmap.
>>>>
>>>> Because of the TLB flush in alloc_heap_pages, it is very slow to create
>>>> a guest with more than 100GB of memory on a host with 100+ CPUs.
>>>>
>>>> This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
>>>> whether the TLB flush should be done in alloc_heap_pages or by its
>>>> caller populate_physmap. Once this bit is set in memflags,
>>>> alloc_heap_pages skips the TLB flush.
>>>
>>> This makes pages accessible to guest B while guest A may still have a
>>> cached mapping to them.
>>>
>>> I think it is only safe to do this when guest B is being constructed.
>>
>> Only before the populate_physmap hypercall returns, right? In order to
>> run afoul of this, a second vcpu on the guest would have to map and then
>> put something secret / sensitive in the memory (such as a private key
>> which guest A could then read, or code it intended to execute that guest
>> A could then modify) before the hypercall finished. That seems like
>> something we should be able to document as undefined / unsafe behavior
>> if we want.
>
> I think populate_physmap can replace existing p2m entries, and thus
> guest B may already have mappings.

Swapping an actively in-use gpfn out from under your feet without any
form of synchronization seems like a pretty mad thing to do to me: you
couldn't tell whether any writes to the page would end up on the old
(soon-to-be-freed) page or on the newly allocated page, and a page you
were executing from might at any moment be replaced with a random page
from Xen.

It used to be the case that pages freed to Xen other than at domain
death weren't scrubbed by Xen; looking briefly through the put_page()
call path, it looks like that's still the case. If so, then in the
scenario above, sensitive information written to the old page before the
swap would be leaked to other guests anyway, and the newly allocated
page might have rogue code written to it by another guest before it was
freed. So the new "window" wouldn't be any less of a security risk than
the old one.

So I think one could still make the argument for documenting that
behavior as undefined or unsafe, if we thought it would have an impact
on the speed of large ballooning operations. But since what Dongli cares
about at the moment is domain creation, it certainly won't hurt to limit
this optimization to domain creation time; then we can discuss enabling
it for ballooning when someone finds it to be an issue.

-George
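A sketch of the restriction being suggested, relative to the code in this patch: "creation_finished" stands in for whatever per-domain flag would mark the end of domain construction (for example, set when the domain is first unpaused); it is a hypothetical field and does not exist in the tree this patch is based on.

```c
    /*
     * In populate_physmap(): defer the TLB flush only while the domain
     * is still being built and therefore cannot yet run or remap pages.
     * d->creation_finished is a hypothetical flag, shown for illustration.
     */
    if ( !d->creation_finished )
        a->memflags |= MEMF_no_tlbflush;
```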
diff --git a/xen/common/memory.c b/xen/common/memory.c
index f34dd56..50c1a07 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
     unsigned int i, j;
     xen_pfn_t gpfn, mfn;
     struct domain *d = a->domain, *curr_d = current->domain;
+    bool_t need_tlbflush = 0;
+    uint32_t tlbflush_timestamp = 0;

     if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
                                      a->nr_extents-1) )
@@ -150,6 +152,8 @@ static void populate_physmap(struct memop_args *a)
                             max_order(curr_d)) )
         return;

+    a->memflags |= MEMF_no_tlbflush;
+
     for ( i = a->nr_done; i < a->nr_extents; i++ )
     {
         if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +218,18 @@ static void populate_physmap(struct memop_args *a)
                 goto out;
             }

+            for ( j = 0; j < (1U << a->extent_order); j++ )
+            {
+                if ( page[j].u.free.need_tlbflush &&
+                     (page[j].tlbflush_timestamp <= tlbflush_current_time()) &&
+                     (!need_tlbflush ||
+                      (page[j].tlbflush_timestamp > tlbflush_timestamp)) )
+                {
+                    need_tlbflush = 1;
+                    tlbflush_timestamp = page[j].tlbflush_timestamp;
+                }
+            }
+
             mfn = page_to_mfn(page);
         }

@@ -232,6 +248,16 @@ static void populate_physmap(struct memop_args *a)
     }

 out:
+    if ( need_tlbflush )
+    {
+        cpumask_t mask = cpu_online_map;
+        tlbflush_filter(mask, tlbflush_timestamp);
+        if ( !cpumask_empty(&mask) )
+        {
+            perfc_incr(need_flush_tlb_flush);
+            flush_tlb_mask(&mask);
+        }
+    }
     a->nr_done = i;
 }

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..e0283fc 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages(
         BUG_ON(pg[i].count_info != PGC_state_free);
         pg[i].count_info = PGC_state_inuse;

-        if ( pg[i].u.free.need_tlbflush &&
+        if ( !(memflags & MEMF_no_tlbflush) &&
+             pg[i].u.free.need_tlbflush &&
              (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
              (!need_tlbflush ||
               (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 58bc0b8..880ca88 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -221,6 +221,8 @@ struct npfec {
 #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
 #define _MEMF_no_owner    5
 #define  MEMF_no_owner    (1U<<_MEMF_no_owner)
+#define _MEMF_no_tlbflush 6
+#define  MEMF_no_tlbflush (1U<<_MEMF_no_tlbflush)
 #define _MEMF_node        8
 #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
 #define  MEMF_node(n)     ((((n) + 1) & MEMF_node_mask) << _MEMF_node)
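As posted, the timestamp filtering is duplicated between alloc_heap_pages() and populate_physmap(). One possible follow-up, sketched here with illustrative names (these helpers are not part of the patch, and their placement in xen/include/xen/mm.h is an assumption), would be to factor both halves into small inline helpers:

```c
/* Illustrative refactoring only; names and placement are assumptions. */
static inline void accumulate_tlbflush(bool_t *need_tlbflush,
                                       const struct page_info *page,
                                       uint32_t *tlbflush_timestamp)
{
    /* Remember the newest stale timestamp seen so far. */
    if ( page->u.free.need_tlbflush &&
         (page->tlbflush_timestamp <= tlbflush_current_time()) &&
         (!*need_tlbflush ||
          (page->tlbflush_timestamp > *tlbflush_timestamp)) )
    {
        *need_tlbflush = 1;
        *tlbflush_timestamp = page->tlbflush_timestamp;
    }
}

static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
{
    cpumask_t mask = cpu_online_map;

    /* Drop CPUs that have already flushed since the recorded timestamp. */
    tlbflush_filter(mask, tlbflush_timestamp);
    if ( !cpumask_empty(&mask) )
    {
        perfc_incr(need_flush_tlb_flush);
        flush_tlb_mask(&mask);
    }
}
```

Both call sites would then reduce to one accumulate_tlbflush() call per page plus a single filtered_flush_tlb_mask() at the end.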
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58: it moves TLB-flush filtering
out into populate_physmap.

Because of the TLB flush in alloc_heap_pages, it is very slow to create
a guest with more than 100GB of memory on a host with 100+ CPUs.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB flush should be done in alloc_heap_pages or by its
caller populate_physmap. Once this bit is set in memflags,
alloc_heap_pages skips the TLB flush.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 xen/common/memory.c     | 26 ++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/include/xen/mm.h    |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)
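For scale, a rough toolstack-side illustration of why guest creation exercises this path so heavily (the loop shape and extent size below are assumptions, not the actual libxc domain-builder code): populating 100GB in 2MB, order-9 extents means about 51200 XENMEM_populate_physmap allocations, each of which could previously end in a flush_tlb_mask() IPI to all 100+ host CPUs inside alloc_heap_pages().

```c
#include <xenctrl.h>

#define EXTENT_ORDER 9                       /* 2MB extents (512 x 4KB pages) */

/* Assumed example: populate a 100GB guest one extent at a time. */
static int populate_100gb(xc_interface *xch, uint32_t domid)
{
    unsigned long nr_extents = (100UL << 30) >> (12 + EXTENT_ORDER); /* 51200 */
    unsigned long i;
    int rc = 0;

    for ( i = 0; i < nr_extents && rc == 0; i++ )
    {
        xen_pfn_t gpfn = i << EXTENT_ORDER;  /* first guest pfn of this extent */

        rc = xc_domain_populate_physmap_exact(xch, domid, 1, EXTENT_ORDER,
                                              0, &gpfn);
    }

    return rc;
}
```

A real domain builder batches many extents per hypercall, but each alloc_heap_pages() call on the hypervisor side still paid for its own flush before this patch.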