Message ID | 20210130025852.12430-9-andrew.cooper3@citrix.com (mailing list archive)
---|---
State | Superseded
Series | acquire_resource size and external IPT monitoring
On Sat, Jan 30, 2021 at 02:58:44AM +0000, Andrew Cooper wrote:
> From: Michał Leszczyński <michal.leszczynski@cert.pl>
>
> To use vmtrace, buffers of a suitable size need allocating, and different
> tasks will want different sizes.
>
> Add a domain creation parameter, and audit it appropriately in the
> {arch_,}sanitise_domain_config() functions.
>
> For now, the x86 specific auditing is tuned to Processor Trace running in
> Single Output mode, which requires a single contiguous range of memory.
>
> The size is given an arbitrary limit of 64M which is expected to be enough for
> anticipated usecases, but not large enough to get into long-running-hypercall
> problems.
>
> Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index d1e94d88cf..491b32812e 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
>      v->vcpu_info_mfn = INVALID_MFN;
>  }
>
> +static void vmtrace_free_buffer(struct vcpu *v)
> +{
> +    const struct domain *d = v->domain;
> +    struct page_info *pg = v->vmtrace.pg;
> +    unsigned int i;
> +
> +    if ( !pg )
> +        return;
> +
> +    v->vmtrace.pg = NULL;
> +
> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
> +    {
> +        put_page_alloc_ref(&pg[i]);
> +        put_page_and_type(&pg[i]);
> +    }
> +}
> +
> +static int vmtrace_alloc_buffer(struct vcpu *v)

You might as well make this return true/false, as the error code is
ignored by the caller (at least in this patch).

Thanks, Roger.
On 01/02/2021 10:51, Roger Pau Monné wrote:
> On Sat, Jan 30, 2021 at 02:58:44AM +0000, Andrew Cooper wrote:
>> From: Michał Leszczyński <michal.leszczynski@cert.pl>
>>
>> To use vmtrace, buffers of a suitable size need allocating, and different
>> tasks will want different sizes.
>>
>> Add a domain creation parameter, and audit it appropriately in the
>> {arch_,}sanitise_domain_config() functions.
>>
>> For now, the x86 specific auditing is tuned to Processor Trace running in
>> Single Output mode, which requires a single contiguous range of memory.
>>
>> The size is given an arbitrary limit of 64M which is expected to be enough for
>> anticipated usecases, but not large enough to get into long-running-hypercall
>> problems.
>>
>> Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index d1e94d88cf..491b32812e 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
>>      v->vcpu_info_mfn = INVALID_MFN;
>>  }
>>
>> +static void vmtrace_free_buffer(struct vcpu *v)
>> +{
>> +    const struct domain *d = v->domain;
>> +    struct page_info *pg = v->vmtrace.pg;
>> +    unsigned int i;
>> +
>> +    if ( !pg )
>> +        return;
>> +
>> +    v->vmtrace.pg = NULL;
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +    {
>> +        put_page_alloc_ref(&pg[i]);
>> +        put_page_and_type(&pg[i]);
>> +    }
>> +}
>> +
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
> You might as well make this return true/false, as the error code is
> ignored by the caller (at least in this patch).

At some point vcpu_create() needs to be fixed not to lose the error code.
That is the real bug here.

~Andrew
On 30.01.2021 03:58, Andrew Cooper wrote:
> +static int vmtrace_alloc_buffer(struct vcpu *v)
> +{
> +    struct domain *d = v->domain;
> +    struct page_info *pg;
> +    unsigned int i;
> +
> +    if ( !d->vmtrace_size )
> +        return 0;
> +
> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
> +                             MEMF_no_refcount);
> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    /*
> +     * Getting the reference counting correct here is hard.
> +     *
> +     * All pages are now on the domlist.  They, or subranges within, will be

"domlist" is too imprecise, as there's no list with this name. It's
extra_page_list in this case (see also below).

> +     * freed when their reference count drops to zero, which may any time
> +     * between now and the domain teardown path.
> +     */
> +
> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
> +            goto refcnt_err;
> +
> +    /*
> +     * We must only let vmtrace_free_buffer() take any action in the success
> +     * case when we've taken all the refs it intends to drop.
> +     */
> +    v->vmtrace.pg = pg;
> +
> +    return 0;
> +
> + refcnt_err:
> +    /*
> +     * In the failure case, we must drop all the acquired typerefs thus far,
> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
> +     * drop the alloc refs on any remaining pages - some pages could already
> +     * have been freed behind our backs.
> +     */
> +    while ( i-- )
> +        put_page_and_type(&pg[i]);
> +
> +    return -ENODATA;
> +}

As said in reply on the other thread, PGC_extra pages don't get
freed automatically. I too initially thought they would, but
(re-)learned otherwise when trying to repro your claims on that
other thread. For all pages you've managed to get the writable
ref, freeing is easily done by prefixing the loop body above by
put_page_alloc_ref(). For all other pages best you can do (I
think; see the debugging patches I had sent on that other
thread) is to try get_page() - if it succeeds, calling
put_page_alloc_ref() is allowed. Otherwise we can only leak the
respective page (unless going to further extents with trying to
recover from the "impossible"), or assume the failure here was
because it did get freed already.

Jan
On 01/02/2021 13:18, Jan Beulich wrote:
> On 30.01.2021 03:58, Andrew Cooper wrote:
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>> +{
>> +    struct domain *d = v->domain;
>> +    struct page_info *pg;
>> +    unsigned int i;
>> +
>> +    if ( !d->vmtrace_size )
>> +        return 0;
>> +
>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>> +                             MEMF_no_refcount);
>> +    if ( !pg )
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Getting the reference counting correct here is hard.
>> +     *
>> +     * All pages are now on the domlist.  They, or subranges within, will be
> "domlist" is too imprecise, as there's no list with this name. It's
> extra_page_list in this case (see also below).
>
>> +     * freed when their reference count drops to zero, which may any time
>> +     * between now and the domain teardown path.
>> +     */
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>> +            goto refcnt_err;
>> +
>> +    /*
>> +     * We must only let vmtrace_free_buffer() take any action in the success
>> +     * case when we've taken all the refs it intends to drop.
>> +     */
>> +    v->vmtrace.pg = pg;
>> +
>> +    return 0;
>> +
>> + refcnt_err:
>> +    /*
>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>> +     * drop the alloc refs on any remaining pages - some pages could already
>> +     * have been freed behind our backs.
>> +     */
>> +    while ( i-- )
>> +        put_page_and_type(&pg[i]);
>> +
>> +    return -ENODATA;
>> +}
> As said in reply on the other thread, PGC_extra pages don't get
> freed automatically. I too initially thought they would, but
> (re-)learned otherwise when trying to repro your claims on that
> other thread. For all pages you've managed to get the writable
> ref, freeing is easily done by prefixing the loop body above by
> put_page_alloc_ref(). For all other pages best you can do (I
> think; see the debugging patches I had sent on that other
> thread) is to try get_page() - if it succeeds, calling
> put_page_alloc_ref() is allowed. Otherwise we can only leak the
> respective page (unless going to further extents with trying to
> recover from the "impossible"), or assume the failure here was
> because it did get freed already.

Right - I'm going to insist on breaking apart orthogonal issues.

This refcounting issue isn't introduced by this series - this series
uses an established pattern, in which we've found a corner case.

The corner case is theoretical, not practical - it is not possible for a
malicious PV domain to take 2^43 refs on any of the pages in this
allocation.  Doing so would require an hours-long SMI, or equivalent,
and even then all malicious activity would be paused after 1s for the
time calibration rendezvous which would livelock the system until the
watchdog kicked in.

I will drop the comments, because in light of this discovery, they're
not correct.

We should fix the corner case, but that should be a separate patch.
Whatever we do needs to start by writing down the refcounting rules
first, because it's totally clear that no one understands them, and then
adjusting all examples of this pattern (if necessary).

~Andrew
On 01.02.2021 15:22, Andrew Cooper wrote:
> On 01/02/2021 13:18, Jan Beulich wrote:
>> On 30.01.2021 03:58, Andrew Cooper wrote:
>>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>>> +{
>>> +    struct domain *d = v->domain;
>>> +    struct page_info *pg;
>>> +    unsigned int i;
>>> +
>>> +    if ( !d->vmtrace_size )
>>> +        return 0;
>>> +
>>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>>> +                             MEMF_no_refcount);
>>> +    if ( !pg )
>>> +        return -ENOMEM;
>>> +
>>> +    /*
>>> +     * Getting the reference counting correct here is hard.
>>> +     *
>>> +     * All pages are now on the domlist.  They, or subranges within, will be
>> "domlist" is too imprecise, as there's no list with this name. It's
>> extra_page_list in this case (see also below).
>>
>>> +     * freed when their reference count drops to zero, which may any time
>>> +     * between now and the domain teardown path.
>>> +     */
>>> +
>>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>>> +            goto refcnt_err;
>>> +
>>> +    /*
>>> +     * We must only let vmtrace_free_buffer() take any action in the success
>>> +     * case when we've taken all the refs it intends to drop.
>>> +     */
>>> +    v->vmtrace.pg = pg;
>>> +
>>> +    return 0;
>>> +
>>> + refcnt_err:
>>> +    /*
>>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>>> +     * drop the alloc refs on any remaining pages - some pages could already
>>> +     * have been freed behind our backs.
>>> +     */
>>> +    while ( i-- )
>>> +        put_page_and_type(&pg[i]);
>>> +
>>> +    return -ENODATA;
>>> +}
>> As said in reply on the other thread, PGC_extra pages don't get
>> freed automatically. I too initially thought they would, but
>> (re-)learned otherwise when trying to repro your claims on that
>> other thread. For all pages you've managed to get the writable
>> ref, freeing is easily done by prefixing the loop body above by
>> put_page_alloc_ref(). For all other pages best you can do (I
>> think; see the debugging patches I had sent on that other
>> thread) is to try get_page() - if it succeeds, calling
>> put_page_alloc_ref() is allowed. Otherwise we can only leak the
>> respective page (unless going to further extents with trying to
>> recover from the "impossible"), or assume the failure here was
>> because it did get freed already.
>
> Right - I'm going to insist on breaking apart orthogonal issues.
>
> This refcounting issue isn't introduced by this series - this series
> uses an established pattern, in which we've found a corner case.
>
> The corner case is theoretical, not practical - it is not possible for a
> malicious PV domain to take 2^43 refs on any of the pages in this
> allocation. Doing so would require an hours-long SMI, or equivalent,
> and even then all malicious activity would be paused after 1s for the
> time calibration rendezvous which would livelock the system until the
> watchdog kicked in.

Actually an overflow is only one of the possible reasons here.
Another, which may be more "practical", is that another entity
has already managed to free the page (by dropping its alloc-ref,
and of course implying it did guess at the MFN).

Jan
On 01/02/2021 14:36, Jan Beulich wrote:
> On 01.02.2021 15:22, Andrew Cooper wrote:
>> On 01/02/2021 13:18, Jan Beulich wrote:
>>> On 30.01.2021 03:58, Andrew Cooper wrote:
>>>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>>>> +{
>>>> +    struct domain *d = v->domain;
>>>> +    struct page_info *pg;
>>>> +    unsigned int i;
>>>> +
>>>> +    if ( !d->vmtrace_size )
>>>> +        return 0;
>>>> +
>>>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>>>> +                             MEMF_no_refcount);
>>>> +    if ( !pg )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /*
>>>> +     * Getting the reference counting correct here is hard.
>>>> +     *
>>>> +     * All pages are now on the domlist.  They, or subranges within, will be
>>> "domlist" is too imprecise, as there's no list with this name. It's
>>> extra_page_list in this case (see also below).
>>>
>>>> +     * freed when their reference count drops to zero, which may any time
>>>> +     * between now and the domain teardown path.
>>>> +     */
>>>> +
>>>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>>>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>>>> +            goto refcnt_err;
>>>> +
>>>> +    /*
>>>> +     * We must only let vmtrace_free_buffer() take any action in the success
>>>> +     * case when we've taken all the refs it intends to drop.
>>>> +     */
>>>> +    v->vmtrace.pg = pg;
>>>> +
>>>> +    return 0;
>>>> +
>>>> + refcnt_err:
>>>> +    /*
>>>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>>>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>>>> +     * drop the alloc refs on any remaining pages - some pages could already
>>>> +     * have been freed behind our backs.
>>>> +     */
>>>> +    while ( i-- )
>>>> +        put_page_and_type(&pg[i]);
>>>> +
>>>> +    return -ENODATA;
>>>> +}
>>> As said in reply on the other thread, PGC_extra pages don't get
>>> freed automatically. I too initially thought they would, but
>>> (re-)learned otherwise when trying to repro your claims on that
>>> other thread. For all pages you've managed to get the writable
>>> ref, freeing is easily done by prefixing the loop body above by
>>> put_page_alloc_ref(). For all other pages best you can do (I
>>> think; see the debugging patches I had sent on that other
>>> thread) is to try get_page() - if it succeeds, calling
>>> put_page_alloc_ref() is allowed. Otherwise we can only leak the
>>> respective page (unless going to further extents with trying to
>>> recover from the "impossible"), or assume the failure here was
>>> because it did get freed already.
>> Right - I'm going to insist on breaking apart orthogonal issues.
>>
>> This refcounting issue isn't introduced by this series - this series
>> uses an established pattern, in which we've found a corner case.
>>
>> The corner case is theoretical, not practical - it is not possible for a
>> malicious PV domain to take 2^43 refs on any of the pages in this
>> allocation. Doing so would require an hours-long SMI, or equivalent,
>> and even then all malicious activity would be paused after 1s for the
>> time calibration rendezvous which would livelock the system until the
>> watchdog kicked in.
> Actually an overflow is only one of the possible reasons here.
> Another, which may be more "practical", is that another entity
> has already managed to free the page (by dropping its alloc-ref,
> and of course implying it did guess at the MFN).

Yes, but in this case it did get dropped from the extra page list, in
which case looping over the remaining ones in relinquish_resource would
be safe.

~Andrew
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b9ba04633e..6c7ee25f3b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -660,6 +660,29 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
             return -EINVAL;
     }
 
+    if ( config->vmtrace_size )
+    {
+        unsigned int size = config->vmtrace_size;
+
+        ASSERT(vmtrace_available); /* Checked by common code. */
+
+        /*
+         * For now, vmtrace is restricted to HVM guests, and using a
+         * power-of-2 buffer between 4k and 64M in size.
+         */
+        if ( !hvm )
+        {
+            dprintk(XENLOG_INFO, "vmtrace not supported for PV\n");
+            return -EINVAL;
+        }
+
+        if ( size < PAGE_SIZE || size > MB(64) || (size & (size - 1)) )
+        {
+            dprintk(XENLOG_INFO, "Unsupported vmtrace size: %#x\n", size);
+            return -EINVAL;
+        }
+    }
+
     return 0;
 }
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index d1e94d88cf..491b32812e 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
     v->vcpu_info_mfn = INVALID_MFN;
 }
 
+static void vmtrace_free_buffer(struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    struct page_info *pg = v->vmtrace.pg;
+    unsigned int i;
+
+    if ( !pg )
+        return;
+
+    v->vmtrace.pg = NULL;
+
+    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
+    {
+        put_page_alloc_ref(&pg[i]);
+        put_page_and_type(&pg[i]);
+    }
+}
+
+static int vmtrace_alloc_buffer(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    struct page_info *pg;
+    unsigned int i;
+
+    if ( !d->vmtrace_size )
+        return 0;
+
+    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
+                             MEMF_no_refcount);
+    if ( !pg )
+        return -ENOMEM;
+
+    /*
+     * Getting the reference counting correct here is hard.
+     *
+     * All pages are now on the domlist.  They, or subranges within, will be
+     * freed when their reference count drops to zero, which may any time
+     * between now and the domain teardown path.
+     */
+
+    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
+        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
+            goto refcnt_err;
+
+    /*
+     * We must only let vmtrace_free_buffer() take any action in the success
+     * case when we've taken all the refs it intends to drop.
+     */
+    v->vmtrace.pg = pg;
+
+    return 0;
+
+ refcnt_err:
+    /*
+     * In the failure case, we must drop all the acquired typerefs thus far,
+     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
+     * drop the alloc refs on any remaining pages - some pages could already
+     * have been freed behind our backs.
+     */
+    while ( i-- )
+        put_page_and_type(&pg[i]);
+
+    return -ENODATA;
+}
+
 /*
  * Release resources held by a vcpu.  There may or may not be live references
  * to the vcpu, and it may or may not be fully constructed.
@@ -140,6 +205,8 @@ static void vcpu_info_reset(struct vcpu *v)
  */
 static int vcpu_teardown(struct vcpu *v)
 {
+    vmtrace_free_buffer(v);
+
     return 0;
 }
 
@@ -201,6 +268,9 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
     if ( sched_init_vcpu(v) != 0 )
         goto fail_wq;
 
+    if ( vmtrace_alloc_buffer(v) != 0 )
+        goto fail_wq;
+
     if ( arch_vcpu_create(v) != 0 )
         goto fail_sched;
 
@@ -449,6 +519,12 @@ static int sanitise_domain_config(struct xen_domctl_createdomain *config)
         }
     }
 
+    if ( config->vmtrace_size && !vmtrace_available )
+    {
+        dprintk(XENLOG_INFO, "vmtrace requested but not available\n");
+        return -EINVAL;
+    }
+
     return arch_sanitise_domain_config(config);
 }
 
@@ -474,7 +550,10 @@ struct domain *domain_create(domid_t domid,
     ASSERT(is_system_domain(d) ? config == NULL : config != NULL);
 
     if ( config )
+    {
         d->options = config->flags;
+        d->vmtrace_size = config->vmtrace_size;
+    }
 
     /* Sort out our idea of is_control_domain(). */
     d->is_privileged = is_priv;
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 666aeb71bf..88a5b1ef5d 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -95,6 +95,9 @@ struct xen_domctl_createdomain {
     int32_t max_grant_frames;
     int32_t max_maptrack_frames;
 
+    /* Per-vCPU buffer size in bytes.  0 to disable. */
+    uint32_t vmtrace_size;
+
     struct xen_arch_domainconfig arch;
 };
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 06dba1a397..bc78a09a53 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -272,6 +272,10 @@ struct vcpu
     /* vPCI per-vCPU area, used to store data for long running operations. */
     struct vpci_vcpu vpci;
 
+    struct {
+        struct page_info *pg; /* One contiguous allocation of d->vmtrace_size */
+    } vmtrace;
+
     struct arch_vcpu arch;
 
 #ifdef CONFIG_IOREQ_SERVER
@@ -547,6 +551,8 @@ struct domain
         unsigned int guest_request_sync          : 1;
     } monitor;
 
+    unsigned int vmtrace_size; /* Buffer size in bytes, or 0 to disable. */
+
 #ifdef CONFIG_ARGO
     /* Argo interdomain communication support */
     struct argo_domain *argo;