
[v8,08/16] xen/domain: Add vmtrace_size domain creation parameter

Message ID 20210130025852.12430-9-andrew.cooper3@citrix.com
State Superseded
Series acquire_resource size and external IPT monitoring

Commit Message

Andrew Cooper Jan. 30, 2021, 2:58 a.m. UTC
From: Michał Leszczyński <michal.leszczynski@cert.pl>

To use vmtrace, buffers of a suitable size need allocating, and different
tasks will want different sizes.

Add a domain creation parameter, and audit it appropriately in the
{arch_,}sanitise_domain_config() functions.

For now, the x86 specific auditing is tuned to Processor Trace running in
Single Output mode, which requires a single contiguous range of memory.

The size is given an arbitrary limit of 64M, which is expected to be enough
for anticipated use cases, but not large enough to get into
long-running-hypercall problems.

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Anthony PERARD <anthony.perard@citrix.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

When support for later generations of IPT gets added, we can in principle
start to use ToPA, which is a scatter list of smaller trace regions, if we
need to massively increase the available buffer size.

v8:
 * Rename vmtrace_frames to vmtrace_size.  Reposition to fill a hole.
 * Rename vmtrace.buf to vmtrace.pg.
 * Rework the refcounting logic and comment it *very* clearly.

v7:
 * Major chop&change within the series.
 * Use the name 'vmtrace' consistently.
 * Use the (new) common vcpu_teardown() functionality, rather than leaving a
   latent memory leak on ARM.
---
 xen/arch/x86/domain.c       | 23 +++++++++++++
 xen/common/domain.c         | 79 +++++++++++++++++++++++++++++++++++++++++++++
 xen/include/public/domctl.h |  3 ++
 xen/include/xen/sched.h     |  6 ++++
 4 files changed, 111 insertions(+)

Comments

Roger Pau Monne Feb. 1, 2021, 10:51 a.m. UTC | #1
On Sat, Jan 30, 2021 at 02:58:44AM +0000, Andrew Cooper wrote:
> From: Michał Leszczyński <michal.leszczynski@cert.pl>
> 
> To use vmtrace, buffers of a suitable size need allocating, and different
> tasks will want different sizes.
> 
> Add a domain creation parameter, and audit it appropriately in the
> {arch_,}sanitise_domain_config() functions.
> 
> For now, the x86 specific auditing is tuned to Processor Trace running in
> Single Output mode, which requires a single contiguous range of memory.
> 
> The size is given an arbitrary limit of 64M, which is expected to be enough for
> anticipated use cases, but not large enough to get into long-running-hypercall
> problems.
> 
> Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index d1e94d88cf..491b32812e 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
>      v->vcpu_info_mfn = INVALID_MFN;
>  }
>  
> +static void vmtrace_free_buffer(struct vcpu *v)
> +{
> +    const struct domain *d = v->domain;
> +    struct page_info *pg = v->vmtrace.pg;
> +    unsigned int i;
> +
> +    if ( !pg )
> +        return;
> +
> +    v->vmtrace.pg = NULL;
> +
> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
> +    {
> +        put_page_alloc_ref(&pg[i]);
> +        put_page_and_type(&pg[i]);
> +    }
> +}
> +
> +static int vmtrace_alloc_buffer(struct vcpu *v)

You might as well make this return true/false, as the error code is
ignored by the caller (at least in this patch).

Thanks, Roger.
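
A minimal sketch of the shape this suggestion implies (hypothetical, and
not necessarily what gets committed - vcpu_create() only checks for
failure, so the error code could be folded into a boolean):

    static bool vmtrace_alloc_buffer(struct vcpu *v)
    {
        struct domain *d = v->domain;

        if ( !d->vmtrace_size )
            return true;

        /* ... allocation and refcounting exactly as in the patch ... */

        return v->vmtrace.pg != NULL;
    }

    /* ... and at the call site in vcpu_create(): */
    if ( !vmtrace_alloc_buffer(v) )
        goto fail_wq;
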
Andrew Cooper Feb. 1, 2021, 11:19 a.m. UTC | #2
On 01/02/2021 10:51, Roger Pau Monné wrote:
> On Sat, Jan 30, 2021 at 02:58:44AM +0000, Andrew Cooper wrote:
>> From: Michał Leszczyński <michal.leszczynski@cert.pl>
>>
>> To use vmtrace, buffers of a suitable size need allocating, and different
>> tasks will want different sizes.
>>
>> Add a domain creation parameter, and audit it appropriately in the
>> {arch_,}sanitise_domain_config() functions.
>>
>> For now, the x86 specific auditing is tuned to Processor Trace running in
>> Single Output mode, which requires a single contiguous range of memory.
>>
>> The size is given an arbitrary limit of 64M, which is expected to be enough for
>> anticipated use cases, but not large enough to get into long-running-hypercall
>> problems.
>>
>> Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index d1e94d88cf..491b32812e 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
>>      v->vcpu_info_mfn = INVALID_MFN;
>>  }
>>  
>> +static void vmtrace_free_buffer(struct vcpu *v)
>> +{
>> +    const struct domain *d = v->domain;
>> +    struct page_info *pg = v->vmtrace.pg;
>> +    unsigned int i;
>> +
>> +    if ( !pg )
>> +        return;
>> +
>> +    v->vmtrace.pg = NULL;
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +    {
>> +        put_page_alloc_ref(&pg[i]);
>> +        put_page_and_type(&pg[i]);
>> +    }
>> +}
>> +
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
> You might as well make this return true/false, as the error code is
> ignored by the caller (at least in this patch).

At some point vcpu_create() needs to be fixed not to lose the error
code.  That is the real bug here.

~Andrew
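
A minimal sketch of the vcpu_create() fix alluded to here, assuming Xen's
ERR_PTR()/PTR_ERR() helpers from xen/err.h (hypothetical; everything
beyond the vmtrace call itself is illustrative):

    #include <xen/err.h>

    struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
    {
        int rc = -ENOMEM;

        /* ... existing allocation steps ... */

        rc = vmtrace_alloc_buffer(v);
        if ( rc )
            goto fail_wq;

        /* ... */

     fail_wq:
        /* ... existing cleanup ... */
        return ERR_PTR(rc);   /* Propagate the precise error, not NULL. */
    }

Callers would then test with IS_ERR()/PTR_ERR() instead of comparing
against NULL.
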
Jan Beulich Feb. 1, 2021, 1:18 p.m. UTC | #3
On 30.01.2021 03:58, Andrew Cooper wrote:
> +static int vmtrace_alloc_buffer(struct vcpu *v)
> +{
> +    struct domain *d = v->domain;
> +    struct page_info *pg;
> +    unsigned int i;
> +
> +    if ( !d->vmtrace_size )
> +        return 0;
> +
> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
> +                             MEMF_no_refcount);
> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    /*
> +     * Getting the reference counting correct here is hard.
> +     *
> +     * All pages are now on the domlist.  They, or subranges within, will be

"domlist" is too imprecise, as there's no list with this name. It's
extra_page_list in this case (see also below).

> +     * freed when their reference count drops to zero, which may happen any time
> +     * between now and the domain teardown path.
> +     */
> +
> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
> +            goto refcnt_err;
> +
> +    /*
> +     * We must only let vmtrace_free_buffer() take any action in the success
> +     * case when we've taken all the refs it intends to drop.
> +     */
> +    v->vmtrace.pg = pg;
> +
> +    return 0;
> +
> + refcnt_err:
> +    /*
> +     * In the failure case, we must drop all the acquired typerefs thus far,
> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
> +     * drop the alloc refs on any remaining pages - some pages could already
> +     * have been freed behind our backs.
> +     */
> +    while ( i-- )
> +        put_page_and_type(&pg[i]);
> +
> +    return -ENODATA;
> +}

As said in reply on the other thread, PGC_extra pages don't get
freed automatically. I too initially thought they would, but
(re-)learned otherwise when trying to repro your claims on that
other thread. For all pages you've managed to get the writable
ref, freeing is easily done by prefixing the loop body above by
put_page_alloc_ref(). For all other pages the best you can do (I
think; see the debugging patches I had sent on that other
thread) is to try get_page() - if it succeeds, calling
put_page_alloc_ref() is allowed. Otherwise we can only leak the
respective page (unless going to further extents with trying to
recover from the "impossible"), or assume the failure here was
because it did get freed already.

Jan
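
A sketch of the unwind Jan describes (illustrative; only pg, i and d come
from the patch, with the typeref acquisition having failed at index i):

     refcnt_err:
        {
            unsigned int j;

            /* Pages we hold the writable typeref on can be freed outright:
             * drop the alloc ref first, then the typeref. */
            for ( j = 0; j < i; j++ )
            {
                put_page_alloc_ref(&pg[j]);
                put_page_and_type(&pg[j]);
            }

            /* For the rest we hold no reference.  A successful temporary
             * get_page() proves the page is still live, making it safe to
             * drop the alloc ref; otherwise the page must be leaked (or
             * has already been freed behind our backs). */
            for ( j = i; j < (d->vmtrace_size >> PAGE_SHIFT); j++ )
            {
                if ( get_page(&pg[j], d) )
                {
                    put_page_alloc_ref(&pg[j]);
                    put_page(&pg[j]);
                }
            }
        }

        return -ENODATA;
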
Andrew Cooper Feb. 1, 2021, 2:22 p.m. UTC | #4
On 01/02/2021 13:18, Jan Beulich wrote:
> On 30.01.2021 03:58, Andrew Cooper wrote:
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>> +{
>> +    struct domain *d = v->domain;
>> +    struct page_info *pg;
>> +    unsigned int i;
>> +
>> +    if ( !d->vmtrace_size )
>> +        return 0;
>> +
>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>> +                             MEMF_no_refcount);
>> +    if ( !pg )
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Getting the reference counting correct here is hard.
>> +     *
>> +     * All pages are now on the domlist.  They, or subranges within, will be
> "domlist" is too imprecise, as there's no list with this name. It's
> extra_page_list in this case (see also below).
>
>> +     * freed when their reference count drops to zero, which may happen any time
>> +     * between now and the domain teardown path.
>> +     */
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>> +            goto refcnt_err;
>> +
>> +    /*
>> +     * We must only let vmtrace_free_buffer() take any action in the success
>> +     * case when we've taken all the refs it intends to drop.
>> +     */
>> +    v->vmtrace.pg = pg;
>> +
>> +    return 0;
>> +
>> + refcnt_err:
>> +    /*
>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>> +     * drop the alloc refs on any remaining pages - some pages could already
>> +     * have been freed behind our backs.
>> +     */
>> +    while ( i-- )
>> +        put_page_and_type(&pg[i]);
>> +
>> +    return -ENODATA;
>> +}
> As said in reply on the other thread, PGC_extra pages don't get
> freed automatically. I too initially thought they would, but
> (re-)learned otherwise when trying to repro your claims on that
> other thread. For all pages you've managed to get the writable
> ref, freeing is easily done by prefixing the loop body above by
> put_page_alloc_ref(). For all other pages the best you can do (I
> think; see the debugging patches I had sent on that other
> thread) is to try get_page() - if it succeeds, calling
> put_page_alloc_ref() is allowed. Otherwise we can only leak the
> respective page (unless going to further extents with trying to
> recover from the "impossible"), or assume the failure here was
> because it did get freed already.

Right - I'm going to insist on breaking apart orthogonal issues.

This refcounting issue isn't introduced by this series - this series
uses an established pattern, in which we've found a corner case.

The corner case is theoretical, not practical - it is not possible for a
malicious PV domain to take 2^43 refs on any of the pages in this
allocation.  Doing so would require an hours-long SMI, or equivalent,
and even then all malicious activity would be paused after 1s for the
time calibration rendezvous which would livelock the system until the
watchdog kicked in.
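
(For scale: 2^43 is roughly 8.8e12 refs; even acquiring one per nanosecond
would take about 8,800 seconds, i.e. roughly two and a half hours.)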


I will drop the comments, because in light of this discovery, they're
not correct.

We should fix the corner case, but that should be a separate patch.
Whatever we do needs to start by writing down the refcounting rules first,
because it's totally clear that no one understands them, and then adjusting
all examples of this pattern (if necessary).

~Andrew
Jan Beulich Feb. 1, 2021, 2:36 p.m. UTC | #5
On 01.02.2021 15:22, Andrew Cooper wrote:
> On 01/02/2021 13:18, Jan Beulich wrote:
>> On 30.01.2021 03:58, Andrew Cooper wrote:
>>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>>> +{
>>> +    struct domain *d = v->domain;
>>> +    struct page_info *pg;
>>> +    unsigned int i;
>>> +
>>> +    if ( !d->vmtrace_size )
>>> +        return 0;
>>> +
>>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>>> +                             MEMF_no_refcount);
>>> +    if ( !pg )
>>> +        return -ENOMEM;
>>> +
>>> +    /*
>>> +     * Getting the reference counting correct here is hard.
>>> +     *
>>> +     * All pages are now on the domlist.  They, or subranges within, will be
>> "domlist" is too imprecise, as there's no list with this name. It's
>> extra_page_list in this case (see also below).
>>
>>> +     * freed when their reference count drops to zero, which may happen any time
>>> +     * between now and the domain teardown path.
>>> +     */
>>> +
>>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>>> +            goto refcnt_err;
>>> +
>>> +    /*
>>> +     * We must only let vmtrace_free_buffer() take any action in the success
>>> +     * case when we've taken all the refs it intends to drop.
>>> +     */
>>> +    v->vmtrace.pg = pg;
>>> +
>>> +    return 0;
>>> +
>>> + refcnt_err:
>>> +    /*
>>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>>> +     * drop the alloc refs on any remaining pages - some pages could already
>>> +     * have been freed behind our backs.
>>> +     */
>>> +    while ( i-- )
>>> +        put_page_and_type(&pg[i]);
>>> +
>>> +    return -ENODATA;
>>> +}
>> As said in reply on the other thread, PGC_extra pages don't get
>> freed automatically. I too initially thought they would, but
>> (re-)learned otherwise when trying to repro your claims on that
>> other thread. For all pages you've managed to get the writable
>> ref, freeing is easily done by prefixing the loop body above by
>> put_page_alloc_ref(). For all other pages the best you can do (I
>> think; see the debugging patches I had sent on that other
>> thread) is to try get_page() - if it succeeds, calling
>> put_page_alloc_ref() is allowed. Otherwise we can only leak the
>> respective page (unless going to further extents with trying to
>> recover from the "impossible"), or assume the failure here was
>> because it did get freed already.
> 
> Right - I'm going to insist on breaking apart orthogonal issues.
> 
> This refcounting issue isn't introduced by this series - this series
> uses an established pattern, in which we've found a corner case.
> 
> The corner case is theoretical, not practical - it is not possible for a
> malicious PV domain to take 2^43 refs on any of the pages in this
> allocation.  Doing so would require an hours-long SMI, or equivalent,
> and even then all malicious activity would be paused after 1s for the
> time calibration rendezvous which would livelock the system until the
> watchdog kicked in.

Actually an overflow is only one of the possible reasons here.
Another, which may be more "practical", is that another entity
has already managed to free the page (by dropping its alloc-ref,
and of course implying it did guess at the MFN).

Jan
Andrew Cooper Feb. 1, 2021, 10:14 p.m. UTC | #6
On 01/02/2021 14:36, Jan Beulich wrote:
> On 01.02.2021 15:22, Andrew Cooper wrote:
>> On 01/02/2021 13:18, Jan Beulich wrote:
>>> On 30.01.2021 03:58, Andrew Cooper wrote:
>>>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>>>> +{
>>>> +    struct domain *d = v->domain;
>>>> +    struct page_info *pg;
>>>> +    unsigned int i;
>>>> +
>>>> +    if ( !d->vmtrace_size )
>>>> +        return 0;
>>>> +
>>>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>>>> +                             MEMF_no_refcount);
>>>> +    if ( !pg )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /*
>>>> +     * Getting the reference counting correct here is hard.
>>>> +     *
>>>> +     * All pages are now on the domlist.  They, or subranges within, will be
>>> "domlist" is too imprecise, as there's no list with this name. It's
>>> extra_page_list in this case (see also below).
>>>
>>>> +     * freed when their reference count drops to zero, which may happen any time
>>>> +     * between now and the domain teardown path.
>>>> +     */
>>>> +
>>>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>>>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>>>> +            goto refcnt_err;
>>>> +
>>>> +    /*
>>>> +     * We must only let vmtrace_free_buffer() take any action in the success
>>>> +     * case when we've taken all the refs it intends to drop.
>>>> +     */
>>>> +    v->vmtrace.pg = pg;
>>>> +
>>>> +    return 0;
>>>> +
>>>> + refcnt_err:
>>>> +    /*
>>>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>>>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>>>> +     * drop the alloc refs on any remaining pages - some pages could already
>>>> +     * have been freed behind our backs.
>>>> +     */
>>>> +    while ( i-- )
>>>> +        put_page_and_type(&pg[i]);
>>>> +
>>>> +    return -ENODATA;
>>>> +}
>>> As said in reply on the other thread, PGC_extra pages don't get
>>> freed automatically. I too initially thought they would, but
>>> (re-)learned otherwise when trying to repro your claims on that
>>> other thread. For all pages you've managed to get the writable
>>> ref, freeing is easily done by prefixing the loop body above by
>>> put_page_alloc_ref(). For all other pages best you can do (I
>>> think; see the debugging patches I had sent on that other
>>> thread) is to try get_page() - if it succeeds, calling
>>> put_page_alloc_ref() is allowed. Otherwise we can only leak the
>>> respective page (unless going to further extents with trying to
>>> recover from the "impossible"), or assume the failure here was
>>> because it did get freed already.
>> Right - I'm going to insist on breaking apart orthogonal issues.
>>
>> This refcounting issue isn't introduced by this series - this series
>> uses an established pattern, in which we've found a corner case.
>>
>> The corner case is theoretical, not practical - it is not possible for a
>> malicious PV domain to take 2^43 refs on any of the pages in this
>> allocation.  Doing so would require an hours-long SMI, or equivalent,
>> and even then all malicious activity would be paused after 1s for the
>> time calibration rendezvous which would livelock the system until the
>> watchdog kicked in.
> Actually an overflow is only one of the possible reasons here.
> Another, which may be more "practical", is that another entity
> has already managed to free the page (by dropping its alloc-ref,
> and of course implying it did guess at the MFN).

Yes, but in that case the page did get dropped from the extra page list,
so looping over the remaining ones in domain_relinquish_resources() would
be safe.

~Andrew

Patch

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b9ba04633e..6c7ee25f3b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -660,6 +660,29 @@  int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
         return -EINVAL;
     }
 
+    if ( config->vmtrace_size )
+    {
+        unsigned int size = config->vmtrace_size;
+
+        ASSERT(vmtrace_available); /* Checked by common code. */
+
+        /*
+         * For now, vmtrace is restricted to HVM guests, and using a
+         * power-of-2 buffer between 4k and 64M in size.
+         */
+        if ( !hvm )
+        {
+            dprintk(XENLOG_INFO, "vmtrace not supported for PV\n");
+            return -EINVAL;
+        }
+
+        if ( size < PAGE_SIZE || size > MB(64) || (size & (size - 1)) )
+        {
+            dprintk(XENLOG_INFO, "Unsupported vmtrace size: %#x\n", size);
+            return -EINVAL;
+        }
+    }
+
     return 0;
 }
 
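The size check above relies on the standard bit trick: for non-zero x,
x & (x - 1) clears the lowest set bit, so it is zero exactly when x is a
power of two.  An equivalent standalone predicate (illustrative, not part
of the patch):

    static bool vmtrace_size_ok(unsigned int size)
    {
        return size >= PAGE_SIZE && size <= MB(64) &&
               !(size & (size - 1));
    }
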
diff --git a/xen/common/domain.c b/xen/common/domain.c
index d1e94d88cf..491b32812e 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -132,6 +132,71 @@  static void vcpu_info_reset(struct vcpu *v)
     v->vcpu_info_mfn = INVALID_MFN;
 }
 
+static void vmtrace_free_buffer(struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    struct page_info *pg = v->vmtrace.pg;
+    unsigned int i;
+
+    if ( !pg )
+        return;
+
+    v->vmtrace.pg = NULL;
+
+    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
+    {
+        put_page_alloc_ref(&pg[i]);
+        put_page_and_type(&pg[i]);
+    }
+}
+
+static int vmtrace_alloc_buffer(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    struct page_info *pg;
+    unsigned int i;
+
+    if ( !d->vmtrace_size )
+        return 0;
+
+    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
+                             MEMF_no_refcount);
+    if ( !pg )
+        return -ENOMEM;
+
+    /*
+     * Getting the reference counting correct here is hard.
+     *
+     * All pages are now on the domlist.  They, or subranges within, will be
+     * freed when their reference count drops to zero, which may happen any time
+     * between now and the domain teardown path.
+     */
+
+    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
+        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
+            goto refcnt_err;
+
+    /*
+     * We must only let vmtrace_free_buffer() take any action in the success
+     * case when we've taken all the refs it intends to drop.
+     */
+    v->vmtrace.pg = pg;
+
+    return 0;
+
+ refcnt_err:
+    /*
+     * In the failure case, we must drop all the acquired typerefs thus far,
+     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
+     * drop the alloc refs on any remaining pages - some pages could already
+     * have been freed behind our backs.
+     */
+    while ( i-- )
+        put_page_and_type(&pg[i]);
+
+    return -ENODATA;
+}
+
 /*
  * Release resources held by a vcpu.  There may or may not be live references
  * to the vcpu, and it may or may not be fully constructed.
@@ -140,6 +205,8 @@  static void vcpu_info_reset(struct vcpu *v)
  */
 static int vcpu_teardown(struct vcpu *v)
 {
+    vmtrace_free_buffer(v);
+
     return 0;
 }
 
@@ -201,6 +268,9 @@  struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
     if ( sched_init_vcpu(v) != 0 )
         goto fail_wq;
 
+    if ( vmtrace_alloc_buffer(v) != 0 )
+        goto fail_wq;
+
     if ( arch_vcpu_create(v) != 0 )
         goto fail_sched;
 
@@ -449,6 +519,12 @@  static int sanitise_domain_config(struct xen_domctl_createdomain *config)
         }
     }
 
+    if ( config->vmtrace_size && !vmtrace_available )
+    {
+        dprintk(XENLOG_INFO, "vmtrace requested but not available\n");
+        return -EINVAL;
+    }
+
     return arch_sanitise_domain_config(config);
 }
 
@@ -474,7 +550,10 @@  struct domain *domain_create(domid_t domid,
     ASSERT(is_system_domain(d) ? config == NULL : config != NULL);
 
     if ( config )
+    {
         d->options = config->flags;
+        d->vmtrace_size = config->vmtrace_size;
+    }
 
     /* Sort out our idea of is_control_domain(). */
     d->is_privileged = is_priv;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 666aeb71bf..88a5b1ef5d 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -95,6 +95,9 @@  struct xen_domctl_createdomain {
     int32_t max_grant_frames;
     int32_t max_maptrack_frames;
 
+    /* Per-vCPU buffer size in bytes.  0 to disable. */
+    uint32_t vmtrace_size;
+
     struct xen_arch_domainconfig arch;
 };
 
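A hypothetical toolstack-side sketch of the new field (the flags and
values here are illustrative; only vmtrace_size is introduced by this
patch):

    struct xen_domctl_createdomain config = {
        .flags        = XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap,
        .max_vcpus    = 4,
        .vmtrace_size = 256 << 10,  /* power of two, between 4k and 64M */
    };
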
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 06dba1a397..bc78a09a53 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -272,6 +272,10 @@  struct vcpu
     /* vPCI per-vCPU area, used to store data for long running operations. */
     struct vpci_vcpu vpci;
 
+    struct {
+        struct page_info *pg; /* One contiguous allocation of d->vmtrace_size */
+    } vmtrace;
+
     struct arch_vcpu arch;
 
 #ifdef CONFIG_IOREQ_SERVER
@@ -547,6 +551,8 @@  struct domain
         unsigned int guest_request_sync          : 1;
     } monitor;
 
+    unsigned int vmtrace_size; /* Buffer size in bytes, or 0 to disable. */
+
 #ifdef CONFIG_ARGO
     /* Argo interdomain communication support */
     struct argo_domain *argo;