[RFC,v3,1/3] xen/memory : Add a stats_table resource type

Message ID ec106b56640c92f91473c8855c8dccc20e3eb5df.1698589351.git.matias.vara@vates.fr (mailing list archive)
State New, archived
Series Add a new acquire resource to query vcpu statistics

Commit Message

Matias Ezequiel Vara Larsen Oct. 31, 2023, 4:21 p.m. UTC
This commit proposes a new mechanism to query the RUNSTATE_running counter for
a given vcpu from a dom0 userspace application. This commit proposes to expose
that counter by using the acquire_resource interface. For this purpose, the
commit adds a new resource named XENMEM_resource_stats_table and a
type-specific resource named XENMEM_resource_stats_table_id_vcpustats. This
type-specific resource aims at exposing per-VCPU counters.

The current mechanism relies on XEN_DOMCTL_getvcpuinfo, which holds the
single global domctl_lock for the entire hypercall and iterates over every
vcpu in the system for every update, thus impacting operations that share
that lock. This commit proposes to expose the vcpu RUNSTATE_running counter
via the xenforeignmemory interface, thus avoiding the need to issue the
hypercall and hold the lock.

For that purpose, a new resource type named XENMEM_resource_stats_table is
added. In particular, the type-specific resource
XENMEM_resource_stats_table_id_vcpustats is added to host per-VCPU counters.
These counters are mapped in frames. These frames are only allocated when
the resource is requested. The number of allocated frames scales with the
domain's max_vcpus. Frames are released when the domain is destroyed.

To expose this information to a consumer, two structures are required:
1. vcpu_shmem_stats
2. vcpu_stats[]

The memory layout has been discussed at [1]. The layout looks like:

struct vcpu_shmem_stats {
#define VCPU_STATS_MAGIC 0xaabbccdd
    uint32_t magic;
    uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
    uint32_t size;    // sizeof(vcpu_stats)
    uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
};

struct vcpu_stats {
    /*
     * If the least-significant bit of the seq number is set then an update
     * is in progress and the consumer must wait to read a consistent set of
     * values. This mechanism is similar to Linux's seqlock.
     */
    uint32_t seq;
    uint32_t _pad;
    /*
     * If the most-significant bit of a counter is set then the counter
     * is inactive and the consumer must ignore its value. Note that this
     * could also indicate that the counter has overflowed.
     */
    uint64_t stats_a; // e.g., runstate_running_time
    uint64_t stats_b;
    uint64_t stats_c;
    ...
};

All padding fields shall be marked as "inactive". The consumer can't
distinguish an inactive counter from an overflowed one. Also, the consumer
shall always verify before reading that:

          offsetof(struct vcpu_stats, stats_y) < size

in case the consumer knows about a counter, e.g., stats_y, that Xen does
not.
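
For illustration, a consumer could read a counter along these lines (a
minimal sketch using the field names of the layout above; obtaining the
foreign mapping, e.g. via xenforeignmemory_map_resource(), is not shown,
and the plain loads would want ACCESS_ONCE()-style accessors in real code):

/* Returns 0 and stores the value on success, -1 if the counter must be
 * ignored or is not provided by this Xen. */
static int read_stats_a(const void *map, unsigned int vcpu_id, uint64_t *out)
{
    const struct vcpu_shmem_stats *hd = map;
    const struct vcpu_stats *st;
    uint32_t seq;
    uint64_t val;

    if ( hd->magic != VCPU_STATS_MAGIC )
        return -1;

    /* Never read a counter that this Xen does not know about. */
    if ( offsetof(struct vcpu_stats, stats_a) >= hd->size )
        return -1;

    /* Per-vcpu slots start at 'offset' and are 'stride' bytes apart. */
    st = (const void *)((const char *)hd + hd->offset + vcpu_id * hd->stride);

    do {
        while ( (seq = st->seq) & 1 )
            ;                             /* odd: update in progress */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        val = st->stats_a;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    } while ( st->seq != seq );

    if ( val & (1ULL << 63) )             /* MSB: inactive or overflowed */
        return -1;

    *out = val;
    return 0;
}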

The vcpu_shmem_stats structure exposes a magic number, the size of the
vcpu_stats structure, the offset at which the vcpu_stats structures begin,
and the stride between instances. The addresses of the vcpu_shmem_stats and
vcpu_stats instances are cache-line aligned to prevent cache ping-pong when
accessing per-vcpu elements. In the vcpu_stats structure, the
most-significant bit of a counter is set to indicate that the counter is
either inactive or has overflowed, so the consumer must ignore its value.

Note that updating a vcpu's counters happens on a hot path; thus, in this
commit, copying only happens when it is specifically required.

Note that the resource is extensible in two ways. First, the vcpu_stats
structure can be extended with new per-vcpu counters. To do so, new per-vcpu
counters shall be added after the last element of the structure definition
so as not to break existing consumers; see the sketch below. Second, new
type-specific resources can be added to host different sorts of counters.
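
For instance, a future revision could append a hypothetical counter
(stats_new is an illustrative name, not part of this series); consumers
that check offsetof() against size keep working unmodified:

struct vcpu_stats {
    uint32_t seq;
    uint32_t _pad;
    uint64_t stats_a;
    ...
    uint64_t stats_new; // hypothetical, appended after the last element
};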

[1]
https://lists.xenproject.org/archives/html/xen-devel/2023-03/msg00383.html

Signed-off-by: Matias Ezequiel Vara Larsen <matias.vara@vates.fr>
---
Changes in v3:
- allow hosting an arbitrary number of vcpus
- release resource during domain_kill() since it is guest-type
  independent and arch agnostic
- rework stats_table_max_frames()
- use memory layout as discussed at
  https://lists.xenproject.org/archives/html/xen-devel/2023-03/msg00383.html

Changes in v2:
- rework to ensure that guest reads a coherent value by using a version
  number in the vcpu_stats structure
- add version to the vcpu_stats structure

Changes in v1:
- rework the allocation and releasing of the frames
- use the zero frame for per-vcpu counters that are listed as an array
- allocate vcpu stats frames only when the resource is requested
- rewrite commit message
- add the vcpu_stats structure to keep per-vcpu counters
- add the shared_vcpustatspage to keep an array of per-vcpu counters for a
  given domain
- declare the structures in a public header 
- define the vcpustats_page in the domain structure
---
 xen/common/domain.c         |   1 +
 xen/common/memory.c         | 167 ++++++++++++++++++++++++++++++++++++
 xen/common/sched/core.c     |  20 +++++
 xen/include/public/memory.h |   3 +
 xen/include/public/vcpu.h   |  27 ++++++
 xen/include/xen/mm.h        |   2 +
 xen/include/xen/sched.h     |   5 ++
 7 files changed, 225 insertions(+)

Comments

Jan Beulich April 22, 2024, 4:07 p.m. UTC | #1
On 31.10.2023 17:21, Matias Ezequiel Vara Larsen wrote:
> This commit proposes a new mechanism to query the RUNSTATE_running counter for
> a given vcpu from a dom0 userspace application. This commit proposes to expose
> that counter by using the acquire_resource interface. For this purpose, the
> commit adds a new resource named XENMEM_resource_stats_table and a
> type-specific resource named XENMEM_resource_stats_table_id_vcpustats. This
> type-specific resource aims at exposing per-VCPU counters.
> 
> The current mechanism relies on XEN_DOMCTL_getvcpuinfo, which holds the
> single global domctl_lock for the entire hypercall and iterates over every
> vcpu in the system for every update, thus impacting operations that share
> that lock. This commit proposes to expose the vcpu RUNSTATE_running counter
> via the xenforeignmemory interface, thus avoiding the need to issue the
> hypercall and hold the lock.

Throughout, can you please avoid "this commit" and "proposes" and the like?
You want to plainly describe what you do and why.

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1092,6 +1092,27 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
>      return nr;
>  }
>  
> +unsigned int stats_table_max_frames(const struct domain *d, unsigned int id)

Any particular reason this isn't next to its sibling acquire_stats_table()?

> +{
> +    unsigned int nr = 0;
> +    unsigned int size;
> +
> +    switch ( id )
> +    {
> +    case XENMEM_resource_stats_table_id_vcpustats:
> +        size = DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +
> +               DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES) * d->max_vcpus;
> +
> +        nr = DIV_ROUND_UP(size, PAGE_SIZE);
> +        break;
> +
> +    default:
> +        break;
> +    }
> +
> +    return nr;
> +}
> +
>  /*
>   * Return 0 on any kind of error.  Caller converts to -EINVAL.
>   *
> @@ -1113,6 +1134,9 @@ static unsigned int resource_max_frames(const struct domain *d,
>      case XENMEM_resource_vmtrace_buf:
>          return d->vmtrace_size >> PAGE_SHIFT;
>  
> +    case XENMEM_resource_stats_table:
> +        return stats_table_max_frames(d, id);
> +
>      default:
>          return -EOPNOTSUPP;
>      }
> @@ -1176,6 +1200,146 @@ static int acquire_vmtrace_buf(
>      return nr_frames;
>  }
>  
> +void stats_free_vcpu_mfn(struct domain * d)
> +{
> +    struct page_info *pg = d->vcpustats_page.pg;
> +    void * _va = d->vcpustats_page.va;

What use is the leading underscore here? Also (nit) stray blank after *.

> +    unsigned int i;
> +    unsigned int size;
> +    unsigned int nr_frames;
> +
> +    if ( !pg )
> +        return;
> +
> +    d->vcpustats_page.pg = NULL;
> +    d->vcpustats_page.va = NULL;
> +
> +    size = DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +
> +           DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES)
> +           * d->max_vcpus;
> +
> +    nr_frames = DIV_ROUND_UP(size, PAGE_SIZE);
> +
> +    vunmap(_va);
> +
> +    for ( i = 0; i < nr_frames; i++ )
> +    {
> +        put_page_alloc_ref(&pg[i]);
> +        put_page_and_type(&pg[i]);
> +    }
> +}
> +
> +static int stats_vcpu_alloc_mfn(struct domain *d)

Both functions' names suggest a single MFN is being allocated / freed.

> +{
> +    struct page_info *pg;
> +    int order;

This can't be negative, can it?

> +    unsigned int i;
> +    unsigned int size;
> +    unsigned int nr_frames;
> +    void *_va;
> +    mfn_t mfn;
> +    struct vcpu_shmem_stats *hd;
> +
> +    size = DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +

Nit: Style is correct here, ...

> +           DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES)
> +           * d->max_vcpus;

... but then not here (* belongs on the earlier line). Also this is the 3rd
instance of the same expression, which thus likely wants putting in a
helper function.
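
A possible shape for such a helper, keeping the patch's expression verbatim
(name and placement are illustrative only):

static unsigned int vcpu_stats_size(const struct domain *d)
{
    /* Same expression as used at the three call sites. */
    return DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +
           DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES) *
           d->max_vcpus;
}

Callers would then compute DIV_ROUND_UP(vcpu_stats_size(d), PAGE_SIZE) for
the frame count.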

> +    nr_frames = DIV_ROUND_UP(size, PAGE_SIZE);
> +
> +    order = get_order_from_bytes(size);
> +    pg = alloc_domheap_pages(d, order, MEMF_no_refcount);
> +
> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    for ( i = 0; i < nr_frames; i++ )
> +    {
> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
> +            /*
> +             * The domain can't possibly know about this page yet, so failure
> +             * here is a clear indication of something fishy going on.
> +             */
> +            goto refcnt_err;
> +    }
> +
> +    mfn = page_to_mfn(pg);
> +    _va = vmap(&mfn, nr_frames);

This won't work when nr_frames > 1. The first parameter of this function
points to an array of nr_frames elements. I think you mean vmap_contig(),
yet even better would be if you didn't needlessly allocate non-order-0
pages (and perhaps excess space) here.
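
For illustration, a minimal way to give vmap() the MFN array it expects
while keeping the current order-N allocation (error handling elided; using
per-frame order-0 allocations instead, as suggested, would also avoid the
excess space):

mfn_t *mfns = xmalloc_array(mfn_t, nr_frames);

if ( !mfns )
    goto refcnt_err;

/* vmap() wants one MFN per page to be mapped. */
for ( i = 0; i < nr_frames; i++ )
    mfns[i] = mfn_add(page_to_mfn(pg), i);

_va = vmap(mfns, nr_frames);
xfree(mfns);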

> +    if ( !_va )
> +        goto refcnt_err;
> +
> +    for ( i = 0; i < nr_frames; i++ )
> +        clear_page(_va + i * PAGE_SIZE);
> +
> +    /* Initialize vcpu_shmem_stats header */
> +    hd = (struct vcpu_shmem_stats*)_va;

Please avoid casts when they're not needed.

> +    hd->magic = VCPU_STATS_MAGIC;
> +    hd->offset = ROUNDUP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES);
> +    hd->size = sizeof(struct vcpu_stats);
> +    hd->stride = ROUNDUP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES);

Don't you need an smp_wmb() here, to make sure ...

> +    d->vcpustats_page.va  = _va;
> +    d->vcpustats_page.pg = pg;

... the global announcement of these pointers becomes visible only with the
page properly seeded?
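
I.e., something along these lines (sketch):

    hd->stride = ROUNDUP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES);

    /* Ensure the header is fully written before the pointers are published. */
    smp_wmb();

    d->vcpustats_page.va = _va;
    d->vcpustats_page.pg = pg;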

> +    return 0;
> +
> + refcnt_err:
> +    /*
> +     * We can theoretically reach this point if someone has taken 2^43 refs on
> +     * the frames in the time the above loop takes to execute, or someone has
> +     * made a blind decrease reservation hypercall and managed to pick the
> +     * right mfn.  Free the memory we safely can, and leak the rest.
> +     */
> +    while ( i-- )
> +    {
> +        put_page_alloc_ref(&pg[i]);
> +        put_page_and_type(&pg[i]);
> +    }
> +
> +    return -ENODATA;
> +}
> +
> +static int acquire_stats_table(struct domain *d,
> +                               unsigned int id,
> +                               unsigned int frame,
> +                               unsigned int nr_frames,
> +                               xen_pfn_t mfn_list[])
> +{
> +    mfn_t mfn;
> +    int rc;
> +    unsigned int i;
> +    unsigned int max_frames;
> +
> +    if ( !d )
> +        return -ENOENT;
> +
> +    switch ( id )
> +    {
> +    case XENMEM_resource_stats_table_id_vcpustats:
> +        max_frames = DIV_ROUND_UP(d->max_vcpus * sizeof(struct vcpu_stats),
> +                                  PAGE_SIZE);

What about the sizeof(struct vcpu_shmem_stats) part? And what about the
SMP_CACHE_BYTES alignment enforced elsewhere?

> +        if ( (frame + nr_frames) > max_frames )
> +            return -EINVAL;
> +
> +        if ( !d->vcpustats_page.pg )
> +        {
> +            rc = stats_vcpu_alloc_mfn(d);
> +            if ( rc )
> +                return rc;
> +        }

I might guess you used acquire_vmtrace_buf() as a reference, but that looks
to suffer from the same issue as this code: What if two racing requests
appear? Both will allocate new memory, and one of the two blocks will be
leaked. With the "right" timing you may even end up with inconsistent .va
and .pg fields.
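
One way to close that race would be to serialize the allocation and
re-check under a lock; a sketch (whether the domain lock is the right one
here is an assumption):

    domain_lock(d);

    rc = d->vcpustats_page.pg ? 0 : stats_vcpu_alloc_mfn(d);

    domain_unlock(d);

    if ( rc )
        return rc;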

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -264,6 +264,11 @@ static inline void vcpu_runstate_change(
>  {
>      s_time_t delta;
>      struct sched_unit *unit = v->sched_unit;
> +    struct vcpu_shmem_stats *vcpu_shmem;
> +    struct vcpu_stats *vcpu_info;
> +    void *_va;
> +    struct domain *d = v->domain;
> +    int offset;

Once again - this can't be negative, can it? (I question the need for the
variable in the first place, though.)

> @@ -287,6 +292,21 @@ static inline void vcpu_runstate_change(
>      }
>  
>      v->runstate.state = new_state;
> +
> +    _va = d->vcpustats_page.va;
> +
> +    if ( !_va )
> +        return;
> +
> +    vcpu_shmem = (struct vcpu_shmem_stats*)_va;
> +    vcpu_info = (struct vcpu_stats*)((void*)vcpu_shmem + vcpu_shmem->offset);

You just cast _va from void *. And here you cast it back right away. As before,
please limit casts to ones actually necessary (and sensible).
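
For illustration, the same lookup without the redundant casts. Note this
sketch also applies offset and stride as byte counts, whereas the posted
"vcpu_info += offset" scales the byte offset by sizeof(struct vcpu_stats):

    struct vcpu_shmem_stats *vcpu_shmem = d->vcpustats_page.va;
    struct vcpu_stats *vcpu_info;

    if ( !vcpu_shmem )
        return;

    /* Per-vcpu slots start at 'offset' and are 'stride' bytes apart. */
    vcpu_info = (void *)vcpu_shmem + vcpu_shmem->offset +
                v->vcpu_id * vcpu_shmem->stride;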

> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -235,6 +235,33 @@ struct vcpu_register_time_memory_area {
>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>  
> +struct vcpu_shmem_stats {
> +#define VCPU_STATS_MAGIC 0xaabbccdd
> +    uint32_t magic;
> +    uint32_t offset;
> +    uint32_t size;
> +    uint32_t stride;
> +};
> +
> +struct vcpu_stats {
> +    /*
> +     * If the least-significant bit of the seq number is set then an update
> +     * is in progress and the consumer must wait to read a consistent set of
> +     * values. This mechanism is similar to Linux's seqlock.
> +     */
> +    uint32_t seq;
> +    uint32_t pad0;
> +    /*
> +     * If the most-significant bit of a counter is set then the counter
> +     * is inactive and the consumer must ignore its value. Note that this
> +     * could also indicate that the counter has overflowed.
> +     */
> +     uint64_t runstate_running_time;

Nit: Indentation.

> +};
> +
> +typedef struct vcpu_shmem_stats xen_vcpu_shmemstats_t;
> +typedef struct vcpu_stats xen_shared_vcpustats_t;

When you properly use xen_ prefixes for the typedef names, why not also for
the structure tags and (then XEN_) for the magic number?

Also - no blank line above here please.

Jan

Patch

diff --git a/xen/common/domain.c b/xen/common/domain.c
index e3afcacb6c..27d8961a21 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -948,6 +948,7 @@  int domain_kill(struct domain *d)
             break;
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
+        stats_free_vcpu_mfn(d);
         for_each_vcpu ( d, v )
             unmap_vcpu_info(v);
         d->is_dying = DOMDYING_dead;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 76f8858cc3..2acac40c63 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1092,6 +1092,27 @@  unsigned int ioreq_server_max_frames(const struct domain *d)
     return nr;
 }
 
+unsigned int stats_table_max_frames(const struct domain *d, unsigned int id)
+{
+    unsigned int nr = 0;
+    unsigned int size;
+
+    switch ( id )
+    {
+    case XENMEM_resource_stats_table_id_vcpustats:
+        size = DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +
+               DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES) * d->max_vcpus;
+
+        nr = DIV_ROUND_UP(size, PAGE_SIZE);
+        break;
+
+    default:
+        break;
+    }
+
+    return nr;
+}
+
 /*
  * Return 0 on any kind of error.  Caller converts to -EINVAL.
  *
@@ -1113,6 +1134,9 @@  static unsigned int resource_max_frames(const struct domain *d,
     case XENMEM_resource_vmtrace_buf:
         return d->vmtrace_size >> PAGE_SHIFT;
 
+    case XENMEM_resource_stats_table:
+        return stats_table_max_frames(d, id);
+
     default:
         return -EOPNOTSUPP;
     }
@@ -1176,6 +1200,146 @@  static int acquire_vmtrace_buf(
     return nr_frames;
 }
 
+void stats_free_vcpu_mfn(struct domain * d)
+{
+    struct page_info *pg = d->vcpustats_page.pg;
+    void * _va = d->vcpustats_page.va;
+    unsigned int i;
+    unsigned int size;
+    unsigned int nr_frames;
+
+    if ( !pg )
+        return;
+
+    d->vcpustats_page.pg = NULL;
+    d->vcpustats_page.va = NULL;
+
+    size = DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +
+           DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES)
+           * d->max_vcpus;
+
+    nr_frames = DIV_ROUND_UP(size, PAGE_SIZE);
+
+    vunmap(_va);
+
+    for ( i = 0; i < nr_frames; i++ )
+    {
+        put_page_alloc_ref(&pg[i]);
+        put_page_and_type(&pg[i]);
+    }
+}
+
+static int stats_vcpu_alloc_mfn(struct domain *d)
+{
+    struct page_info *pg;
+    int order;
+    unsigned int i;
+    unsigned int size;
+    unsigned int nr_frames;
+    void *_va;
+    mfn_t mfn;
+    struct vcpu_shmem_stats *hd;
+
+    size = DIV_ROUND_UP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES) +
+           DIV_ROUND_UP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES)
+           * d->max_vcpus;
+
+    nr_frames = DIV_ROUND_UP(size, PAGE_SIZE);
+
+    order = get_order_from_bytes(size);
+    pg = alloc_domheap_pages(d, order, MEMF_no_refcount);
+
+    if ( !pg )
+        return -ENOMEM;
+
+    for ( i = 0; i < nr_frames; i++ )
+    {
+        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
+            /*
+             * The domain can't possibly know about this page yet, so failure
+             * here is a clear indication of something fishy going on.
+             */
+            goto refcnt_err;
+    }
+
+    mfn = page_to_mfn(pg);
+    _va = vmap(&mfn, nr_frames);
+    if ( !_va )
+        goto refcnt_err;
+
+    for ( i = 0; i < nr_frames; i++ )
+        clear_page(_va + i * PAGE_SIZE);
+
+    /* Initialize vcpu_shmem_stats header */
+    hd = (struct vcpu_shmem_stats*)_va;
+    hd->magic = VCPU_STATS_MAGIC;
+    hd->offset = ROUNDUP(sizeof(struct vcpu_shmem_stats), SMP_CACHE_BYTES);
+    hd->size = sizeof(struct vcpu_stats);
+    hd->stride = ROUNDUP(sizeof(struct vcpu_stats), SMP_CACHE_BYTES);
+
+    d->vcpustats_page.va  = _va;
+    d->vcpustats_page.pg = pg;
+
+    return 0;
+
+ refcnt_err:
+    /*
+     * We can theoretically reach this point if someone has taken 2^43 refs on
+     * the frames in the time the above loop takes to execute, or someone has
+     * made a blind decrease reservation hypercall and managed to pick the
+     * right mfn.  Free the memory we safely can, and leak the rest.
+     */
+    while ( i-- )
+    {
+        put_page_alloc_ref(&pg[i]);
+        put_page_and_type(&pg[i]);
+    }
+
+    return -ENODATA;
+}
+
+static int acquire_stats_table(struct domain *d,
+                               unsigned int id,
+                               unsigned int frame,
+                               unsigned int nr_frames,
+                               xen_pfn_t mfn_list[])
+{
+    mfn_t mfn;
+    int rc;
+    unsigned int i;
+    unsigned int max_frames;
+
+    if ( !d )
+        return -ENOENT;
+
+    switch ( id )
+    {
+    case XENMEM_resource_stats_table_id_vcpustats:
+        max_frames = DIV_ROUND_UP(d->max_vcpus * sizeof(struct vcpu_stats),
+                                  PAGE_SIZE);
+
+        if ( (frame + nr_frames) > max_frames )
+            return -EINVAL;
+
+        if ( !d->vcpustats_page.pg )
+        {
+            rc = stats_vcpu_alloc_mfn(d);
+            if ( rc )
+                return rc;
+        }
+
+        mfn = page_to_mfn(d->vcpustats_page.pg);
+        for ( i = 0; i < nr_frames; i++ )
+            mfn_list[i] = mfn_x(mfn) + frame + i;
+        break;
+
+    default:
+        return -EINVAL;
+    }
+
+    return nr_frames;
+}
+
 /*
  * Returns -errno on error, or positive in the range [1, nr_frames] on
  * success.  Returning less than nr_frames contitutes a request for a
@@ -1196,6 +1360,9 @@  static int _acquire_resource(
     case XENMEM_resource_vmtrace_buf:
         return acquire_vmtrace_buf(d, id, frame, nr_frames, mfn_list);
 
+    case XENMEM_resource_stats_table:
+        return acquire_stats_table(d, id, frame, nr_frames, mfn_list);
+
     default:
         return -EOPNOTSUPP;
     }
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 03ace41540..a05891fd32 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -264,6 +264,11 @@  static inline void vcpu_runstate_change(
 {
     s_time_t delta;
     struct sched_unit *unit = v->sched_unit;
+    struct vcpu_shmem_stats *vcpu_shmem;
+    struct vcpu_stats *vcpu_info;
+    void *_va;
+    struct domain *d = v->domain;
+    int offset;
 
     ASSERT(spin_is_locked(get_sched_res(v->processor)->schedule_lock));
     if ( v->runstate.state == new_state )
@@ -287,6 +292,21 @@  static inline void vcpu_runstate_change(
     }
 
     v->runstate.state = new_state;
+
+    _va = d->vcpustats_page.va;
+
+    if ( !_va )
+        return;
+
+    vcpu_shmem = (struct vcpu_shmem_stats*)_va;
+    vcpu_info = (struct vcpu_stats*)((void*)vcpu_shmem + vcpu_shmem->offset);
+    offset = (vcpu_shmem->stride * v->vcpu_id);
+    vcpu_info += offset;
+    vcpu_info->seq = version_update_begin(vcpu_info->seq);
+    smp_wmb();
+    vcpu_info->runstate_running_time = v->runstate.time[RUNSTATE_running];
+    smp_wmb();
+    vcpu_info->seq = version_update_end(vcpu_info->seq);
 }
 
 void sched_guest_idle(void (*idle) (void), unsigned int cpu)
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 383a9468c3..93b58a9ccc 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -633,6 +633,7 @@  struct xen_mem_acquire_resource {
 #define XENMEM_resource_ioreq_server 0
 #define XENMEM_resource_grant_table 1
 #define XENMEM_resource_vmtrace_buf 2
+#define XENMEM_resource_stats_table 3
 
     /*
      * IN - a type-specific resource identifier, which must be zero
@@ -640,11 +641,13 @@  struct xen_mem_acquire_resource {
      *
      * type == XENMEM_resource_ioreq_server -> id == ioreq server id
      * type == XENMEM_resource_grant_table -> id defined below
+     * type == XENMEM_resource_stats_table -> id defined below
      */
     uint32_t id;
 
 #define XENMEM_resource_grant_table_id_shared 0
 #define XENMEM_resource_grant_table_id_status 1
+#define XENMEM_resource_stats_table_id_vcpustats 0
 
     /*
      * IN/OUT
diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
index 3623af932f..f61da90f1d 100644
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -235,6 +235,33 @@  struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+struct vcpu_shmem_stats {
+#define VCPU_STATS_MAGIC 0xaabbccdd
+    uint32_t magic;
+    uint32_t offset;
+    uint32_t size;
+    uint32_t stride;
+};
+
+struct vcpu_stats {
+    /*
+     * If the least-significant bit of the seq number is set then an update
+     * is in progress and the consumer must wait to read a consistent set of
+     * values. This mechanism is similar to Linux's seqlock.
+     */
+    uint32_t seq;
+    uint32_t pad0;
+    /*
+     * If the most-significant bit of a counter is set then the counter
+     * is inactive and the consumer must ignore its value. Note that this
+     * could also indicate that the counter has overflowed.
+     */
+     uint64_t runstate_running_time;
+};
+
+typedef struct vcpu_shmem_stats xen_vcpu_shmemstats_t;
+typedef struct vcpu_stats xen_shared_vcpustats_t;
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 5db26ed477..c96d18f3a5 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -148,6 +148,8 @@  int assign_page(
 /* Dump info to serial console */
 void arch_dump_shared_mem_info(void);
 
+void stats_free_vcpu_mfn(struct domain *d);
+
 /*
  * Extra fault info types which are used to further describe
  * the source of an access violation.
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 1a1fab5239..a0c1b5955e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -588,6 +588,11 @@  struct domain
         struct ioreq_server     *server[MAX_NR_IOREQ_SERVERS];
     } ioreq_server;
 #endif
+    /* Pages that host vcpu stats */
+    struct {
+        struct page_info *pg;
+        void *va;
+    } vcpustats_page;
 };
 
 static inline struct page_list_head *page_to_list(