
[RFC,v2,1/2] xen/memory: Add a stats_table resource type

Message ID af6032c9f5863b7e6fb183a0a197407ec92bb067.1665138677.git.matias.vara@vates.fr (mailing list archive)
State New, archived
Series Add a new acquire resource to query vcpu statistics

Commit Message

Matias Ezequiel Vara Larsen Oct. 7, 2022, 12:39 p.m. UTC
This commit proposes a new mechanism to query the RUNSTATE_running counter for
a given vcpu from a dom0 userspace application, exposing that counter through
the acquire_resource interface. The current mechanism relies on
XEN_DOMCTL_getvcpuinfo, which holds a single global domctl_lock for the entire
hypercall and iterates over every vcpu in the system for every update, thus
impacting operations that share that lock.

This commit proposes to expose vcpu RUNSTATE_running via the xenforeignmemory
interface, thus avoiding both issuing the hypercall and holding the lock. For
that purpose, a new resource type named stats_table is added. The first frame
of this resource stores per-vcpu counters; the frame has one entry of type
struct vcpu_stats per vcpu. This frame is only allocated if the resource is
requested, and it is released after the domain is destroyed.

Note that this counter is updated on a hot path; thus, in this commit, copying
only happens when it is specifically required.

Note that the exposed structure is extensible in two ways. First, the structure
vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.
Second, new frames can be added in case new counters are required.

Signed-off-by: Matias Ezequiel Vara Larsen <matias.vara@vates.fr>
---
Changes in v2:
- rework to ensure that guest reads a coherent value by using a version
  number in the vcpu_stats structure
- add version to the vcpu_stats structure

Changes in v1:
- rework the allocation and releasing of the frames
- use the zero frame for per-vcpu counters that are listed as an array
- allocate vcpu stats frames only when the resource is requested
- rewrite commit message
- add the vcpu_stats structure to keep per-vcpu counters
- add the shared_vcpustatspage to keep an array of per-vcpu counters for a
  given domain
- declare the structures in a public header 
- define the vcpustats_page in the domain structure
---
 xen/arch/x86/hvm/hvm.c      |  2 +
 xen/common/memory.c         | 94 +++++++++++++++++++++++++++++++++++++
 xen/common/sched/core.c     | 16 +++++++
 xen/include/public/memory.h |  3 ++
 xen/include/public/vcpu.h   | 16 +++++++
 xen/include/xen/mm.h        |  2 +
 xen/include/xen/sched.h     |  5 ++
 7 files changed, 138 insertions(+)
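
As an illustration of the consumer side (not part of the patch), a dom0 tool
might map and read the counter as sketched below. The xenforeignmemory calls
are the existing libxenforeignmemory API; the resource type value, the
structure layout, and the barrier choice are assumptions based on this series.

#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>

/* Assumed to match the layout added to xen/include/public/vcpu.h. */
struct vcpu_stats {
    uint32_t version;
    uint32_t pad0;
    uint64_t runstate_running_time;
};

/* Assumption: take the real value from the patched public/memory.h. */
#define XENMEM_resource_stats_table 3

int main(void)
{
    xenforeignmemory_handle *fh = xenforeignmemory_open(NULL, 0);
    void *addr = NULL;
    xenforeignmemory_resource_handle *res;
    const volatile struct vcpu_stats *s;
    uint32_t v1, v2;
    uint64_t t;

    if ( !fh )
        return 1;

    res = xenforeignmemory_map_resource(fh, 1 /* domid, hypothetical */,
                                        XENMEM_resource_stats_table,
                                        0 /* id */, 0 /* frame */,
                                        1 /* nr_frames */,
                                        &addr, PROT_READ, 0);
    if ( !res )
        return 1;

    s = addr; /* entry 0 holds vCPU 0's counters */

    /* Seqlock-style read: retry while an update is in progress. */
    do {
        v1 = s->version;
        __sync_synchronize(); /* read barrier */
        t = s->runstate_running_time;
        __sync_synchronize();
        v2 = s->version;
    } while ( v1 != v2 || (v1 & 1) );

    printf("vcpu0 RUNSTATE_running time: %" PRIu64 "\n", t);

    xenforeignmemory_unmap_resource(fh, res);
    xenforeignmemory_close(fh);
    return 0;
}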

Comments

Jan Beulich Dec. 13, 2022, 5:02 p.m. UTC | #1
On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> This commit proposes a new mechanism to query the RUNSTATE_running counter for
> a given vcpu from a dom0 userspace application, exposing that counter through
> the acquire_resource interface. The current mechanism relies on
> XEN_DOMCTL_getvcpuinfo, which holds a single global domctl_lock for the entire
> hypercall and iterates over every vcpu in the system for every update, thus
> impacting operations that share that lock.
> 
> This commit proposes to expose vcpu RUNSTATE_running via the xenforeignmemory
> interface, thus avoiding both issuing the hypercall and holding the lock. For
> that purpose, a new resource type named stats_table is added. The first frame
> of this resource stores per-vcpu counters; the frame has one entry of type
> struct vcpu_stats per vcpu. This frame is only allocated if the resource is
> requested, and it is released after the domain is destroyed.
> 
> Note that this counter is updated on a hot path; thus, in this commit, copying
> only happens when it is specifically required.
> 
> Note that the exposed structure is extensible in two ways. First, the structure
> vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.

I'm afraid I don't see how this is "extensible". I would recommend that
you outline for yourself how a change would look to actually add
such a 2nd counter. While doing that keep in mind that whatever changes
you make may not break existing consumers.

It's also not clear what you mean with "fits in a frame": struct
shared_vcpustatspage is a container for an array with a single element.
I may guess (looking at just the public interface) that this really is
meant to be a flexible array (and hence should be marked as such - see
other uses of XEN_FLEX_ARRAY_DIM in the public headers). Yet if that's
the case, then a single page already won't suffice for a domain with
sufficiently many vCPU-s.
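
For reference, the flexible-array form alluded to would look roughly as
follows; the xen_-prefixed name is an assumption anticipating the name space
comment further down:

struct xen_shared_vcpustats {
    /* XEN_FLEX_ARRAY_DIM expands to the dialect-appropriate flexible
     * array dimension in Xen's public headers. */
    struct vcpu_stats vcpu_info[XEN_FLEX_ARRAY_DIM];
};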

> Second, new frames can be added in case new counters are required.

Are you talking of "new counters" here which aren't "new per-vcpu
counters"? Or else what's the difference from the 1st way?

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
>  
>      ioreq_server_destroy_all(d);
>  
> +    stats_free_vcpu_mfn(d);

How come this lives here? Surely this new feature should be not only
guest-type independent, but also arch-agnostic? Clearly you putting
the new data in struct domain (and not struct arch_domain or yet
deeper in the hierarchy) indicates you may have been meaning to make
it so.

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
>      return nr;
>  }
>  
> +unsigned int stats_table_max_frames(const struct domain *d)
> +{
> +    /* One frame per 512 vcpus. */
> +    return 1;
> +}

As alluded to earlier already - 1 isn't going to be suitable for
arbitrary size domains. (Yes, HVM domains are presently limited to
128 vCPU-s, but as per above this shouldn't be a HVM-only feature.)

> @@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
>      return nr_frames;
>  }
>  
> +void stats_free_vcpu_mfn(struct domain * d)
> +{
> +    struct page_info *pg = d->vcpustats_page.pg;
> +
> +    if ( !pg )
> +        return;
> +
> +    d->vcpustats_page.pg = NULL;
> +
> +    if ( d->vcpustats_page.va )
> +        unmap_domain_page_global(d->vcpustats_page.va);
> +
> +    d->vcpustats_page.va = NULL;

We ought to gain UNMAP_DOMAIN_PAGE_GLOBAL() for purposes like this one,
paralleling UNMAP_DOMAIN_PAGE().
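
A sketch of what such a macro could look like, paralleling the existing
UNMAP_DOMAIN_PAGE() in xen/include/xen/domain_page.h:

#define UNMAP_DOMAIN_PAGE_GLOBAL(p) do { \
    unmap_domain_page_global(p);         \
    (p) = NULL;                          \
} while ( false )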

> +    put_page_alloc_ref(pg);
> +    put_page_and_type(pg);
> +}
> +
> +static int stats_vcpu_alloc_mfn(struct domain *d)
> +{
> +    struct page_info *pg;
> +
> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> +
> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {

Style: Brace placement (more elsewhere).

> +        put_page_alloc_ref(pg);

This is not allowed when what you may put is the last reference.
See other examples we have in the tree.

> +        return -ENODATA;
> +    }
> +
> +    d->vcpustats_page.va = __map_domain_page_global(pg);
> +    if ( !d->vcpustats_page.va )
> +        goto fail;
> +
> +    d->vcpustats_page.pg = pg;
> +    clear_page(d->vcpustats_page.va);

I guess this should happen before you globally announce the
address.

> +    return 1;

Functions returning -errno on error want to return 0 on success,
unless e.g. multiple success indicators are needed.

> +fail:

Style: Label indentation.

> +    put_page_alloc_ref(pg);
> +    put_page_and_type(pg);
> +
> +    return -ENOMEM;
> +}
> +
> +static int acquire_stats_table(struct domain *d,
> +                                unsigned int id,
> +                                unsigned int frame,
> +                                unsigned int nr_frames,
> +                                xen_pfn_t mfn_list[])

Style: Indentation.

> +{
> +    mfn_t mfn;
> +    int rc;
> +    unsigned int i;
> +
> +    if ( !d )
> +        return -ENOENT;
> +
> +    for ( i = 0; i < nr_frames; i++ )
> +    {
> +        switch ( i )
> +        {
> +        case XENMEM_resource_stats_frame_vcpustats:

Isn't this supposed to be indexed by "id" (which presently you ignore
altogether, which can't be right)?

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -264,6 +264,8 @@ static inline void vcpu_runstate_change(
>  {
>      s_time_t delta;
>      struct sched_unit *unit = v->sched_unit;
> +    shared_vcpustatspage_t * vcpustats_va;

Style: Stray blank (more elsewhere).

> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>      }
>  
>      v->runstate.state = new_state;
> +
> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;

There should be no need for a cast here.

> +    if ( vcpustats_va )
> +    {
> +	vcpustats_va->vcpu_info[v->vcpu_id].version =

Style: Hard tab.

> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +        smp_wmb();
> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> +               &v->runstate.time[RUNSTATE_running],
> +               sizeof(v->runstate.time[RUNSTATE_running]));

Why memcpy() and not a plain assignment?

> +        smp_wmb();
> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +    }

Overall latching &vcpustats_va->vcpu_info[v->vcpu_id] into a helper
variable would likely help readability quite a bit.
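
Taking this together with the cast and memcpy() remarks above, the update
might then read (a sketch):

    shared_vcpustatspage_t *vcpustats_va = d->vcpustats_page.va;

    if ( vcpustats_va )
    {
        struct vcpu_stats *stats = &vcpustats_va->vcpu_info[v->vcpu_id];

        stats->version = version_update_begin(stats->version);
        smp_wmb();
        stats->runstate_running_time = v->runstate.time[RUNSTATE_running];
        smp_wmb();
        stats->version = version_update_end(stats->version);
    }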

> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>  
> +struct vcpu_stats{

Style: Missing blank.

> +    /* If the least-significant bit of the version number is set then an update

Style: Comment layout.

Jan
Jan Beulich Dec. 14, 2022, 7:29 a.m. UTC | #2
On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
>       return nr;
>  }
>  
> +unsigned int stats_table_max_frames(const struct domain *d)
> +{
> +    /* One frame per 512 vcpus. */
> +    return 1;
> +}

Beyond my earlier comment (and irrespective of this needing changing
anyway): I guess this "512" was not updated to match the now larger
size of struct vcpu_stats?

> +static int stats_vcpu_alloc_mfn(struct domain *d)
> +{
> +    struct page_info *pg;
> +
> +    pg = alloc_domheap_page(d, MEMF_no_refcount);

The ioreq and vmtrace resources are also allocated this way, but they're
HVM-specific. The one here being supposed to be VM-type independent, I'm
afraid such pages will be accessible by an "owning" PV domain (it'll
need to guess the MFN, but that's no excuse).

> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> +        put_page_alloc_ref(pg);
> +        return -ENODATA;
> +    }
> +
> +    d->vcpustats_page.va = __map_domain_page_global(pg);
> +    if ( !d->vcpustats_page.va )
> +        goto fail;
> +
> +    d->vcpustats_page.pg = pg;
> +    clear_page(d->vcpustats_page.va);

Beyond my earlier comment: I think that by the time the surrounding
hypercall returns the page ought to contain valid data. Otherwise I
see no way for the consumer to know from which point on the data is
going to be valid.

> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>      }
>  
>      v->runstate.state = new_state;
> +
> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> +    if ( vcpustats_va )
> +    {
> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +        smp_wmb();
> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> +               &v->runstate.time[RUNSTATE_running],
> +               sizeof(v->runstate.time[RUNSTATE_running]));
> +        smp_wmb();
> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> +    }

A further aspect to consider here is cache line ping-pong. I think the
per-vCPU elements of the array want to be big enough to not share a
cache line. The interface being generic this presents some challenge
in determining what the supposed size is to be. However, taking into
account the extensibility question, maybe the route to take is to
simply settle on a power-of-2 value somewhere between x86'es and Arm's
cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
or 1k?
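
One possible arrangement, assuming a 512-byte element size were settled on
(the figure is illustrative only):

struct vcpu_stats {
    uint32_t version;
    uint32_t pad0;
    uint64_t runstate_running_time;
    /* Pad the element so consecutive vCPUs never share a cache line and
     * future counters have room to grow. */
    uint8_t pad[512 - 16];
};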

> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>  
> +struct vcpu_stats{
> +    /* If the least-significant bit of the version number is set then an update
> +     * is in progress and the guest must wait to read a consistent set of values
> +     * This mechanism is similar to Linux's seqlock.
> +     */
> +    uint32_t version;
> +    uint32_t pad0;
> +    uint64_t runstate_running_time;
> +};
> +
> +struct shared_vcpustatspage {
> +    struct vcpu_stats vcpu_info[1];
> +};
> +
> +typedef struct shared_vcpustatspage shared_vcpustatspage_t;

For new additions please avoid further name space issues: All types
and alike want to be prefixed by "xen_".

Jan
Jan Beulich Dec. 14, 2022, 7:56 a.m. UTC | #3
On 14.12.2022 08:29, Jan Beulich wrote:
> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>> +{
>> +    struct page_info *pg;
>> +
>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> 
> The ioreq and vmtrace resources are also allocated this way, but they're
> HVM-specific. The one here being supposed to be VM-type independent, I'm
> afraid such pages will be accessible by an "owning" PV domain (it'll
> need to guess the MFN, but that's no excuse).

Which might be tolerable if it then can't write to the page. That would
require "locking" the page r/o (from guest pov), which ought to be
possible by leveraging a variant of what share_xen_page_with_guest()
does: It marks pages PGT_none with a single type ref. This would mean
...

>> +    if ( !pg )
>> +        return -ENOMEM;
>> +
>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {

... using PGT_none here. Afaict this _should_ work, but we have no
precedent of doing so in the tree, and I may be overlooking something
which prevents that from working.

Jan
Matias Ezequiel Vara Larsen Feb. 16, 2023, 2:48 p.m. UTC | #4
Hello Jan, and thanks for your comments. I addressed most of them, but I still
have some questions. Please find them below:

On Tue, Dec 13, 2022 at 06:02:55PM +0100, Jan Beulich wrote:
> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> > This commit proposes a new mechanism to query the RUNSTATE_running counter for
> > a given vcpu from a dom0 userspace application, exposing that counter through
> > the acquire_resource interface. The current mechanism relies on
> > XEN_DOMCTL_getvcpuinfo, which holds a single global domctl_lock for the entire
> > hypercall and iterates over every vcpu in the system for every update, thus
> > impacting operations that share that lock.
> > 
> > This commit proposes to expose vcpu RUNSTATE_running via the xenforeignmemory
> > interface, thus avoiding both issuing the hypercall and holding the lock. For
> > that purpose, a new resource type named stats_table is added. The first frame
> > of this resource stores per-vcpu counters; the frame has one entry of type
> > struct vcpu_stats per vcpu. This frame is only allocated if the resource is
> > requested, and it is released after the domain is destroyed.
> > 
> > Note that this counter is updated on a hot path; thus, in this commit, copying
> > only happens when it is specifically required.
> > 
> > Note that the exposed structure is extensible in two ways. First, the structure
> > vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.
> 
> I'm afraid I don't see how this is "extensible". I would recommend that
> you outline for yourself how a change would look to actually add
> such a 2nd counter. While doing that keep in mind that whatever changes
> you make may not break existing consumers.
> 
> It's also not clear what you mean with "fits in a frame": struct
> shared_vcpustatspage is a container for an array with a single element.
> I may guess (looking at just the public interface) that this really is
> meant to be a flexible array (and hence should be marked as such - see
> other uses of XEN_FLEX_ARRAY_DIM in the public headers). Yet if that's
> the case, then a single page already won't suffice for a domain with
> sufficiently many vCPU-s.
> 

I tackled this by using "d->max_vcpus" to calculate the number of required frames
to allocate for a given guest. Also, I added a new type-specific resource named
XENMEM_resource_stats_table_id_vcpustats to host per-vcpu counters. I
completely forgot the "id" in the previous series.

> > Second, new frames can be added in case new counters are required.
> 
> Are you talking of "new counters" here which aren't "new per-vcpu
> counters"? Or else what's the difference from the 1st way?

Yes, I was talking about that sort of counter. In the next series, such
counters could be added by introducing a new type-specific resource id.

> 
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
> >  
> >      ioreq_server_destroy_all(d);
> >  
> > +    stats_free_vcpu_mfn(d);
> 
> How come this lives here? Surely this new feature should be not only
> guest-type independent, but also arch-agnostic? Clearly you putting
> the new data in struct domain (and not struct arch_domain or yet
> deeper in the hierarchy) indicates you may have been meaning to make
> it so.
> 

The whole feature is meant to be guest-type independent and also arch-agnostic.
Would it be better to put it in xen/common/domain.c:domain_kill()?
 
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
> >      return nr;
> >  }
> >  
> > +unsigned int stats_table_max_frames(const struct domain *d)
> > +{
> > +    /* One frame per 512 vcpus. */
> > +    return 1;
> > +}
> 
> As alluded to earlier already - 1 isn't going to be suitable for
> arbitrary size domains. (Yes, HVM domains are presently limited to
> 128 vCPU-s, but as per above this shouldn't be a HVM-only feature.)
>

I am going to use "d->max_vcpus" to calculate the number of required frames for
per-vcpu counters for a given guest.
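
Presumably along the lines of the following sketch, which matches the formula
given later in this thread:

unsigned int stats_table_max_frames(const struct domain *d)
{
    return DIV_ROUND_UP(d->max_vcpus * sizeof(struct vcpu_stats), PAGE_SIZE);
}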
 
> > @@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
> >      return nr_frames;
> >  }
> >  
> > +void stats_free_vcpu_mfn(struct domain * d)
> > +{
> > +    struct page_info *pg = d->vcpustats_page.pg;
> > +
> > +    if ( !pg )
> > +        return;
> > +
> > +    d->vcpustats_page.pg = NULL;
> > +
> > +    if ( d->vcpustats_page.va )
> > +        unmap_domain_page_global(d->vcpustats_page.va);
> > +
> > +    d->vcpustats_page.va = NULL;
> 
> We ought to gain UNMAP_DOMAIN_PAGE_GLOBAL() for purposes like this one,
> paralleling UNMAP_DOMAIN_PAGE().
> 

I do not understand this comment. Could you elaborate?

> > +    put_page_alloc_ref(pg);
> > +    put_page_and_type(pg);
> > +}
> > +
> > +static int stats_vcpu_alloc_mfn(struct domain *d)
> > +{
> > +    struct page_info *pg;
> > +
> > +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> > +
> > +    if ( !pg )
> > +        return -ENOMEM;
> > +
> > +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> 
> Style: Brace placement (more elsewhere).
> 
> > +        put_page_alloc_ref(pg);
> 
> This is not allowed when what you may put is the last reference.
> See other examples we have in the tree.
> 

I do not understand this comment. Could you point me to an example? I used
ioreq_server_alloc_mfn() as an example, but it may not be a good one.

> > +        return -ENODATA;
> > +    }
> > +
> > +    d->vcpustats_page.va = __map_domain_page_global(pg);
> > +    if ( !d->vcpustats_page.va )
> > +        goto fail;
> > +
> > +    d->vcpustats_page.pg = pg;
> > +    clear_page(d->vcpustats_page.va);
> 
> I guess this should happen before you globally announce the
> address.
> 

If I understand correctly, I should invoke clear_page() before I assign the
address to "d->vcpustats_page.va". Am I right?

> > +    return 1;
> 
> Functions returning -errno on error want to return 0 on success,
> unless e.g. multiple success indicators are needed.
> 
> > +fail:
> 
> Style: Label indentation.
> 
> > +    put_page_alloc_ref(pg);
> > +    put_page_and_type(pg);
> > +
> > +    return -ENOMEM;
> > +}
> > +
> > +static int acquire_stats_table(struct domain *d,
> > +                                unsigned int id,
> > +                                unsigned int frame,
> > +                                unsigned int nr_frames,
> > +                                xen_pfn_t mfn_list[])
> 
> Style: Indentation.
> 
> > +{
> > +    mfn_t mfn;
> > +    int rc;
> > +    unsigned int i;
> > +
> > +    if ( !d )
> > +        return -ENOENT;
> > +
> > +    for ( i = 0; i < nr_frames; i++ )
> > +    {
> > +        switch ( i )
> > +        {
> > +        case XENMEM_resource_stats_frame_vcpustats:
> 
> Isn't this supposed to be indexed by "id" (which presently you ignore
> altogether, which can't be right)?

I forgot the "id". I added a new type-specific resource id in the next
series. 
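
A sketch of how the reworked function might dispatch on the type-specific id;
vcpustats_mfn() is a hypothetical helper returning the MFN of a given frame of
the per-vcpu counter table:

static int acquire_stats_table(struct domain *d, unsigned int id,
                               unsigned int frame, unsigned int nr_frames,
                               xen_pfn_t mfn_list[])
{
    unsigned int i;

    if ( !d )
        return -ENOENT;

    switch ( id )
    {
    case XENMEM_resource_stats_table_id_vcpustats:
        for ( i = 0; i < nr_frames; i++ )
            mfn_list[i] = mfn_x(vcpustats_mfn(d, frame + i));
        return nr_frames;

    default:
        return -EINVAL;
    }
}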

> 
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -264,6 +264,8 @@ static inline void vcpu_runstate_change(
> >  {
> >      s_time_t delta;
> >      struct sched_unit *unit = v->sched_unit;
> > +    shared_vcpustatspage_t * vcpustats_va;
> 
> Style: Stray blank (more elsewhere).
> 
> > @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
> >      }
> >  
> >      v->runstate.state = new_state;
> > +
> > +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> 
> There should be no need for a cast here.
> 
> > +    if ( vcpustats_va )
> > +    {
> > +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> 
> Style: Hard tab.
> 
> > +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +        smp_wmb();
> > +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> > +               &v->runstate.time[RUNSTATE_running],
> > +               sizeof(v->runstate.time[RUNSTATE_running]));
> 
> Why memcpy() and not a plain assignment?
> 
> > +        smp_wmb();
> > +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> > +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +    }
> 
> Overall latching &vcpustats_va->vcpu_info[v->vcpu_id] into a helper
> variable would likely help readability quite a bit.
> 
> > --- a/xen/include/public/vcpu.h
> > +++ b/xen/include/public/vcpu.h
> > @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
> >  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
> >  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
> >  
> > +struct vcpu_stats{
> 
> Style: Missing blank.
> 
> > +    /* If the least-significant bit of the version number is set then an update
> 
> Style: Comment layout.
> 

Thanks for the comments regarding style.

Matias
Matias Ezequiel Vara Larsen Feb. 16, 2023, 3:07 p.m. UTC | #5
On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1078,6 +1078,12 @@ unsigned int ioreq_server_max_frames(const struct domain *d)
> >       return nr;
> >  }
> >  
> > +unsigned int stats_table_max_frames(const struct domain *d)
> > +{
> > +    /* One frame per 512 vcpus. */
> > +    return 1;
> > +}
> 
> Beyond my earlier comment (and irrespective of this needing changing
> anyway): I guess this "512" was not updated to match the now larger
> size of struct vcpu_stats?

In the next series, I am calculating the number of frames by:

nr = DIV_ROUND_UP(d->max_vcpus * sizeof(struct vcpu_stats), PAGE_SIZE);

> 
> > +static int stats_vcpu_alloc_mfn(struct domain *d)
> > +{
> > +    struct page_info *pg;
> > +
> > +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> 
> The ioreq and vmtrace resources are also allocated this way, but they're
> HVM-specific. The one here being supposed to be VM-type independent, I'm
> afraid such pages will be accessible by an "owning" PV domain (it'll
> need to guess the MFN, but that's no excuse).
> 
> > +    if ( !pg )
> > +        return -ENOMEM;
> > +
> > +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> > +        put_page_alloc_ref(pg);
> > +        return -ENODATA;
> > +    }
> > +
> > +    d->vcpustats_page.va = __map_domain_page_global(pg);
> > +    if ( !d->vcpustats_page.va )
> > +        goto fail;
> > +
> > +    d->vcpustats_page.pg = pg;
> > +    clear_page(d->vcpustats_page.va);
> 
> Beyond my earlier comment: I think that by the time the surrounding
> hypercall returns the page ought to contain valid data. Otherwise I
> see no way for the consumer to know from which point on the data is
> going to be valid.
> 
> > @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
> >      }
> >  
> >      v->runstate.state = new_state;
> > +
> > +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> > +    if ( vcpustats_va )
> > +    {
> > +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> > +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +        smp_wmb();
> > +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> > +               &v->runstate.time[RUNSTATE_running],
> > +               sizeof(v->runstate.time[RUNSTATE_running]));
> > +        smp_wmb();
> > +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> > +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> > +    }
> 
> A further aspect to consider here is cache line ping-pong. I think the
> per-vCPU elements of the array want to be big enough to not share a
> cache line. The interface being generic this presents some challenge
> in determining what the supposed size is to be. However, taking into
> account the extensibility question, maybe the route to take is to
> simply settle on a power-of-2 value somewhere between x86'es and Arm's
> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
> or 1k?
> 

I do not know how to address this. I was thinking of aligning each vcpu_stats
instance to a multiple of the cache line, picking the first multiple that is
bigger than the size of the vcpu_stats structure. For example, the structure is
currently 16 bytes, so I would align each instance in a frame to 64 bytes.
Would that make sense?

> > --- a/xen/include/public/vcpu.h
> > +++ b/xen/include/public/vcpu.h
> > @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
> >  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
> >  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
> >  
> > +struct vcpu_stats{
> > +    /* If the least-significant bit of the version number is set then an update
> > +     * is in progress and the guest must wait to read a consistent set of values
> > +     * This mechanism is similar to Linux's seqlock.
> > +     */
> > +    uint32_t version;
> > +    uint32_t pad0;
> > +    uint64_t runstate_running_time;
> > +};
> > +
> > +struct shared_vcpustatspage {
> > +    struct vcpu_stats vcpu_info[1];
> > +};
> > +
> > +typedef struct shared_vcpustatspage shared_vcpustatspage_t;
> 
> For new additions please avoid further name space issues: All types
> and alike want to be prefixed by "xen_".
>

Should I name it "xen_shared_vcpustatspage_t" instead?

Matias
Jan Beulich Feb. 16, 2023, 3:10 p.m. UTC | #6
On 16.02.2023 15:48, Matias Ezequiel Vara Larsen wrote:
> On Tue, Dec 13, 2022 at 06:02:55PM +0100, Jan Beulich wrote:
>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>> This commit proposes a new mechanism to query the RUNSTATE_running counter for
>>> a given vcpu from a dom0 userspace application, exposing that counter through
>>> the acquire_resource interface. The current mechanism relies on
>>> XEN_DOMCTL_getvcpuinfo, which holds a single global domctl_lock for the entire
>>> hypercall and iterates over every vcpu in the system for every update, thus
>>> impacting operations that share that lock.
>>>
>>> This commit proposes to expose vcpu RUNSTATE_running via the xenforeignmemory
>>> interface, thus avoiding both issuing the hypercall and holding the lock. For
>>> that purpose, a new resource type named stats_table is added. The first frame
>>> of this resource stores per-vcpu counters; the frame has one entry of type
>>> struct vcpu_stats per vcpu. This frame is only allocated if the resource is
>>> requested, and it is released after the domain is destroyed.
>>>
>>> Note that this counter is updated on a hot path; thus, in this commit, copying
>>> only happens when it is specifically required.
>>>
>>> Note that the exposed structure is extensible in two ways. First, the structure
>>> vcpu_stats can be extended with new per-vcpu counters while it fits in a frame.
>>
>> I'm afraid I don't see how this is "extensible". I would recommend that
>> you outline for yourself how a change would look to actually add
>> such a 2nd counter. While doing that keep in mind that whatever changes
>> you make may not break existing consumers.
>>
>> It's also not clear what you mean with "fits in a frame": struct
>> shared_vcpustatspage is a container for an array with a single element.
>> I may guess (looking at just the public interface) that this really is
>> meant to be a flexible array (and hence should be marked as such - see
>> other uses of XEN_FLEX_ARRAY_DIM in the public headers). Yet if that's
>> the case, then a single page already won't suffice for a domain with
>> sufficiently many vCPU-s.
>>
> 
> I tackled this by using "d->max_vcpus" to calculate the number of required frames
> to allocate for a given guest. Also, I added a new type-specific resource named
> XENMEM_resource_stats_table_id_vcpustats to host per-vcpu counters. I
> completely forgot the "id" in the previous series.

May I suggest that before you submit a new version of your patches, you
make yourself (and then perhaps submit for commenting) a layout of the
data structures you want to introduce, including how they interact and
what "granularity" (global, per-domain, per-vCPU, per-pCPU, or alike)
they are. While doing that, as previously suggested, put yourself in
the position of someone later wanting to add another counter. With the
initial logic there, such an extension should then end up being pretty
mechanical, or else the arrangement likely needs further adjustment.

>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -741,6 +741,8 @@ void hvm_domain_relinquish_resources(struct domain *d)
>>>  
>>>      ioreq_server_destroy_all(d);
>>>  
>>> +    stats_free_vcpu_mfn(d);
>>
>> How come this lives here? Surely this new feature should be not only
>> guest-type independent, but also arch-agnostic? Clearly you putting
>> the new data in struct domain (and not struct arch_domain or yet
>> deeper in the hierarchy) indicates you may have been meaning to make
>> it so.
>>
> 
> The whole feature is meant to be guest-type independent and also arch-agnostic.
> Would it be better to put it in xen/common/domain.c:domain_kill()?

Likely, and the earlier this is (safely) possible, the better.

>>> @@ -1162,6 +1171,88 @@ static int acquire_vmtrace_buf(
>>>      return nr_frames;
>>>  }
>>>  
>>> +void stats_free_vcpu_mfn(struct domain * d)
>>> +{
>>> +    struct page_info *pg = d->vcpustats_page.pg;
>>> +
>>> +    if ( !pg )
>>> +        return;
>>> +
>>> +    d->vcpustats_page.pg = NULL;
>>> +
>>> +    if ( d->vcpustats_page.va )
>>> +        unmap_domain_page_global(d->vcpustats_page.va);
>>> +
>>> +    d->vcpustats_page.va = NULL;
>>
>> We ought to gain UNMAP_DOMAIN_PAGE_GLOBAL() for purposes like this one,
>> paralleling UNMAP_DOMAIN_PAGE().
>>
> 
> I do not understand this comment. Could you elaborate?

The last four lines of code would better be collapsed to a single one,
using the mentioned yet-to-be-introduced construct. I assume you did
look up UNMAP_DOMAIN_PAGE() to spot its difference from
unmap_domain_page()?
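
I.e., with such a macro in place, the tail of stats_free_vcpu_mfn() would
shrink to (a sketch):

    d->vcpustats_page.pg = NULL;
    UNMAP_DOMAIN_PAGE_GLOBAL(d->vcpustats_page.va);
    put_page_alloc_ref(pg);
    put_page_and_type(pg);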

>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>> +{
>>> +    struct page_info *pg;
>>> +
>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>> +
>>> +    if ( !pg )
>>> +        return -ENOMEM;
>>> +
>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>
>> Style: Brace placement (more elsewhere).
>>
>>> +        put_page_alloc_ref(pg);
>>
>> This is not allowed when what you may put is the last reference.
>> See other examples we have in the tree.
>>
> 
> I do not understand this comment. Could you point me to an example? I used
> ioreq_server_alloc_mfn() as an example, but it may not be a good one.

That's an okay example; what's not okay is that you altered what is
done there. There is a reason that the other function doesn't use
put_page_alloc_ref() like you do. And I would assume you've looked
up put_page_alloc_ref() and found the comment there that explains
things.
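
For comparison, the shape used in the tree (e.g. in the ioreq code) is
roughly:

    if ( !get_page_and_type(pg, d, PGT_writable_page) )
    {
        /*
         * The domain can't possibly know about this page yet, so failure
         * here is a clear indication of something fishy going on.
         */
        domain_crash(d);
        return -ENODATA;
    }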

>>> +        return -ENODATA;
>>> +    }
>>> +
>>> +    d->vcpustats_page.va = __map_domain_page_global(pg);
>>> +    if ( !d->vcpustats_page.va )
>>> +        goto fail;
>>> +
>>> +    d->vcpustats_page.pg = pg;
>>> +    clear_page(d->vcpustats_page.va);
>>
>> I guess this should happen before you globally announce the
>> address.
>>
> 
> If I understand correctly, I should invoke clear_page() before I assign the
> address to "d->vcpustats_page.va". Am I right?

Yes.
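
I.e., a sketch of the resulting order, clearing the page before publishing
the mapping:

    void *va = __map_domain_page_global(pg);

    if ( !va )
        goto fail;

    clear_page(va);
    d->vcpustats_page.pg = pg;
    d->vcpustats_page.va = va;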

Jan
Jan Beulich Feb. 16, 2023, 3:15 p.m. UTC | #7
On 16.02.2023 16:07, Matias Ezequiel Vara Larsen wrote:
> On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>>>      }
>>>  
>>>      v->runstate.state = new_state;
>>> +
>>> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
>>> +    if ( vcpustats_va )
>>> +    {
>>> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
>>> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>> +        smp_wmb();
>>> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
>>> +               &v->runstate.time[RUNSTATE_running],
>>> +               sizeof(v->runstate.time[RUNSTATE_running]));
>>> +        smp_wmb();
>>> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
>>> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>> +    }
>>
>> A further aspect to consider here is cache line ping-pong. I think the
>> per-vCPU elements of the array want to be big enough to not share a
>> cache line. The interface being generic this presents some challenge
>> in determining what the supposed size is to be. However, taking into
>> account the extensibility question, maybe the route to take is to
>> simply settle on a power-of-2 value somewhere between x86'es and Arm's
>> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
>> or 1k?
>>
> 
> I do not know how to address this. I was thinking of aligning each vcpu_stats
> instance to a multiple of the cache line, picking the first multiple that is
> bigger than the size of the vcpu_stats structure. For example, the structure is
> currently 16 bytes, so I would align each instance in a frame to 64 bytes.
> Would that make sense?

Well, 64 may be an option, but I gave higher numbers for a reason. One thing
I don't know is what common cache line sizes are on Arm or e.g. RISC-V.

>>> --- a/xen/include/public/vcpu.h
>>> +++ b/xen/include/public/vcpu.h
>>> @@ -235,6 +235,22 @@ struct vcpu_register_time_memory_area {
>>>  typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
>>>  DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
>>>  
>>> +struct vcpu_stats{
>>> +    /* If the least-significant bit of the version number is set then an update
>>> +     * is in progress and the guest must wait to read a consistent set of values
>>> +     * This mechanism is similar to Linux's seqlock.
>>> +     */
>>> +    uint32_t version;
>>> +    uint32_t pad0;
>>> +    uint64_t runstate_running_time;
>>> +};
>>> +
>>> +struct shared_vcpustatspage {
>>> +    struct vcpu_stats vcpu_info[1];
>>> +};
>>> +
>>> +typedef struct shared_vcpustatspage shared_vcpustatspage_t;
>>
>> For new additions please avoid further name space issues: All types
>> and alike want to be prefixed by "xen_".
> 
> Should I name it "xen_shared_vcpustatspage_t" instead?

Yes, that would fulfill the name space requirements. It's getting longish,
so you may want to think about abbreviating it some. For example, I'm not
sure the "page" in the name is really necessary.

Jan
Matias Ezequiel Vara Larsen Feb. 17, 2023, 8:50 a.m. UTC | #8
On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> On 14.12.2022 08:29, Jan Beulich wrote:
> > On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >> +{
> >> +    struct page_info *pg;
> >> +
> >> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> > 
> > The ioreq and vmtrace resources are also allocated this way, but they're
> > HVM-specific. The one here being supposed to be VM-type independent, I'm
> > afraid such pages will be accessible by an "owning" PV domain (it'll
> > need to guess the MFN, but that's no excuse).
> 
> Which might be tolerable if it then can't write to the page. That would
> require "locking" the page r/o (from guest pov), which ought to be
> possible by leveraging a variant of what share_xen_page_with_guest()
> does: It marks pages PGT_none with a single type ref. This would mean
> ...
> 
> >> +    if ( !pg )
> >> +        return -ENOMEM;
> >> +
> >> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> 
> ... using PGT_none here. Afaict this _should_ work, but we have no
> precedent of doing so in the tree, and I may be overlooking something
> which prevents that from working.
> 

I do not fully understand this. I checked share_xen_page_with_guest() and I
think you're talking about doing something like this for each allocated page to
set them ro from a pv guest pov:

pg->u.inuse.type_info = PGT_none;
pg->u.inuse.type_info |= PGT_validated | 1;
page_set_owner(page, d); // not sure if this is needed

Then, I should use PGT_none instead of PGT_writable_page in
get_page_and_type(). Am I right?

Matias
Jan Beulich Feb. 17, 2023, 8:57 a.m. UTC | #9
On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>> On 14.12.2022 08:29, Jan Beulich wrote:
>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>> +{
>>>> +    struct page_info *pg;
>>>> +
>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>
>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>> need to guess the MFN, but that's no excuse).
>>
>> Which might be tolerable if it then can't write to the page. That would
>> require "locking" the page r/o (from guest pov), which ought to be
>> possible by leveraging a variant of what share_xen_page_with_guest()
>> does: It marks pages PGT_none with a single type ref. This would mean
>> ...
>>
>>>> +    if ( !pg )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>
>> ... using PGT_none here. Afaict this _should_ work, but we have no
>> precedent of doing so in the tree, and I may be overlooking something
>> which prevents that from working.
>>
> 
> I do not fully understand this. I checked share_xen_page_with_guest() and I
> think you're talking about doing something like this for each allocated page to
> set them ro from a pv guest pov:
> 
> pg->u.inuse.type_info = PGT_none;
> pg->u.inuse.type_info |= PGT_validated | 1;
> page_set_owner(page, d); // not sure if this is needed
> 
> Then, I should use PGT_none instead of PGT_writable_page in
> get_page_and_type(). Am I right?

No, if at all possible you should avoid open-coding anything. As said,
simply passing PGT_none to get_page_and_type() ought to work (again, as
said, unless I'm overlooking something). share_xen_page_with_guest()
can do what it does because the page isn't owned yet. For a page with
owner you may not fiddle with type_info in such an open-coded manner.

Jan
Matias Ezequiel Vara Larsen Feb. 17, 2023, 9:29 a.m. UTC | #10
On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> > On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >> On 14.12.2022 08:29, Jan Beulich wrote:
> >>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>> +{
> >>>> +    struct page_info *pg;
> >>>> +
> >>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>
> >>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>> need to guess the MFN, but that's no excuse).
> >>
> >> Which might be tolerable if it then can't write to the page. That would
> >> require "locking" the page r/o (from guest pov), which ought to be
> >> possible by leveraging a variant of what share_xen_page_with_guest()
> >> does: It marks pages PGT_none with a single type ref. This would mean
> >> ...
> >>
> >>>> +    if ( !pg )
> >>>> +        return -ENOMEM;
> >>>> +
> >>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>
> >> ... using PGT_none here. Afaict this _should_ work, but we have no
> >> precedent of doing so in the tree, and I may be overlooking something
> >> which prevents that from working.
> >>
> > 
> > I do not fully understand this. I checked share_xen_page_with_guest() and I
> > think you're talking about doing something like this for each allocated page to
> > set them ro from a pv guest pov:
> > 
> > pg->u.inuse.type_info = PGT_none;
> > pg->u.inuse.type_info |= PGT_validated | 1;
> > page_set_owner(page, d); // not sure if this is needed
> > 
> > Then, I should use PGT_none instead of PGT_writable_page in
> > get_page_and_type(). Am I right?
> 
> No, if at all possible you should avoid open-coding anything. As said,
> simply passing PGT_none to get_page_and_type() ought to work (again, as
> said, unless I'm overlooking something). share_xen_page_with_guest()
> can do what it does because the page isn't owned yet. For a page with
> owner you may not fiddle with type_info in such an open-coded manner.
> 

Thanks. I got the following bug when passing PGT_none:

(XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
(XEN) Xen BUG at mm.c:2643

I have not yet investigated why this happened.

Matias
Jan Beulich Feb. 17, 2023, 2:10 p.m. UTC | #11
On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>>>> On 14.12.2022 08:29, Jan Beulich wrote:
>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>>>> +{
>>>>>> +    struct page_info *pg;
>>>>>> +
>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>>>
>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>>>> need to guess the MFN, but that's no excuse).
>>>>
>>>> Which might be tolerable if it then can't write to the page. That would
>>>> require "locking" the page r/o (from guest pov), which ought to be
>>>> possible by leveraging a variant of what share_xen_page_with_guest()
>>>> does: It marks pages PGT_none with a single type ref. This would mean
>>>> ...
>>>>
>>>>>> +    if ( !pg )
>>>>>> +        return -ENOMEM;
>>>>>> +
>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>>>
>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
>>>> precedent of doing so in the tree, and I may be overlooking something
>>>> which prevents that from working.
>>>>
>>>
>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
>>> think you're talking about doing something like this for each allocated page to
>>> set them ro from a pv guest pov:
>>>
>>> pg->u.inuse.type_info = PGT_none;
>>> pg->u.inuse.type_info |= PGT_validated | 1;
>>> page_set_owner(page, d); // not sure if this is needed
>>>
>>> Then, I should use PGT_none instead of PGT_writable_page in
>>> get_page_and_type(). Am I right?
>>
>> No, if at all possible you should avoid open-coding anything. As said,
>> simply passing PGT_none to get_page_and_type() ought to work (again, as
>> said, unless I'm overlooking something). share_xen_page_with_guest()
>> can do what it does because the page isn't owned yet. For a page with
>> owner you may not fiddle with type_info in such an open-coded manner.
>>
> 
> Thanks. I got the following bug when passing PGT_none:
> 
> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> (XEN) Xen BUG at mm.c:2643

The caller of the function needs to avoid the call not only for writable
and shared pages, but also for this new case of PGT_none.

Jan
Matias Ezequiel Vara Larsen Feb. 20, 2023, 4:51 p.m. UTC | #12
On Thu, Feb 16, 2023 at 04:15:29PM +0100, Jan Beulich wrote:
> On 16.02.2023 16:07, Matias Ezequiel Vara Larsen wrote:
> > On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
> >> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
> >>>      }
> >>>  
> >>>      v->runstate.state = new_state;
> >>> +
> >>> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
> >>> +    if ( vcpustats_va )
> >>> +    {
> >>> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
> >>> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
> >>> +        smp_wmb();
> >>> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
> >>> +               &v->runstate.time[RUNSTATE_running],
> >>> +               sizeof(v->runstate.time[RUNSTATE_running]));
> >>> +        smp_wmb();
> >>> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
> >>> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
> >>> +    }
> >>
> >> A further aspect to consider here is cache line ping-pong. I think the
> >> per-vCPU elements of the array want to be big enough to not share a
> >> cache line. The interface being generic this presents some challenge
> >> in determining what the supposed size is to be. However, taking into
> >> account the extensibility question, maybe the route to take is to
> >> simply settle on a power-of-2 value somewhere between x86'es and Arm's
> >> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
> >> or 1k?
> >>
> > 
> > I do not know how to address this. I was thinking of aligning each vcpu_stats
> > instance to a multiple of the cache line, picking the first multiple that is
> > bigger than the size of the vcpu_stats structure. For example, the structure is
> > currently 16 bytes, so I would align each instance in a frame to 64 bytes.
> > Would that make sense?
> 
> Well, 64 may be an option, but I gave higher numbers for a reason. One thing
> I don't know is what common cache line sizes are on Arm or e.g. RISC-V.

Thanks. I found that structures that require cache alignment are defined with
"__cacheline_aligned", which uses L1_CACHE_BYTES. For example, on x86 this
aligns to 128 bytes. What is the reason to use a higher value like 512 bytes or
1k?

Thanks, Matias.
Jan Beulich Feb. 21, 2023, 8:48 a.m. UTC | #13
On 20.02.2023 17:51, Matias Ezequiel Vara Larsen wrote:
> On Thu, Feb 16, 2023 at 04:15:29PM +0100, Jan Beulich wrote:
>> On 16.02.2023 16:07, Matias Ezequiel Vara Larsen wrote:
>>> On Wed, Dec 14, 2022 at 08:29:53AM +0100, Jan Beulich wrote:
>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>> @@ -287,6 +289,20 @@ static inline void vcpu_runstate_change(
>>>>>      }
>>>>>  
>>>>>      v->runstate.state = new_state;
>>>>> +
>>>>> +    vcpustats_va = (shared_vcpustatspage_t*)d->vcpustats_page.va;
>>>>> +    if ( vcpustats_va )
>>>>> +    {
>>>>> +	vcpustats_va->vcpu_info[v->vcpu_id].version =
>>>>> +	    version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>>>> +        smp_wmb();
>>>>> +        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
>>>>> +               &v->runstate.time[RUNSTATE_running],
>>>>> +               sizeof(v->runstate.time[RUNSTATE_running]));
>>>>> +        smp_wmb();
>>>>> +        vcpustats_va->vcpu_info[v->vcpu_id].version =
>>>>> +            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
>>>>> +    }
>>>>
>>>> A further aspect to consider here is cache line ping-pong. I think the
>>>> per-vCPU elements of the array want to be big enough to not share a
>>>> cache line. The interface being generic this presents some challenge
>>>> in determining what the supposed size is to be. However, taking into
>>>> account the extensibility question, maybe the route to take is to
>>>> simply settle on a power-of-2 value somewhere between x86'es and Arm's
>>>> cache line sizes and the pretty common page size of 4k, e.g. 512 bytes
>>>> or 1k?
>>>>
>>>
>>> I do not know how to address this. I was thinking of aligning each vcpu_stats
>>> instance to a multiple of the cache line, picking the first multiple that is
>>> bigger than the size of the vcpu_stats structure. For example, the structure is
>>> currently 16 bytes, so I would align each instance in a frame to 64 bytes.
>>> Would that make sense?
>>
>> Well, 64 may be an option, but I gave higher numbers for a reason. One thing
>> I don't know is what common cache line sizes are on Arm or e.g. RISC-V.
> 
> Thanks. I found that structures that require cache alignment are defined with
> "__cacheline_aligned", which uses L1_CACHE_BYTES. For example, on x86 this
> aligns to 128 bytes. What is the reason to use a higher value like 512 bytes or
> 1k?

You cannot bake an x86 property (which may even change: at some point we may
choose to drop the 128-byte special for the very few CPUs actually using
such, when the majority uses 64-byte cache lines) into the public interface.
You also don't want to make an aspect of the public interface arch-dependent
when not really needed. My suggestion for a higher value was in the hope that
we may never see a port to an architecture with cache lines wider than, say,
512 bytes. What exactly the value should be is of course up for discussion,
but I think it wants to include some slack on top of what we currently
support (arch-wise).

Jan
Matias Ezequiel Vara Larsen Feb. 23, 2023, 12:16 p.m. UTC | #14
On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> > On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> >> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> >>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >>>> On 14.12.2022 08:29, Jan Beulich wrote:
> >>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>>>> +{
> >>>>>> +    struct page_info *pg;
> >>>>>> +
> >>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>>>
> >>>>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>>>> need to guess the MFN, but that's no excuse).
> >>>>
> >>>> Which might be tolerable if it then can't write to the page. That would
> >>>> require "locking" the page r/o (from guest pov), which ought to be
> >>>> possible by leveraging a variant of what share_xen_page_with_guest()
> >>>> does: It marks pages PGT_none with a single type ref. This would mean
> >>>> ...
> >>>>
> >>>>>> +    if ( !pg )
> >>>>>> +        return -ENOMEM;
> >>>>>> +
> >>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>>>
> >>>> ... using PGT_none here. Afaict this _should_ work, but we have no
> >>>> precedent of doing so in the tree, and I may be overlooking something
> >>>> which prevents that from working.
> >>>>
> >>>
> >>> I do not fully understand this. I checked share_xen_page_with_guest() and I
> >>> think you're talking about doing something like this for each allocated page to
> >>> set them ro from a pv guest pov:
> >>>
> >>> pg->u.inuse.type_info = PGT_none;
> >>> pg->u.inuse.type_info |= PGT_validated | 1;
> >>> page_set_owner(page, d); // not sure if this is needed
> >>>
> >>> Then, I should use PGT_none instead of PGT_writable_page in
> >>> get_page_and_type(). Am I right?
> >>
> >> No, if at all possible you should avoid open-coding anything. As said,
> >> simply passing PGT_none to get_page_and_type() ought to work (again, as
> >> said, unless I'm overlooking something). share_xen_page_with_guest()
> >> can do what it does because the page isn't owned yet. For a page with
> >> owner you may not fiddle with type_info in such an open-coded manner.
> >>
> > 
> > Thanks. I got the following bug when passing PGT_none:
> > 
> > (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> > (XEN) Xen BUG at mm.c:2643
> 
> The caller of the function needs to avoid the call not only for writable
> and shared pages, but also for this new case of PGT_none.

Thanks. If I understand correctly, _get_page_type() needs to avoid calling
validate_page() when type = PGT_none. For the writable and shared pages, this
is avoided by setting nx |= PGT_validated. Am I right?

Matias
Jan Beulich Feb. 23, 2023, 12:42 p.m. UTC | #15
On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
> On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
>> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
>>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
>>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
>>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
>>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>>>>>> +{
>>>>>>>> +    struct page_info *pg;
>>>>>>>> +
>>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>>>>>
>>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>>>>>> need to guess the MFN, but that's no excuse).
>>>>>>
>>>>>> Which might be tolerable if it then can't write to the page. That would
>>>>>> require "locking" the page r/o (from guest pov), which ought to be
>>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
>>>>>> does: It marks pages PGT_none with a single type ref. This would mean
>>>>>> ...
>>>>>>
>>>>>>>> +    if ( !pg )
>>>>>>>> +        return -ENOMEM;
>>>>>>>> +
>>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>>>>>
>>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
>>>>>> precedent of doing so in the tree, and I may be overlooking something
>>>>>> which prevents that from working.
>>>>>>
>>>>>
>>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
>>>>> think you're talking about doing something like this for each allocated page to
>>>>> set them ro from a pv guest pov:
>>>>>
>>>>> pg->u.inuse.type_info = PGT_none;
>>>>> pg->u.inuse.type_info |= PGT_validated | 1;
>>>>> page_set_owner(page, d); // not sure if this is needed
>>>>>
>>>>> Then, I should use PGT_none instead of PGT_writable_page in
>>>>> get_page_and_type(). Am I right?
>>>>
>>>> No, if at all possible you should avoid open-coding anything. As said,
>>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
>>>> said, unless I'm overlooking something). share_xen_page_with_guest()
>>>> can do what it does because the page isn't owned yet. For a page with
>>>> owner you may not fiddle with type_info in such an open-coded manner.
>>>>
>>>
>>> Thanks. I got the following bug when passing PGT_none:
>>>
>>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
>>> (XEN) Xen BUG at mm.c:2643
>>
>> The caller of the function needs to avoid the call not only for writable
>> and shared pages, but also for this new case of PGT_none.
> 
> Thanks. If I understand correctly, _get_page_type() needs to avoid calling
> validate_page() when type = PGT_none.

Yes.

> For the writable and shared pages, this
> is avoided by setting nx |= PGT_validated. Am I right?

Well, no, I wouldn't describe it like that. The two (soon three) types not
requiring validation simply set the flag without calling validate_page().
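
For illustration, a sketch of the spot in question (assuming the shape the
type-change path in _get_page_type() in xen/arch/x86/mm.c had at the time;
a fragment, not a complete patch):

    if ( (x & PGT_type_mask) != type )
    {
        /* Switch the type; drop any stale validation state. */
        nx &= ~(PGT_type_mask | PGT_validated);
        nx |= type;

        /*
         * No special validation needed for writable or shared pages.
         * Page tables and GDT/LDT need to be scanned for validity.
         */
        if ( type == PGT_writable_page || type == PGT_shared_page ||
             type == PGT_none )
            nx |= PGT_validated;
    }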

Jan
Matias Ezequiel Vara Larsen March 7, 2023, 2:44 p.m. UTC | #16
On Thu, Feb 23, 2023 at 01:42:08PM +0100, Jan Beulich wrote:
> On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
> > On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
> >> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> >>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> >>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> >>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
> >>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>>>>>> +{
> >>>>>>>> +    struct page_info *pg;
> >>>>>>>> +
> >>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>>>>>
> >>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>>>>>> need to guess the MFN, but that's no excuse).
> >>>>>>
> >>>>>> Which might be tolerable if it then can't write to the page. That would
> >>>>>> require "locking" the page r/o (from guest pov), which ought to be
> >>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
> >>>>>> does: It marks pages PGT_none with a single type ref. This would mean
> >>>>>> ...
> >>>>>>
> >>>>>>>> +    if ( !pg )
> >>>>>>>> +        return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>>>>>
> >>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
> >>>>>> precedent of doing so in the tree, and I may be overlooking something
> >>>>>> which prevents that from working.
> >>>>>>
> >>>>>
> >>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
> >>>>> think you're talking about doing something like this for each allocated page to
> >>>>> set them ro from a pv guest pov:
> >>>>>
> >>>>> pg->u.inuse.type_info = PGT_none;
> >>>>> pg->u.inuse.type_info |= PGT_validated | 1;
> >>>>> page_set_owner(page, d); // not sure if this is needed
> >>>>>
> >>>>> Then, I should use PGT_none instead of PGT_writable_page in
> >>>>> get_page_and_type(). Am I right?
> >>>>
> >>>> No, if at all possible you should avoid open-coding anything. As said,
> >>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
> >>>> said, unless I'm overlooking something). share_xen_page_with_guest()
> >>>> can do what it does because the page isn't owned yet. For a page with
> >>>> owner you may not fiddle with type_info in such an open-coded manner.
> >>>>
> >>>
> >>> Thanks. I got the following bug when passing PGT_none:
> >>>
> >>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> >>> (XEN) Xen BUG at mm.c:2643
> >>
> >> The caller of the function needs to avoid the call not only for writable
> >> and shared pages, but also for this new case of PGT_none.
> > 
> > Thanks. If I understand correctly, _get_page_type() needs to avoid calling
> > validate_page() when type = PGT_none.
> 
> Yes.
> 
> > For the writable and shared pages, this
> > is avoided by setting nx |= PGT_validated. Am I right?
> 
> Well, no, I wouldn't describe it like that. The two (soon three) types not
> requiring validation simply set the flag without calling validate_page().
> 

I see, thanks. I added the corresponding check at _get_page_type() to set the
flag without calling validate_page() for the PGT_none type. I think I am
missing something when I am releasing the pages. I am triggering the following
BUG() when issuing put_page_and_type():
 
(XEN) Xen BUG at mm.c:2698

This is at devalidate_page(). I guess the call to devalidate_page() should
also be avoided. I was wondering if put_page_and_type() is required in this case.

Matias
Jan Beulich March 7, 2023, 4:55 p.m. UTC | #17
On 07.03.2023 15:44, Matias Ezequiel Vara Larsen wrote:
> On Thu, Feb 23, 2023 at 01:42:08PM +0100, Jan Beulich wrote:
>> On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
>>> On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
>>>> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
>>>>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
>>>>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
>>>>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
>>>>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
>>>>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
>>>>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
>>>>>>>>>> +{
>>>>>>>>>> +    struct page_info *pg;
>>>>>>>>>> +
>>>>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
>>>>>>>>>
>>>>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
>>>>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
>>>>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
>>>>>>>>> need to guess the MFN, but that's no excuse).
>>>>>>>>
>>>>>>>> Which might be tolerable if it then can't write to the page. That would
>>>>>>>> require "locking" the page r/o (from guest pov), which ought to be
>>>>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
>>>>>>>> does: It marks pages PGT_none with a single type ref. This would mean
>>>>>>>> ...
>>>>>>>>
>>>>>>>>>> +    if ( !pg )
>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
>>>>>>>>
>>>>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
>>>>>>>> precedent of doing so in the tree, and I may be overlooking something
>>>>>>>> which prevents that from working.
>>>>>>>>
>>>>>>>
>>>>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
>>>>>>> think you're talking about doing something like this for each allocated page to
>>>>>>> set them ro from a pv guest pov:
>>>>>>>
>>>>>>> pg->u.inuse.type_info = PGT_none;
>>>>>>> pg->u.inuse.type_info |= PGT_validated | 1;
>>>>>>> page_set_owner(page, d); // not sure if this is needed
>>>>>>>
>>>>>>> Then, I should use PGT_none instead of PGT_writable_page in
>>>>>>> get_page_and_type(). Am I right?
>>>>>>
>>>>>> No, if at all possible you should avoid open-coding anything. As said,
>>>>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
>>>>>> said, unless I'm overlooking something). share_xen_page_with_guest()
>>>>>> can do what it does because the page isn't owned yet. For a page with
>>>>>> owner you may not fiddle with type_info in such an open-coded manner.
>>>>>>
>>>>>
>>>>> Thanks. I got the following bug when passing PGT_none:
>>>>>
>>>>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
>>>>> (XEN) Xen BUG at mm.c:2643
>>>>
>>>> The caller of the function needs to avoid the call not only for writable
>>>> and shared pages, but also for this new case of PGT_none.
>>>
>>> Thanks. If I understand correctly, _get_page_type() needs to avoid calling
>>> validate_page() when type = PGT_none.
>>
>> Yes.
>>
>>> For the writable and shared pages, this
>>> is avoided by setting nx |= PGT_validated. Am I right?
>>
>> Well, no, I wouldn't describe it like that. The two (soon three) types not
>> requiring validation simply set the flag without calling validate_page().
>>
> 
> I see, thanks. I added the corresponding check at _get_page_type() to set the
> flag without calling validate_page() for the PGT_none type. I think I am
> missing something when I am releasing the pages. I am triggering the following
> BUG() when issuing put_page_and_type():
>  
> (XEN) Xen BUG at mm.c:2698
> 
> This is at devalidate_page(). I guess the call to devalidate_page() should
> also be avoided.

Well, yes, symmetry requires a change there as well. Here it's indirect:
You want to avoid the call to _put_final_page_type(). That's enclosed by
(nx & PGT_type_mask) <= PGT_l4_page_table, which happens to be true for
PGT_none as well. There may be more instances of such a comparison, so
it'll be necessary to find them and check whether they may now also be
reached with PGT_none (looks like a comparison against PGT_root_page_table
in _get_page_type() is also affected, albeit in a largely benign way).
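
For illustration, a sketch of how that guard might be adjusted (PGT_none is 0
in the type-mask field, hence it slips under the PGT_l4_page_table bound;
fragment only, not a complete patch):

    if ( unlikely((nx & PGT_type_mask) <= PGT_l4_page_table) &&
         likely((nx & PGT_type_mask) != PGT_none) &&
         likely(nx & (PGT_validated|PGT_partial)) )
    {
        /*
         * Existing _put_final_page_type() handling stays here; it is
         * now no longer reached for PGT_none pages.
         */
    }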

> I was wondering if put_page_and_type() is required in this case.

It is, or some equivalent thereof. Again - see other examples where a
similar allocation pattern exists.
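
A sketch of that teardown pairing (cf. vmtrace_free_buffer() in
xen/common/domain.c; assumes the page was allocated with MEMF_no_refcount
and typed via get_page_and_type()):

    static void stats_free_page(struct page_info *pg)
    {
        if ( !pg )
            return;

        put_page_alloc_ref(pg);  /* drop the MEMF_no_refcount allocation ref */
        put_page_and_type(pg);   /* drop the refs taken by get_page_and_type() */
    }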

Jan
Matias Ezequiel Vara Larsen March 9, 2023, 9:22 a.m. UTC | #18
On Tue, Mar 07, 2023 at 05:55:26PM +0100, Jan Beulich wrote:
> On 07.03.2023 15:44, Matias Ezequiel Vara Larsen wrote:
> > On Thu, Feb 23, 2023 at 01:42:08PM +0100, Jan Beulich wrote:
> >> On 23.02.2023 13:16, Matias Ezequiel Vara Larsen wrote:
> >>> On Fri, Feb 17, 2023 at 03:10:53PM +0100, Jan Beulich wrote:
> >>>> On 17.02.2023 10:29, Matias Ezequiel Vara Larsen wrote:
> >>>>> On Fri, Feb 17, 2023 at 09:57:43AM +0100, Jan Beulich wrote:
> >>>>>> On 17.02.2023 09:50, Matias Ezequiel Vara Larsen wrote:
> >>>>>>> On Wed, Dec 14, 2022 at 08:56:57AM +0100, Jan Beulich wrote:
> >>>>>>>> On 14.12.2022 08:29, Jan Beulich wrote:
> >>>>>>>>> On 07.10.2022 14:39, Matias Ezequiel Vara Larsen wrote:
> >>>>>>>>>> +static int stats_vcpu_alloc_mfn(struct domain *d)
> >>>>>>>>>> +{
> >>>>>>>>>> +    struct page_info *pg;
> >>>>>>>>>> +
> >>>>>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_refcount);
> >>>>>>>>>
> >>>>>>>>> The ioreq and vmtrace resources are also allocated this way, but they're
> >>>>>>>>> HVM-specific. The one here being supposed to be VM-type independent, I'm
> >>>>>>>>> afraid such pages will be accessible by an "owning" PV domain (it'll
> >>>>>>>>> need to guess the MFN, but that's no excuse).
> >>>>>>>>
> >>>>>>>> Which might be tolerable if it then can't write to the page. That would
> >>>>>>>> require "locking" the page r/o (from guest pov), which ought to be
> >>>>>>>> possible by leveraging a variant of what share_xen_page_with_guest()
> >>>>>>>> does: It marks pages PGT_none with a single type ref. This would mean
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>>> +    if ( !pg )
> >>>>>>>>>> +        return -ENOMEM;
> >>>>>>>>>> +
> >>>>>>>>>> +    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
> >>>>>>>>
> >>>>>>>> ... using PGT_none here. Afaict this _should_ work, but we have no
> >>>>>>>> precedent of doing so in the tree, and I may be overlooking something
> >>>>>>>> which prevents that from working.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I do not fully understand this. I checked share_xen_page_with_guest() and I
> >>>>>>> think you're talking about doing something like this for each allocated page to
> >>>>>>> set them ro from a pv guest pov:
> >>>>>>>
> >>>>>>> pg->u.inuse.type_info = PGT_none;
> >>>>>>> pg->u.inuse.type_info |= PGT_validated | 1;
> >>>>>>> page_set_owner(page, d); // not sure if this is needed
> >>>>>>>
> >>>>>>> Then, I should use PGT_none instead of PGT_writable_page in
> >>>>>>> get_page_and_type(). Am I right?
> >>>>>>
> >>>>>> No, if at all possible you should avoid open-coding anything. As said,
> >>>>>> simply passing PGT_none to get_page_and_type() ought to work (again, as
> >>>>>> said, unless I'm overlooking something). share_xen_page_with_guest()
> >>>>>> can do what it does because the page isn't owned yet. For a page with
> >>>>>> owner you may not fiddle with type_info in such an open-coded manner.
> >>>>>>
> >>>>>
> >>>>> Thanks. I got the following bug when passing PGT_none:
> >>>>>
> >>>>> (XEN) Bad type in validate_page 0 t=0000000000000001 c=8040000000000002
> >>>>> (XEN) Xen BUG at mm.c:2643
> >>>>
> >>>> The caller of the function needs to avoid the call not only for writable
> >>>> and shared pages, but also for this new case of PGT_none.
> >>>
> >>> Thanks. If I understand correctly, _get_page_type() needs to avoid calling
> >>> validate_page() when type = PGT_none.
> >>
> >> Yes.
> >>
> >>> For the writable and shared pages, this
> >>> is avoided by setting nx |= PGT_validated. Am I right?
> >>
> >> Well, no, I wouldn't describe it like that. The two (soon three) types not
> >> requiring validation simply set the flag without calling validate_page().
> >>
> > 
> > I see, thanks. I added the corresponding check at _get_page_type() to set the
> > flag without calling validate_page() for the PGT_none type. I think I am
> > missing something when I am releasing the pages. I am triggering the following
> > BUG() when issuing put_page_and_type():
> >  
> > (XEN) Xen BUG at mm.c:2698
> > 
> > This is at devalidate_page(). I guess the call to devalidate_page() should
> > also be avoided.
> 
> Well, yes, symmetry requires a change there as well. Here it's indirect:
> You want to avoid the call to _put_final_page_type(). That's enclosed by
> (nx & PGT_type_mask) <= PGT_l4_page_table, which happens to be true for
> PGT_none as well. There may be more instances of such a comparison, so
> it'll be necessary to find them and check whether they may now also be
> reached with PGT_none (looks like a comparison against PGT_root_page_table
> in _get_page_type() is also affected, albeit in a largely benign way).
> 
Thanks. I could not find any other instances of that comparison besides the
ones you mention. I'll add a new patch to the series to handle support for
PGT_none.

Matias
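
The version field in the patch below follows a seqlock-style protocol: an odd
value means an update is in flight, and a version change across the data read
means the read raced with the writer. A minimal, hypothetical consumer-side
sketch (assuming the stats frame was already mapped into a dom0 tool, e.g. via
xenforeignmemory_map_resource(), and that struct vcpu_stats from the patched
public/vcpu.h is in scope):

    #include <stdint.h>

    /* Hypothetical reader; pairs with the smp_wmb()s in vcpu_runstate_change(). */
    static uint64_t read_running_time(const volatile struct vcpu_stats *s)
    {
        uint32_t version;
        uint64_t val;

        do {
            version = s->version;
            while ( version & 1 )        /* odd: update in progress */
                version = s->version;
            __sync_synchronize();        /* read barrier before the data */
            val = s->runstate_running_time;
            __sync_synchronize();        /* read barrier before the re-check */
        } while ( version != s->version );  /* retry if the writer raced us */

        return val;
    }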
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index ddd001a6ad..1ef6cb5ff0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -741,6 +741,8 @@  void hvm_domain_relinquish_resources(struct domain *d)
 
     ioreq_server_destroy_all(d);
 
+    stats_free_vcpu_mfn(d);
+
     msixtbl_pt_cleanup(d);
 
     /* Stop all asynchronous timer actions. */
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 297b98a562..749486d5d4 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1078,6 +1078,12 @@  unsigned int ioreq_server_max_frames(const struct domain *d)
     return nr;
 }
 
+unsigned int stats_table_max_frames(const struct domain *d)
+{
+    /* One frame fits 256 struct vcpu_stats entries (16 bytes each). */
+    return 1;
+}
+
 /*
  * Return 0 on any kind of error.  Caller converts to -EINVAL.
  *
@@ -1099,6 +1105,9 @@  static unsigned int resource_max_frames(const struct domain *d,
     case XENMEM_resource_vmtrace_buf:
         return d->vmtrace_size >> PAGE_SHIFT;
 
+    case XENMEM_resource_stats_table:
+        return stats_table_max_frames(d);
+
     default:
         return -EOPNOTSUPP;
     }
@@ -1162,6 +1171,88 @@  static int acquire_vmtrace_buf(
     return nr_frames;
 }
 
+void stats_free_vcpu_mfn(struct domain *d)
+{
+    struct page_info *pg = d->vcpustats_page.pg;
+
+    if ( !pg )
+        return;
+
+    d->vcpustats_page.pg = NULL;
+
+    if ( d->vcpustats_page.va )
+        unmap_domain_page_global(d->vcpustats_page.va);
+
+    d->vcpustats_page.va = NULL;
+
+    put_page_alloc_ref(pg);
+    put_page_and_type(pg);
+}
+
+static int stats_vcpu_alloc_mfn(struct domain *d)
+{
+    struct page_info *pg;
+
+    pg = alloc_domheap_page(d, MEMF_no_refcount);
+
+    if ( !pg )
+        return -ENOMEM;
+
+    if ( !get_page_and_type(pg, d, PGT_writable_page) ) {
+        put_page_alloc_ref(pg);
+        return -ENODATA;
+    }
+
+    d->vcpustats_page.va = __map_domain_page_global(pg);
+    if ( !d->vcpustats_page.va )
+        goto fail;
+
+    d->vcpustats_page.pg = pg;
+    clear_page(d->vcpustats_page.va);
+    return 1;
+
+fail:
+    put_page_alloc_ref(pg);
+    put_page_and_type(pg);
+
+    return -ENOMEM;
+}
+
+static int acquire_stats_table(struct domain *d,
+                                unsigned int id,
+                                unsigned int frame,
+                                unsigned int nr_frames,
+                                xen_pfn_t mfn_list[])
+{
+    mfn_t mfn;
+    int rc;
+    unsigned int i;
+
+    if ( !d )
+        return -ENOENT;
+
+    for ( i = 0; i < nr_frames; i++ )
+    {
+        switch ( frame + i )
+        {
+        case XENMEM_resource_stats_frame_vcpustats:
+            if ( !d->vcpustats_page.pg ) {
+                rc = stats_vcpu_alloc_mfn(d);
+                if ( rc < 1 )
+                    return rc;
+            }
+            mfn = page_to_mfn(d->vcpustats_page.pg);
+            mfn_list[i] = mfn_x(mfn);
+            break;
+
+        default:
+            return -EINVAL;
+        }
+    }
+
+    return nr_frames;
+}
+
 /*
  * Returns -errno on error, or positive in the range [1, nr_frames] on
 * success.  Returning less than nr_frames constitutes a request for a
@@ -1182,6 +1273,9 @@  static int _acquire_resource(
     case XENMEM_resource_vmtrace_buf:
         return acquire_vmtrace_buf(d, id, frame, nr_frames, mfn_list);
 
+    case XENMEM_resource_stats_table:
+        return acquire_stats_table(d, id, frame, nr_frames, mfn_list);
+
     default:
         return -EOPNOTSUPP;
     }
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 8f4b1ca10d..3543a531a1 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -264,6 +264,8 @@  static inline void vcpu_runstate_change(
 {
     s_time_t delta;
     struct sched_unit *unit = v->sched_unit;
+    shared_vcpustatspage_t *vcpustats_va;
+    struct domain *d = v->domain;
 
     ASSERT(spin_is_locked(get_sched_res(v->processor)->schedule_lock));
     if ( v->runstate.state == new_state )
@@ -287,6 +289,20 @@  static inline void vcpu_runstate_change(
     }
 
     v->runstate.state = new_state;
+
+    vcpustats_va = (shared_vcpustatspage_t *)d->vcpustats_page.va;
+    if ( vcpustats_va )
+    {
+        vcpustats_va->vcpu_info[v->vcpu_id].version =
+            version_update_begin(vcpustats_va->vcpu_info[v->vcpu_id].version);
+        smp_wmb();
+        memcpy(&vcpustats_va->vcpu_info[v->vcpu_id].runstate_running_time,
+               &v->runstate.time[RUNSTATE_running],
+               sizeof(v->runstate.time[RUNSTATE_running]));
+        smp_wmb();
+        vcpustats_va->vcpu_info[v->vcpu_id].version =
+            version_update_end(vcpustats_va->vcpu_info[v->vcpu_id].version);
+    }
 }
 
 void sched_guest_idle(void (*idle) (void), unsigned int cpu)
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 50e73eef98..e1a10b8b97 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -626,6 +626,7 @@  struct xen_mem_acquire_resource {
 #define XENMEM_resource_ioreq_server 0
 #define XENMEM_resource_grant_table 1
 #define XENMEM_resource_vmtrace_buf 2
+#define XENMEM_resource_stats_table 3
 
     /*
      * IN - a type-specific resource identifier, which must be zero
@@ -683,6 +684,8 @@  struct xen_mem_acquire_resource {
 typedef struct xen_mem_acquire_resource xen_mem_acquire_resource_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_acquire_resource_t);
 
+#define XENMEM_resource_stats_frame_vcpustats 0
+
 /*
  * XENMEM_get_vnumainfo used by guest to get
  * vNUMA topology from hypervisor.
diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
index 3623af932f..5c1812dfd2 100644
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -235,6 +235,22 @@  struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+struct vcpu_stats {
+    /* If the least-significant bit of the version number is set then an update
+     * is in progress and the guest must wait to read a consistent set of values.
+     * This mechanism is similar to Linux's seqlock.
+     */
+    uint32_t version;
+    uint32_t pad0;
+    uint64_t runstate_running_time;
+};
+
+struct shared_vcpustatspage {
+    struct vcpu_stats vcpu_info[1];
+};
+
+typedef struct shared_vcpustatspage shared_vcpustatspage_t;
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 667f9dac83..d1ca8b9aa8 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -134,6 +134,8 @@  int assign_pages(
 /* Dump info to serial console */
 void arch_dump_shared_mem_info(void);
 
+void stats_free_vcpu_mfn(struct domain *d);
+
 /*
  * Extra fault info types which are used to further describe
  * the source of an access violation.
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 5485d08afb..d9551ce35f 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -577,6 +577,11 @@  struct domain
         struct ioreq_server     *server[MAX_NR_IOREQ_SERVERS];
     } ioreq_server;
 #endif
+    /* Page that hosts vcpu stats */
+    struct {
+        struct page_info *pg;
+        void *va;
+    } vcpustats_page;
 };
 
 static inline struct page_list_head *page_to_list(