
[RFC,v2,0/2] Add a new acquire resource to query vcpu statistics

Message ID cover.1665138677.git.matias.vara@vates.fr

Message

Matias Ezequiel Vara Larsen Oct. 7, 2022, 12:39 p.m. UTC
Hello all,

The purpose of this RFC is to get feedback about a new acquire resource that
exposes vcpu statistics for a given domain. The current mechanism to get those
statistics is to query the hypervisor. That mechanism relies on a hypercall
and holds the domctl spinlock during its execution. When a pv tool like xcp-rrdd
periodically samples these counters, it ends up affecting other paths that share
that spinlock. By using acquire resources, the pv tool only requires a few
hypercalls to set up the shared memory region; samples are then taken without
issuing any further hypercalls. The original idea was suggested by Andrew Cooper,
with whom I have been discussing how to implement the current PoC. You can
find the RFC patch series at [1]. The series is rebased on top of stable-4.15.

I am currently a bit blocked on 1) what to expose and 2) how to expose it. For
1), I decided to expose what xcp-rrdd is querying, i.e., what it currently gets
via XEN_DOMCTL_getvcpuinfo. More precisely, xcp-rrdd reads
runstate.time[RUNSTATE_running]. This is a uint64_t counter. However, the time
spent in other states may be interesting too. Regarding 2), I am not sure
whether simply using an array of uint64_t is enough or whether a different
interface should be exposed. The remaining question is when to produce new
values. For the moment, I am updating this counter during
vcpu_runstate_change().

The current series includes a simple pv tool that shows how this new interface is
used. This tool maps the counter and periodically samples it.
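
For illustration, a minimal consumer could look like the sketch below. It is
only a sketch: it assumes the libxenforeignmemory calls discussed later in this
thread, a hypothetical XENMEM_resource_stats_table resource id, and a per-vcpu
record consisting of a seq field plus the running-time counter; the actual
names and layout are the ones defined by the series.

#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>

struct vcpu_sample {
    uint32_t seq;            /* odd while an update is in progress */
    uint32_t _pad;
    uint64_t running_time;   /* runstate.time[RUNSTATE_running] */
};

static uint64_t read_running_time(const volatile struct vcpu_sample *s)
{
    uint32_t seq;
    uint64_t val;

    do {
        seq = s->seq;
        __sync_synchronize();        /* pair with the producer's barriers */
        val = s->running_time;
        __sync_synchronize();
    } while ( (seq & 1) || seq != s->seq );

    return val;
}

int main(void)
{
    xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
    xenforeignmemory_resource_handle *res;
    void *addr = NULL;

    /* Map frame 0 of the (hypothetical) stats resource of domain 1. */
    res = xenforeignmemory_map_resource(fmem, 1, XENMEM_resource_stats_table,
                                        0, 0, 1, &addr, PROT_READ, 0);
    if ( !res )
        return 1;

    printf("vcpu0 running time: %" PRIu64 " ns\n",
           read_running_time(addr));

    xenforeignmemory_unmap_resource(fmem, res);
    xenforeignmemory_close(fmem);
    return 0;
}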

Any feedback/help would be appreciated.

Thanks, Matias.

[1] https://github.com/MatiasVara/xen/tree/feature_stats

Changes in v2:
- rework to ensure that consumer fetches consistent data

Changes in v1:
- rework how the resource is allocated and released
- rework when the resource is allocated: allocation now happens only when the
  resource is requested
- rework the structure shared between the tool and Xen to make it extensible to
  new counters and declare it in a public header

There are still the following questions:
   - the resource shall be released when there are no more readers; otherwise we
     keep updating it on a hot path
   - one frame can host up to 512 vcpus. Should this limit be checked when
     updating? Should it be possible to allocate more than one frame for vcpu
     counters?

Matias Ezequiel Vara Larsen (2):
  xen/memory : Add a stats_table resource type
  tools/misc: Add xen-vcpus-stats tool

 tools/misc/Makefile          |  6 +++
 tools/misc/xen-vcpus-stats.c | 87 +++++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c       |  2 +
 xen/common/memory.c          | 94 ++++++++++++++++++++++++++++++++++++
 xen/common/sched/core.c      | 16 ++++++
 xen/include/public/memory.h  |  3 ++
 xen/include/public/vcpu.h    | 16 ++++++
 xen/include/xen/mm.h         |  2 +
 xen/include/xen/sched.h      |  5 ++
 9 files changed, 231 insertions(+)
 create mode 100644 tools/misc/xen-vcpus-stats.c

Comments

Andrew Cooper Feb. 23, 2023, 7:56 p.m. UTC | #1
A discussion about forward extensible APIs and ABIs here.

First, it's important to say that this should be "domain stats" and not
just "vcpu stats".  This is easy to account for now, but hard to
retrofit later.

For the shared memory, we have a minimum granularity of PAGE_SIZE (4k
for now, but this will change in due course on non-x86 architectures),
and a resource-agnostic way of determining the resource size (as a
multiple of the page size).

But there are other things we need to consider:

1) To be sensibly extendable, there needs to be some metadata, and the
domain stats is clearly going to be a different shape to the vcpu stats.

2) We also want to give Xen some flexibility to allocate memory in a
suitable manner for the system.

3) Xen and the userspace consuming this interface will likely be built
from different versions of the header.  This needs to inter-operate with
the common subset of functionality.


So what we want, at least to describe the shape, is something like this:

struct dom_shmem_stats {
    uint32_t dom_size; // sizeof(dom)
    uint32_t vcpu_offs;
    uint32_t vcpu_size; // sizeof(vcpu)
    uint32_t vcpu_stride;
    ...
};

struct vcpu_shmem_stats {
    ...
};

where the data layout shall be that there is one dom structure starting
at 0, and an array of vcpu_stride-sized objects starting at vcpu_offs.

Obviously, some invariants apply.  vcpu_offs >= dom_size, and
vcpu_stride >= vcpu_size.  The total size of the mapping must be larger
than vcpu_offs + vcpus * vcpu_stride  (and no, I intentionally don't
care about the rounding at the end because it doesn't change things in
practice.)

A very simple layout, packing the data as closely as reasonable, might be:

vcpu_offs = roundup(sizeof(dom), cacheline_size);
vcpu_stride = roundup(sizeof(vcpu), cacheline_size);

but Xen might have reason to use some other rounding.  As the dom or
vcpu size approaches a page, then Xen will want to change allocation
scheme to use page size for both, and not vmap the lot as one
consecutive region.
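
As a concrete illustration (a sketch only, not the actual implementation), the
producer-side layout computation under this scheme could be:

#define ROUNDUP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Fill in the layout header; cacheline_size is whatever Xen chooses. */
dom->dom_size    = sizeof(struct dom_shmem_stats);
dom->vcpu_offs   = ROUNDUP(dom->dom_size, cacheline_size);
dom->vcpu_size   = sizeof(struct vcpu_shmem_stats);
dom->vcpu_stride = ROUNDUP(dom->vcpu_size, cacheline_size);

/* Total allocation, rounded up to whole pages for the acquire resource. */
total_size = ROUNDUP(dom->vcpu_offs + d->max_vcpus * dom->vcpu_stride,
                     PAGE_SIZE);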



For the stats data itself, there wants to be some indication of what
data Xen is producing, so userspace can know not to bother consuming. 
So something like:

struct $foo_stats {
    ...

#define STATS_A (1u << 0)
#define STATS_B (1u << 2)
    uint32_t stats_active;
   
    struct $foo_stats_a {
        uint32_t single_field;
        ... // maybe other singleton fields
    };

    struct $foo_stats_b {
        uint32_t seq;  // "seq" is more common than "version"
        uint32_t _pad;
        uint64_t field1;
        uint64_t field2;
        ...
    };
};


Forward compatibility rules say that you can only ever append new
fields.  But as hinted at with the stats_active field, it's fine to leave
reserved fields around for future use, generally with a rule that
anything reserved shall be 0.

Importantly, this means that offsetof(dom, stats_b) is fixed, and will
inter-operate just fine if e.g. userspace knows about a stats_c that Xen
doesn't know about.

But this does highlight some more invariants.  Xen must not produce any
data outside of [0, vcpu_offs) for dom data, and [base, vcpu_stride) for
vcpu data.

Furthermore, Xen should not produce any data not indicated by the
stats_active field.  That said, if Xen is compiled knowing about
dom->stats_c, and userspace is not, then userspace will observe Xen
advertising a stat it doesn't know, and producing data beyond
userspace's sizeof(dom), but within dom->vcpu_offs.  This is explicitly
fine and expected, and why Xen writes its sizeof() information into the
dom header.  This allows both sides to agree on the layout even when
they're not compiled from identical copies of the header.
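
To make that concrete, a consumer-side sanity check could look like this
sketch (names follow the structs above; mapping_size, nr_vcpus,
KNOWN_STATS_MASK and consume_stats_b() are hypothetical consumer-side values):

const struct dom_shmem_stats *dom = map;

/* Validate the invariants before trusting any of the data. */
if ( dom->vcpu_offs < dom->dom_size ||
     dom->vcpu_stride < dom->vcpu_size ||
     mapping_size < dom->vcpu_offs + nr_vcpus * dom->vcpu_stride )
    return -1;

/* Only consume stats both sides know about. */
uint32_t usable = dom->stats_active & KNOWN_STATS_MASK;

if ( usable & STATS_B )
    consume_stats_b(dom);    /* hypothetical helper */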



A few closing thoughts.

1) It is wise to have a magic number at the head of each dom and vcpu
struct.  This helps sanity check that both sides have correctly agreed
on the layout, but can also serve double duty as an ABI "version".  If
we screw up spectacularly, and decide that the best course of action is
to break backwards compatibility, then we can change the magic and edit
the structs in a non-forwards-compatible way.

2) We may get to a point of wanting arch specific stats.  This can be
accommodated in the model by having struct arch_{dom,vcpu}_stats at
positions described by the layout at the start of dom.  It would be wise
to leave space (reserved fields) there to use if necessary.  This is
cleaner than deciding that we need to put some new layout fields after
the latest $foo_stats_$N and before $foo_stats_$M.

3) It would be great if we could have something in tools/tests/ which
can attach to a running VM and assess the correctness of the invariants
given.  Better yet if it could compile for each change of the ABI and
assess the correctness for all.


I hope this all makes sense.  I know it's not trivial, but there's also
nothing in here which is rocket science, and with a bit of good design
work up front, it will be a flexible interface that we never have to
break backwards compatibility with.

~Andrew
Matias Ezequiel Vara Larsen March 6, 2023, 2:23 p.m. UTC | #2
Hello Andrew, and thanks for the comments; please find my replies below.

On Thu, Feb 23, 2023 at 07:56:28PM +0000, Andrew Cooper wrote:
> A discussion about forward extensible APIs and ABIs here.
> 
> First, its important to say that this should be "domain stats" and not
> just "vcpu stats".  This is easy to account for now, but hard to
> retrofit later.
> 
> For the shared memory, we have a minimum granularity of PAGE_SIZE (4k
> for now, but this will change in due course on non-x86 architectures),
> and a resource-agnostic way of determining the resource size (as a
> multiple of the page size).
> 
> But there are other things we need to consider:
> 
> 1) To be sensibly extendable, there needs to be some metadata, and the
> domain stats is clearly going to be a different shape to the vcpu stats.
> 
> 2) We also want to give Xen some flexibility to allocate memory in a
> suitable manner for the system.
> 
> 3) Xen and the userspace consuming this interface will likely be built
> from different versions of the header.  This needs to inter-operate with
> the common subset of functionality.
> 
> 
> So what we want, at least to describe the shape, is something like this:
> 
> struct dom_shmem_stats {
>     uint32_t dom_size; // sizeof(dom)
>     uint32_t vcpu_offs;
>     uint32_t vcpu_size; // sizeof(vcpu)
>     uint32_t vcpu_stride;
>     ...
> };
> 
> struct vcpu_shmem_stats {
>     ...
> };
> 
> where the data layout shall be that there is one dom structure starting
> at 0, and an array of vcpu_stride objects starting at vcpu_offset.
> 
> Obviously, some invariants apply.  vcpu_offs >= dom_size, and
> vcpu_stride >= vcpu_size.  The total size of the mapping must be larger
> than vcpu_offs + vcpus * vcpu_stride  (and no, I intentionally don't
> care about the rounding at the end because it doesn't change things in
> practice.)
> 

Would it make sense to use different type-specific resource identifiers for
each kind of stats, e.g., XENMEM_resource_stats_table_id_vcpustats,
XENMEM_resource_stats_table_id_domstats, and so on? The size of each of these
type-specific resources would be queried by using
`xenforeignmemory_resource_size()`. The mapping would be done by using
`xenforeignmemory_map_resource()`.

The metadata to represent the XENMEM_resource_stats_table_id_vcpustats
resource could be:

struct vcpu_shmem_stats {
#define STATS_A (1u << 0)
...
#define VCPU_STATS_MAGIC 0xaabbccdd
     uint32_t magic;
     uint32_t stats_active;
     uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
     uint32_t size;    // sizeof(vcpu_stats)
     uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
     uint32_t _pad;
     ...
};

struct vcpu_stats {
    /*
     * If the least-significant bit of the seq number is set then an update
     * is in progress and the consumer must wait to read a consistent set of
     * values. This mechanism is similar to Linux's seqlock.
     */
    uint32_t seq;
    uint32_t _pad;
    uint64_t stats_a; // e.g., runstate_running_time
    ...
};

The data layout shall be that there is one vcpu_shmem_stats structure starting
at 0, and an array of vcpu_stats structures, stride bytes apart, starting at
offset. The invariants would be:
* 1. offset >= sizeof(struct vcpu_shmem_stats)
* 2. stride >= size
* 3. the total size of the mapping (the number of frames returned by
  xenforeignmemory_resource_size() times the page size) must be larger than
  (offset + vcpus * stride).
* 4. Xen must not produce any data outside of [base, base + stride) for each
  vcpu's data.

The steps to add a new counter B would be:
1. append the new field to the vcpu_stats structure.
2. define its bit in stats_active, i.e., #define STATS_B (1u << 1).
3. have Xen advertise STATS_B in stats_active.
I may be missing some steps here, but that would be the overall process.
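
For the seq field above, the producer side would follow the usual seqlock write
pattern. A sketch (assuming the vcpu's slot in the shared area has already been
looked up, e.g. from vcpu_runstate_change()):

static void stats_update(struct vcpu_stats *s, uint64_t running_time)
{
    s->seq++;                    /* odd: update in progress */
    smp_wmb();
    s->stats_a = running_time;   /* e.g., runstate.time[RUNSTATE_running] */
    smp_wmb();
    s->seq++;                    /* even again: consistent snapshot */
}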

Regarding your email, I have the following comments:

- I still do not know how to choose the value of cacheline_size. I understand
this value shall be between the actual cache line size and PAGE_SIZE. A value
that could suit most architectures might be 256 bytes.

- Xen shall use the "stats_active" field to indicate what it is producing. In
  this field, reserved bits shall be 0. This shall allow us to agree on the
layout even when producer and consumer are compiled with different headers.

- In the vcpu_stats structure, new fields can only ever be appended.

- The magic field shall act as a sanity check but also as an ABI version in case
we decide to break backward compatibility.

> A very simple layout, packing the data as closely as reasonable, might be:
> 
> vcpu_offs = roundup(sizeof(dom), cacheline_size)
> vcpu_stride = roundup(sizeof(vcpu), cacheline_size);
> 
> but Xen might have reason to use some other rounding.  As the dom or
> vcpu size approaches a page, then Xen will want to change allocation
> scheme to use page size for both, and not vmap the lot as one
> consecutive region.
> 
> 
> 
> For the stats data itself, there wants to be some indication of what
> data Xen is producing, so userspace can know not to bother consuming. 
> So something like:
> 
> struct $foo_stats {
>     ...
> 
> #define STATS_A (1u << 0)
> #define STATS_B (1u << 2)
>     uint32_t stats_active;
>    
>     struct $foo_stats_a {
>         uint32_t single_field;
>         ... // maybe other singleton fields
>     };
> 
>     struct $foo_stats_b {
>         uint32_t seq;  // "seq" is more common than "version"
>         uint32_t _pad;
>         uint64_t field1;
>         uint64_t field2;
>         ...
>     };
> };
> 
> 
> Forward compatibility rules say that you can only ever append new
> fields.  But as hinted at with the stats_active field, its fine to leave
> reserved fields around for future use, generally with a rule that
> anything reserved shall be 0.
> 
> Importantly, this means that offsetof(dom, stats_b) is fixed, and will
> inter-operate just fine if e.g. userspace knows about a stats_c that Xen
> doesn't know about.
> 
> But this does highlight some more invariants.  Xen must not produce any
> data outside of [0, vcpu_offs) for dom data, and [base, vcpu_stride) for
> vcpu data.
> 
> Furthermore, Xen should not produce any data not indicated by the
> stats_active field.  That said, if Xen is compiled knowing about
> dom->stats_c, and userspace is not, then userspace will observe Xen
> advertising a stat it doesn't know, and producing data beyond
> userspace's sizeof(dom), but within dom->vcpu_offs.  This is explicitly
> fine and expected, and why Xen writes it's sizeof() information into the
> dom header.  This allows both sides to agree on the layout even when
> they're not compiled from identical copies of the header.
> 
> 
> 
> A few closing thoughts.
> 
> 1) It is wise to have a magic number at the head of each dom and vcpu
> struct.  This helps sanity check that both sides have correctly agreed
> on the layout, but can also serve double duty as an ABI "version".  If
> we screw up spectacularly, and decide that the best course of action is
> to break backwards compatibility, then we can change the magic and edit
> the structs in a non-forwards-compatible way.
> 
> 2) We may get to a point of wanting arch specific stats.  This can be
> accommodated in the model by having struct arch_{dom,vcpu}_stats at
> positions described by the layout at the start of dom.  It would be wise
> to leave space (reserved fields) there to use if necessary.  This is
> cleaner than deciding that we need to put some new layout fields after
> the latest $foo_stats_$N and before $foo_stats_$M.
> 

I have not addressed this yet. The vcpu_stats structure could have some fields
that are architecture-dependent. Those fields would be advertised through the
stats_active field.

> 3) It would be great if we could have something in tools/tests/ which
> can attach to a running VM and assess the correctness of the invariants
> given.  Better yet if it could compile for each change of the ABI and
> assess the correctness for all.
> 

I agree.

Feedback is welcome.

Thanks, Matias.
Jan Beulich March 7, 2023, 10:12 a.m. UTC | #3
On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> Regarding your email, I have the following comments:
> 
> - I still do not know how to choose the value of cacheline_size. I understand
> this value shall be between the actual cacheline_size and PAGE_SIZE. A value
> that could match most architectures could be 256 bytes.

This isn't a concern anymore when offset and stride are stored in the
header. It was a concern when trying to come up with a layout where
these values were to be inferred (or known a priori).

> - Xen shall use the "stats_active" field to indicate what it is producing. In
>   this field, reserved bits shall be 0. This shall allow us to agree on the
> layout even when producer and consumer are compiled with different headers.

I wonder how well such a bitfield is going to scale. It provides for
only 32 (maybe 64) counters. Of course this may seem a lot right now,
but you never know how quickly something like this can grow. Plus
with ...

> - In the vcpu_stats structure, new fields can only ever be appended.

... this rule, the only ambiguity that would arise for consumers if
no active flags existed is that they can't tell "event never
occurred" from "hypervisor doesn't know about this anymore".

Jan
Matias Ezequiel Vara Larsen March 8, 2023, 11:54 a.m. UTC | #4
On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> > Regarding your email, I have the following comments:
> > 
> > - I still do not know how to choose the value of cacheline_size. I understand
> > this value shall be between the actual cacheline_size and PAGE_SIZE. A value
> > that could match most architectures could be 256 bytes.
> 
> This isn't a concern anymore when offset and stride are stored in the
> header. It was a concern when trying to come up with a layout where
> these value were to be inferred (or known a priori).
> 

Oh, I see. The cacheline_size shall be decided at compile time for a given
arch, e.g., SMP_CACHE_BYTES.

> > - Xen shall use the "stats_active" field to indicate what it is producing. In
> >   this field, reserved bits shall be 0. This shall allow us to agree on the
> > layout even when producer and consumer are compiled with different headers.
> 
> I wonder how well such a bitfield is going to scale. It provides for
> only 32 (maybe 64) counters. Of course this may seem a lot right now,
> but you never know how quickly something like this can grow. Plus
> with ...
> 

Would it make sense to define it like this?

struct vcpu_shmem_stats {
#define STATS_A (1u << 0)
...
#define VCPU_STATS_MAGIC 0xaabbccdd
     uint32_t magic;
     uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
     uint32_t size;    // sizeof(vcpu_stats)
     uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
     uint32_t nr_stats; // size of stats_active in uint32_t
     uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
     ...
};
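
With such a flexible-array bitmap, a consumer would test whether counter number
n is advertised roughly as follows (a sketch; counters are assumed to be
numbered from 0 across the words of stats_active[]):

static bool stat_active(const struct vcpu_shmem_stats *hdr, unsigned int n)
{
    if ( n / 32 >= hdr->nr_stats )
        return false;    /* Xen predates this counter */

    return hdr->stats_active[n / 32] & (1u << (n % 32));
}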
 
> > - In the vcpu_stats structure, new fields can only ever be appended.
> 
> ... this rule the only ambiguity that could arise to consumers when
> no active flags existed would be that they can't tell "event never
> occurred" from "hypervisor doesn't know about this anymore".

Do you mean how the consumer can figure out whether 1) Xen does not yet know
about some flag, or 2) the flag has been deprecated? I think 2) is the case that
Andrew mentioned, in which the magic number could be used as an ABI version to
break backwards compatibility.

Thanks, Matias.
Jan Beulich March 8, 2023, 2:16 p.m. UTC | #5
On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
> On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
>> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
>>> - Xen shall use the "stats_active" field to indicate what it is producing. In
>>>   this field, reserved bits shall be 0. This shall allow us to agree on the
>>> layout even when producer and consumer are compiled with different headers.
>>
>> I wonder how well such a bitfield is going to scale. It provides for
>> only 32 (maybe 64) counters. Of course this may seem a lot right now,
>> but you never know how quickly something like this can grow. Plus
>> with ...
>>
> 
> Would it make sense to define it like this?:
> 
> struct vcpu_shmem_stats {
> #define STATS_A (1u << 0)
> ...
> #define VCPU_STATS_MAGIC 0xaabbccdd
>      uint32_t magic;
>      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
>      uint32_t size;    // sizeof(vcpu_stats)
>      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
>      uint32_t nr_stats; // size of stats_active in uint32_t
>      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
>      ...
> };

Possibly, but this would make it harder to use the interface. An alternative
might be to specify that an actual stats value set to ~0 marks an inactive
element. Since these are 64-bit counters, with today's and tomorrow's
computers we won't be at risk of a counter reaching a value of 2^^64-1, I
guess. And even if there were one where such a risk could not be excluded
(e.g. because of using pretty large increments), one would merely need to
make sure to saturate at 2^^64-2. Plus at that point one would need to
consider switching to 128-bit fields anyway, as neither saturated nor
wrapped values are of much use to consumers.
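
A sketch of what the producer side could then do, with ~0 as the inactive
marker and saturation just below it (names are illustrative):

#define STATS_INACTIVE (~(uint64_t)0)

static void stats_add(uint64_t *counter, uint64_t delta)
{
    if ( *counter == STATS_INACTIVE )            /* not produced here */
        return;

    if ( delta >= STATS_INACTIVE - *counter )
        *counter = STATS_INACTIVE - 1;           /* saturate at 2^^64-2 */
    else
        *counter += delta;
}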

>>> - In the vcpu_stats structure, new fields can only ever be appended.
>>
>> ... this rule the only ambiguity that could arise to consumers when
>> no active flags existed would be that they can't tell "event never
>> occurred" from "hypervisor doesn't know about this anymore".
> 
> Do you mean how the consumer can figure out if either 1) Xen does not know yet
> about some flag or 2) the flag has been deprecated? I think 2) is the case that
> Andrew mention in which the magic number could be used as an ABI version to
> break backwards compatibility.

No, an inactive field wouldn't break the ABI. An ABI change would be if
such an inactive field was actually removed from the array. Or if e.g.,
as per above, we needed to switch to 128-bit counters.

Jan
Matias Ezequiel Vara Larsen March 9, 2023, 10:38 a.m. UTC | #6
On Wed, Mar 08, 2023 at 03:16:05PM +0100, Jan Beulich wrote:
> On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
> > On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
> >> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> >>> - Xen shall use the "stats_active" field to indicate what it is producing. In
> >>>   this field, reserved bits shall be 0. This shall allow us to agree on the
> >>> layout even when producer and consumer are compiled with different headers.
> >>
> >> I wonder how well such a bitfield is going to scale. It provides for
> >> only 32 (maybe 64) counters. Of course this may seem a lot right now,
> >> but you never know how quickly something like this can grow. Plus
> >> with ...
> >>
> > 
> > Would it make sense to define it like this?:
> > 
> > struct vcpu_shmem_stats {
> > #define STATS_A (1u << 0)
> > ...
> > #define VCPU_STATS_MAGIC 0xaabbccdd
> >      uint32_t magic;
> >      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
> >      uint32_t size;    // sizeof(vcpu_stats)
> >      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
> >      uint32_t nr_stats; // size of stats_active in uint32_t
> >      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
> >      ...
> > };
> 

The use of stats_active[] is meant to provide a bitmap that can scale, thus not
limiting the number of counters in the vcpu_stats structure to 32 or 64. I can't
see another way to have an unlimited number of counters, though.

> Possibly, but this would make it harder to use the interface. An alternative
> might be to specify that an actual stats value set to ~0 marks an inactive
> element. Since these are 64-bit counters, with today's and tomorrow's
> computers we won't be at risk of a counter reaching a value of 2^^64-1, I
> guess. And even if there was one where such a risk could not be excluded
> (e.g. because of using pretty large increments), one would merely need to
> make sure to saturate at 2^^64-2. Plus at such a point one would need to
> consider anyway to switch to 128-bit fields, as neither saturated nor
> wrapped values are of much use to consumers.
> 

If I understand correctly, this use case is for when an element in the stats_active
bitmap becomes inactive, i.e., its bit is set to "0" in stats_active[]. You are
proposing to set the actual stats value to ~0 to mark an inactive element. I
may be missing something here, but would it not be enough to set the
corresponding stats_active[] bit to "0"?

> >>> - In the vcpu_stats structure, new fields can only ever be appended.
> >>
> >> ... this rule the only ambiguity that could arise to consumers when
> >> no active flags existed would be that they can't tell "event never
> >> occurred" from "hypervisor doesn't know about this anymore".
> > 
> > Do you mean how the consumer can figure out if either 1) Xen does not know yet
> > about some flag or 2) the flag has been deprecated? I think 2) is the case that
> > Andrew mention in which the magic number could be used as an ABI version to
> > break backwards compatibility.
> 
> No, an inactive field wouldn't break the ABI. An ABI change would be if
> such an inactive field was actually removed from the array. Or if e.g.,
> as per above, we needed to switch to 128-bit counters.

Got it, Thanks.

Matias
Jan Beulich March 9, 2023, 11:50 a.m. UTC | #7
On 09.03.2023 11:38, Matias Ezequiel Vara Larsen wrote:
> On Wed, Mar 08, 2023 at 03:16:05PM +0100, Jan Beulich wrote:
>> On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
>>> On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
>>>> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
>>>>> - Xen shall use the "stats_active" field to indicate what it is producing. In
>>>>>   this field, reserved bits shall be 0. This shall allow us to agree on the
>>>>> layout even when producer and consumer are compiled with different headers.
>>>>
>>>> I wonder how well such a bitfield is going to scale. It provides for
>>>> only 32 (maybe 64) counters. Of course this may seem a lot right now,
>>>> but you never know how quickly something like this can grow. Plus
>>>> with ...
>>>>
>>>
>>> Would it make sense to define it like this?:
>>>
>>> struct vcpu_shmem_stats {
>>> #define STATS_A (1u << 0)
>>> ...
>>> #define VCPU_STATS_MAGIC 0xaabbccdd
>>>      uint32_t magic;
>>>      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
>>>      uint32_t size;    // sizeof(vcpu_stats)
>>>      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
>>>      uint32_t nr_stats; // size of stats_active in uint32_t
>>>      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
>>>      ...
>>> };
>>
> 
> The use of stats_active[] is meant to have a bitmap that could scale thus not
> limiting the number of counters in the vcpu_stat structure to 32 or 64. I can't
> see other way to have an unlimited number of counters though.
> 
>> Possibly, but this would make it harder to use the interface. An alternative
>> might be to specify that an actual stats value set to ~0 marks an inactive
>> element. Since these are 64-bit counters, with today's and tomorrow's
>> computers we won't be at risk of a counter reaching a value of 2^^64-1, I
>> guess. And even if there was one where such a risk could not be excluded
>> (e.g. because of using pretty large increments), one would merely need to
>> make sure to saturate at 2^^64-2. Plus at such a point one would need to
>> consider anyway to switch to 128-bit fields, as neither saturated nor
>> wrapped values are of much use to consumers.
>>
> 
> If I understand well, this use-case is in case an element in the stats_active
> bitmap becomes inactive, i.e., it is set to "0" in stats_active[]. You are
> proposing to set to ~0 the actual stats value to mark an inactive element. I
> may be missing something here but would not be enough to set to "0" the
> corresponding stats_active[] bit? 

The suggestion was to eliminate the need for stats_active[].

Jan
Matias Ezequiel Vara Larsen March 10, 2023, 10:58 a.m. UTC | #8
On Thu, Mar 09, 2023 at 12:50:18PM +0100, Jan Beulich wrote:
> On 09.03.2023 11:38, Matias Ezequiel Vara Larsen wrote:
> > On Wed, Mar 08, 2023 at 03:16:05PM +0100, Jan Beulich wrote:
> >> On 08.03.2023 12:54, Matias Ezequiel Vara Larsen wrote:
> >>> On Tue, Mar 07, 2023 at 11:12:00AM +0100, Jan Beulich wrote:
> >>>> On 06.03.2023 15:23, Matias Ezequiel Vara Larsen wrote:
> >>>>> - Xen shall use the "stats_active" field to indicate what it is producing. In
> >>>>>   this field, reserved bits shall be 0. This shall allow us to agree on the
> >>>>> layout even when producer and consumer are compiled with different headers.
> >>>>
> >>>> I wonder how well such a bitfield is going to scale. It provides for
> >>>> only 32 (maybe 64) counters. Of course this may seem a lot right now,
> >>>> but you never know how quickly something like this can grow. Plus
> >>>> with ...
> >>>>
> >>>
> >>> Would it make sense to define it like this?:
> >>>
> >>> struct vcpu_shmem_stats {
> >>> #define STATS_A (1u << 0)
> >>> ...
> >>> #define VCPU_STATS_MAGIC 0xaabbccdd
> >>>      uint32_t magic;
> >>>      uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats) + sizeof(uint32_t) * nr_stats, cacheline_size)
> >>>      uint32_t size;    // sizeof(vcpu_stats)
> >>>      uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
> >>>      uint32_t nr_stats; // size of stats_active in uint32_t
> >>>      uint32_t stats_active[XEN_FLEX_ARRAY_DIM];
> >>>      ...
> >>> };
> >>
> > 
> > The use of stats_active[] is meant to have a bitmap that could scale thus not
> > limiting the number of counters in the vcpu_stat structure to 32 or 64. I can't
> > see other way to have an unlimited number of counters though.
> > 
> >> Possibly, but this would make it harder to use the interface. An alternative
> >> might be to specify that an actual stats value set to ~0 marks an inactive
> >> element. Since these are 64-bit counters, with today's and tomorrow's
> >> computers we won't be at risk of a counter reaching a value of 2^^64-1, I
> >> guess. And even if there was one where such a risk could not be excluded
> >> (e.g. because of using pretty large increments), one would merely need to
> >> make sure to saturate at 2^^64-2. Plus at such a point one would need to
> >> consider anyway to switch to 128-bit fields, as neither saturated nor
> >> wrapped values are of much use to consumers.
> >>
> > 
> > If I understand well, this use-case is in case an element in the stats_active
> > bitmap becomes inactive, i.e., it is set to "0" in stats_active[]. You are
> > proposing to set to ~0 the actual stats value to mark an inactive element. I
> > may be missing something here but would not be enough to set to "0" the
> > corresponding stats_active[] bit? 
> 
> The suggestion was to eliminate the need for stats_active[].
> 
Oh, I see, thanks for the clarification. To summarise, these are the current
options:
1. Use a "uint64_t" field, thus limiting the number of counters to 64. The
current vcpu_runstate_info structure is limited to 4 counters though, one for
each RUNSTATE_*.
2. Use a dynamic array, but this makes the interface harder to use.
3. Eliminate stats_active and set the actual stats value to ~0 to mark inactive
counters. This requires adding a "nr_stats" field to know how many counters
there are. Also, this requires making sure to saturate at 2^^64-2.

I might be missing some details here, but these are the options to evaluate.

I would go with a variation of 1) by using two uint64_t fields, i.e., up to 128
vcpu counters, which I think would be enough. I may be wrong.

Thoughts?
 
Matias
Jan Beulich March 10, 2023, 11:34 a.m. UTC | #9
On 10.03.2023 11:58, Matias Ezequiel Vara Larsen wrote:
> Oh, I see, thanks for the clarification. To summarise, these are the current
> options:
> 1. Use a "uint64_t" field thus limiting the number of counters to 64. The
> current vcpu_runstate_info structure is limited to 4 counters though, one for
> each RUNSTATE_*. 
> 2. Use a dynamic array but this makes harder to use the interface.
> 3. Eliminate stats_active and set to ~0 the actual stats value to mark inactive
> counters. This requires adding a "nr_stats" field to know how many counters are.

While nr_stats can indeed be seen as a generalization of the earlier
stats_active, I think it is possible to get away without it, as long as
padding fields are also filled with the "inactive" marker.

> Also, this requires to make sure to saturate at 2^^64-2.

Thinking of it - considering overflowed counters inactive looks like a
reasonable model to me as well (which would mean saturating at 2^^64-1).

> I might miss some details here but these are the options to evaluate. 
> 
> I would go with a variation of 1) by using two uint64_t, i.e., up to 128 vcpu's
> counters, which I think it would be enough. I may be wrong.

Well, to me it doesn't matter whether it's 32, 64, or 128 - my concern
is with any kind of inherent upper bound. Using 128 right away might
look excessive, just like 32 might look too little. Hence my desire to
get away without any built-in upper bound. IOW I continue to favor 3,
irrespective of the presence or absence of nr_stats.

Jan
Matias Ezequiel Vara Larsen March 14, 2023, 10:28 a.m. UTC | #10
On Fri, Mar 10, 2023 at 12:34:33PM +0100, Jan Beulich wrote:
> On 10.03.2023 11:58, Matias Ezequiel Vara Larsen wrote:
> > Oh, I see, thanks for the clarification. To summarise, these are the current
> > options:
> > 1. Use a "uint64_t" field thus limiting the number of counters to 64. The
> > current vcpu_runstate_info structure is limited to 4 counters though, one for
> > each RUNSTATE_*. 
> > 2. Use a dynamic array but this makes harder to use the interface.
> > 3. Eliminate stats_active and set to ~0 the actual stats value to mark inactive
> > counters. This requires adding a "nr_stats" field to know how many counters are.
> 
> While nr_stats can indeed be seen as a generalization of the earlier
> stats_active, I think it is possible to get away without, as long as
> padding fields also are filled with the "inactive" marker.
> 

Understood.

> > Also, this requires to make sure to saturate at 2^^64-2.
> 
> Thinking of it - considering overflowed counters inactive looks like a
> reasonable model to me as well (which would mean saturating at 2^^64-1).
> 
> > I might miss some details here but these are the options to evaluate. 
> > 
> > I would go with a variation of 1) by using two uint64_t, i.e., up to 128 vcpu's
> > counters, which I think it would be enough. I may be wrong.
> 
> Well, to me it doesn't matter whether it's 32, 64, or 128 - my concern
> is with any kind of inherent upper bound. Using 128 right away might
> look excessive, just like 32 might look too little. Hence my desire to
> get away without any built-in upper bound. IOW I continue to favor 3,
> irrespective of the presence or absence of nr_stats.
> 
I see. The layout for option 3) would look like:

struct vcpu_shmem_stats {
#define VCPU_STATS_MAGIC 0xaabbccdd
    uint32_t magic;
    uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
    uint32_t size;    // sizeof(vcpu_stats)
    uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
};

struct vcpu_stats {
    /*
     * If the least-significant bit of the seq number is set then an update
     * is in progress and the consumer must wait to read a consistent set of
     * values. This mechanism is similar to Linux's seqlock.
     */
    uint32_t seq;
    uint32_t _pad;
    /*
     * If the most-significant bit of a counter is set then the counter
     * is inactive and the consumer must ignore its value. Note that this
     * could also indicate that the counter has overflowed.
     */
    uint64_t stats_a; // e.g., runstate_running_time
    ...
};

All padding fields shall be marked as "inactive". The consumer can't
distinguish inactive from overflowed. Also, the consumer shall always verify
before reading that:

offsetof(struct vcpu_stats, stats_y) < size

in case the consumer knows about a counter, e.g., stats_y, that Xen does not
know about.
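
Putting it together, a consumer of this layout could do something like the
following sketch (stats_a, cpu and the error handling are illustrative):

const struct vcpu_shmem_stats *hdr = map;
const volatile struct vcpu_stats *v;
uint32_t seq;
uint64_t val;

if ( hdr->magic != VCPU_STATS_MAGIC )
    return -1;                               /* ABI mismatch */

/* Counter known to the consumer but not to this Xen? */
if ( offsetof(struct vcpu_stats, stats_a) >= hdr->size )
    return -1;

v = (const void *)((const char *)map + hdr->offset + cpu * hdr->stride);

do {
    seq = v->seq;                            /* odd: update in progress */
    /* a read barrier is needed here in a real implementation */
    val = v->stats_a;
} while ( (seq & 1) || seq != v->seq );

if ( val >> 63 )                             /* most-significant bit set */
    return -1;                               /* inactive or overflowed */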

Matias
Jan Beulich March 14, 2023, 10:34 a.m. UTC | #11
On 14.03.2023 11:28, Matias Ezequiel Vara Larsen wrote:
> On Fri, Mar 10, 2023 at 12:34:33PM +0100, Jan Beulich wrote:
>> On 10.03.2023 11:58, Matias Ezequiel Vara Larsen wrote:
>>> Oh, I see, thanks for the clarification. To summarise, these are the current
>>> options:
>>> 1. Use a "uint64_t" field thus limiting the number of counters to 64. The
>>> current vcpu_runstate_info structure is limited to 4 counters though, one for
>>> each RUNSTATE_*. 
>>> 2. Use a dynamic array but this makes harder to use the interface.
>>> 3. Eliminate stats_active and set to ~0 the actual stats value to mark inactive
>>> counters. This requires adding a "nr_stats" field to know how many counters are.
>>
>> While nr_stats can indeed be seen as a generalization of the earlier
>> stats_active, I think it is possible to get away without, as long as
>> padding fields also are filled with the "inactive" marker.
>>
> 
> Understood.
> 
>>> Also, this requires to make sure to saturate at 2^^64-2.
>>
>> Thinking of it - considering overflowed counters inactive looks like a
>> reasonable model to me as well (which would mean saturating at 2^^64-1).
>>
>>> I might miss some details here but these are the options to evaluate. 
>>>
>>> I would go with a variation of 1) by using two uint64_t, i.e., up to 128 vcpu's
>>> counters, which I think it would be enough. I may be wrong.
>>
>> Well, to me it doesn't matter whether it's 32, 64, or 128 - my concern
>> is with any kind of inherent upper bound. Using 128 right away might
>> look excessive, just like 32 might look too little. Hence my desire to
>> get away without any built-in upper bound. IOW I continue to favor 3,
>> irrespective of the presence or absence of nr_stats.
>>
> I see. 3) layout would look like:
> 
> struct vcpu_shmem_stats {
> #define VCPU_STATS_MAGIC 0xaabbccdd
>     uint32_t magic;
>     uint32_t offset;  // roundup(sizeof(vcpu_shmem_stats), cacheline_size)
>     uint32_t size;    // sizeof(vcpu_stats)
>     uint32_t stride;  // roundup(sizeof(vcpu_stats), cacheline_size)
> };
> 
> struct vcpu_stats {
>     /*
>      * If the least-significant bit of the seq number is set then an update
>      * is in progress and the consumer must wait to read a consistent set of
>      * values. This mechanism is similar to Linux's seqlock.
>      */
>     uint32_t seq;
>     uint32 _pad;
>     /*
>      * If the most-significant bit of a counter is set then the counter
>      * is inactive and the consumer must ignore its value. Note that this
>      * could also indicate that the counter has overflowed.
>      */
>     uint64_t stats_a; // e.g., runstate_running_time
>     ...
> };
> 
> All padding fields shall be marked as "inactive". The consumer can't
> distinguish inactive from overflowed. Also, the consumer shall always verify
> before reading that:
> 
> offsetof(struct vcpu_stats, stats_y) < size. 
> 
> in case the consumer knows about a counter, e.g., stats_y, that Xen does not
> it.

Looks okay to me this way, but please have this verified by another party
(preferably Andrew, whose proposal was now changed).

Jan