
[v2] mm/pdx: Add comments throughout the codebase for pdx

Message ID 20230622140237.8996-1-alejandro.vallejo@cloud.com (mailing list archive)
State Superseded

Commit Message

Alejandro Vallejo June 22, 2023, 2:02 p.m. UTC
Document the behaviour of the pdx machinery in Xen. Some logic is fairly
opaque and hard to follow without it being documented anywhere. This
explains the rationale behind compression and its relationship to
frametable indexing and directmap management.

While modifying the file:
  * Convert u64 -> uint64_t
  * Remove extern keyword from function prototypes

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
---
v2:
  * [pdx.h] Moved pdx_init_mask() after pdx_region_mask() so comments are
            more self-documenting

  (Jan feedback)
  * [pdx.c] Rewrote ma_bottom_mask as ma_va_bottom in comment.
  * [pdx.c] Rewrote comment in pdx_region_mask() to explain in terms of
            msb() rather than fill_mask().
  * [mm.h] Made it explicit that the compression scheme might still yield
           an identity mapping between mfn and pdx.
  * [mm.h] Removed vaddr definition in mm.h
  * [pdx.h] Refine definition of compression so it's clear it can apply to
            many regions
  * [pdx.h] s/they exist/they are backed by RAM/
  * [pdx.h] Improved comment in pdx_region_mask()
  * [pdx.h] Corrected mistaken comment in pdx_init_mask()
  * [pdx.h] Added mathematical interval in comment for set_pdx_range()
  * [pdx.h] Rewrote __mfn_valid() to refer to the frame table rather than
            a pdx
---
 xen/common/pdx.c      |  67 +++++++++++++++++++--
 xen/include/xen/mm.h  |  10 +++
 xen/include/xen/pdx.h | 137 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 206 insertions(+), 8 deletions(-)

Comments

Jan Beulich July 6, 2023, 9:50 a.m. UTC | #1
On 22.06.2023 16:02, Alejandro Vallejo wrote:
> @@ -57,9 +100,25 @@ uint64_t __init pdx_init_mask(uint64_t base_addr)
>                           (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
>  }
>  
> -u64 __init pdx_region_mask(u64 base, u64 len)
> +uint64_t __init pdx_region_mask(uint64_t base, uint64_t len)
>  {
> -    return fill_mask(base ^ (base + len - 1));
> +    uint64_t last = base + len - 1;
> +    /*
> +     * The only bit that matters in base^last is the MSB. There are 2 cases.
> +     *
> +     * case msb(base) < msb(last):
> +     *     then msb(fill_mask(base^last)) == msb(last). This is non
> +     *     compressible.
> +     * case msb(base) == msb(last):
> +     *     This means that there _may_ be a sequence of compressible zeroes
> +     *     for all addresses between `base` and `last` iff `base` has enough
> +     *     trailing zeroes. That is, it's compressible when

Why trailing zeros? [100000f000,10ffffffff] has compressible bits
32-35, but the low bits of base don't matter at all.

> +     *     msb(fill_mask(base^last)) < msb(last)

No caller uses the result this way, so I'm unconvinced it is helpful
to explain it here this way. This is also why I'm still not convinced
of the introduction of "last" (as a real variable and in the comment).
It's only the invariant bits in the range that we're after, as you
say ...

> +     * The resulting mask is effectively the moving bits between `base` and
> +     * `last`.

... here (where things could be expressed without "last").

In any event, nit: If you introduce a local variable, then it wants
following by a blank line.

> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -31,6 +31,16 @@
>   *   (i.e. all devices assigned to) a guest share a single DMA address space
>   *   and, by default, Xen will ensure dfn == pfn.
>   *
> + * pdx: Page InDeX
> + *   Indices into the frame table holding the per-page's book-keeping
> + *   metadata. A compression scheme is used and there's a possibly non

s/is/may be/ ?

Also as said earlier at least on x86 pdx-es are also used as direct map
indices. I think this wants mentioning irrespective of what Arm does.

> --- a/xen/include/xen/pdx.h
> +++ b/xen/include/xen/pdx.h
> @@ -1,6 +1,67 @@
>  #ifndef __XEN_PDX_H__
>  #define __XEN_PDX_H__
>  
> +/*
> + * PDX (Page inDeX)
> + *
> + * This file deals with optimisations pertaining frame table indexing,

Nit: Missing "to"?

> + * A pdx is an index into the frame table. However, having an identity
> + * relationship between mfn and pdx could waste copious amounts of memory
> + * in empty frame table entries. There are some techniques to bring memory
> + * wastage down.

Like above the direct map wants mentioning here as well, I think.

> + * ## PDX grouping
> + *
> + * The frame table may have some sparsity even on systems where the memory
> + * banks are tightly packed. This is due to system quirks (like the PCI
> + * hole) which might introduce several GiB of unused page frame numbers
> + * that uselessly waste memory in the frame table. PDX grouping addresses
> + * this by keeping a bitmap of the ranges in the frame table containing
> + * invalid entries and not allocating backing memory for them.
> + *
> + * ## PDX compression
> + *
> + * This is a technique to avoid wasting memory on machines known to have
> + * split their machine address space in several big discontinuous and highly
> + * disjoint chunks.
> + *
> + * In its uncompressed form the frame table must have book-keeping metadata
> + * structures for every page between [0, max_mfn) (whether they are backed
> + * by RAM or not), and a similar condition exists for the direct map. We
> + * know some systems, however, that have some sparsity in their address
> + * space, leading to a lot of wastage in the form of unused frame table
> + * entries.
> + *
> + * This is where compression becomes useful. The idea is to note that if
> + * you have several big chunks of memory sufficiently far apart you can
> + * ignore the middle part of the address because it will always contain
> + * zeroes as long as the base address is sufficiently well aligned and the
> + * length of the region is much smaller than the base address.

As per above alignment of the base address doesn't really matter.

> + * i.e:
> + *   Consider 2 regions of memory. One starts at 0 while the other starts
> + *   at offset 2^off_h. Furthermore, let's assume both regions are smaller
> + *   than 2^off_l. This means that all addresses between [2^off_l, 2^off_h)
> + *   are invalid and we can assume them to be zero on all valid addresses.
> + *
> + *                 off_h     off_l
> + *                 |         |
> + *                 V         V
> + *         --------------------------
> + *         |HHHHHHH|000000000|LLLLLL| <--- mfn
> + *         --------------------------
> + *           ^ |
> + *           | | (de)compression by adding/removing "useless" zeroes
> + *           | V
> + *         ---------------
> + *         |HHHHHHHLLLLLL| <--- pdx
> + *         ---------------
> + *
> + * This scheme also holds for multiple regions, where HHHHHHH acts as
> + * the region identifier and LLLLLL fully contains the span of every
> + * region involved. Consider the following 3 regions.
> + */
> +
>  #ifdef CONFIG_HAS_PDX

Stray last sentence in the comment?

> @@ -13,22 +74,81 @@ extern unsigned long pfn_top_mask, ma_top_mask;
>                           (sizeof(*frame_table) & -sizeof(*frame_table)))
>  extern unsigned long pdx_group_valid[];
>  
> -extern uint64_t pdx_init_mask(u64 base_addr);
> -extern u64 pdx_region_mask(u64 base, u64 len);
> +/**
> + * Calculates a mask covering "moving" bits of all addresses of a region
> + *
> + * The i-th bit of the mask must be set if there's 2 different addresses
> + * in the region that have different j-th bits. where j >= i.
> + *
> + * e.g:
> + *       base=0x1B00000000
> + *   len+base=0x1B00082000
> + *
> + *   ought to return 0x00000FFFFF, which implies that every bit position
> + *   with a zero in the mask remains unchanged in every address of the
> + *   region.

Maybe the example would look "more generic" if the resulting mask didn't
consist of just 0s and Fs, e.g.

 *       base=0x1B00000000
 *   len+base=0x1B00042000
 *
 *   ought to return 0x000007FFFF, ...

> + * @param base Base address of the region
> + * @param len  Size in octets of the region
> + * @return Mask of moving bits at the bottom of all the region addresses
> + */
> +uint64_t pdx_region_mask(uint64_t base, uint64_t len);
>  
> -extern void set_pdx_range(unsigned long smfn, unsigned long emfn);
> +/**
> + * Creates a basic pdx mask

What is "basic" trying to describe here? "The mask to start from" would
look more to the point to me, plus saying that this is the (up front)
companion to pdx_region_mask().

> + * This mask is used to determine non-compressible bits. No address in the
> + * system shall have compressible bits covered by this initial mask.
> + *
> + * It's the larger of:
> + *   - A mask of MAX_ORDER + PAGESHIFT 1s
> + *   - base_addr rounded up to the nearest `2^n - 1`

I'm having trouble with this mathematically: Rounding always means
going to some proper multiple of some base number (granularity). This
doesn't fit with the value being 2^n-1. Maybe "padded"?

> + * @param base_addr Address of the first maddr in the system
> + * @return An integer of the form 2^n - 1
> + */
> +uint64_t pdx_init_mask(uint64_t base_addr);
> +
> +/**
> + * Mark [smfn, emfn) as allocatable in the frame table

What does "allocatable" mean here?

Jan
Alejandro Vallejo July 7, 2023, 3:55 p.m. UTC | #2
On Thu, Jul 06, 2023 at 11:50:58AM +0200, Jan Beulich wrote:
> On 22.06.2023 16:02, Alejandro Vallejo wrote:
> > @@ -57,9 +100,25 @@ uint64_t __init pdx_init_mask(uint64_t base_addr)
> >                           (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
> >  }
> >  
> > -u64 __init pdx_region_mask(u64 base, u64 len)
> > +uint64_t __init pdx_region_mask(uint64_t base, uint64_t len)
> >  {
> > -    return fill_mask(base ^ (base + len - 1));
> > +    uint64_t last = base + len - 1;
> > +    /*
> > +     * The only bit that matters in base^last is the MSB. There are 2 cases.
> > +     *
> > +     * case msb(base) < msb(last):
> > +     *     then msb(fill_mask(base^last)) == msb(last). This is non
> > +     *     compressible.
> > +     * case msb(base) == msb(last):
> > +     *     This means that there _may_ be a sequence of compressible zeroes
> > +     *     for all addresses between `base` and `last` iff `base` has enough
> > +     *     trailing zeroes. That is, it's compressible when
> 
> Why trailing zeros? [100000f000,10ffffffff] has compressible bits
> 32-35, but the low bits of base don't matter at all.
Ugh, I was thinking about the zeroes in the hole, but those are hardly
trailing indeed. Look below for the revamped comment though, as this simply
goes away.

> 
> > +     *     msb(fill_mask(base^last)) < msb(last)
> 
> No caller uses the result this way, so I'm unconvinced it is helpful
> to explain it here this way. This is also why I'm still not convinced
> of the introduction of "last" (as a real variable and in the comment).
> It's only the invariant bits in the range that we're after, as you
> say ...
> > +     * The resulting mask is effectively the moving bits between `base` and
> > +     * `last`.
> 
> ... here (where things could be expressed without "last").
> 
I've given it a go rephrasing it in terms of the logical operations being
applied rather than the relationship between the inputs. If you're still
unsatisfied I'm happy to hear other suggestions. It's just a complicated
thing to put into words.

With this definition I'm happy to remove the `last` auxiliary variable from
the patch because it's unnecessary.

```
     * We say a bit "moves" in a range if there exist 2 addresses in that
     * range that have that bit both set and cleared respectively. We want
     * to create a mask of _all_ moving bits in this range. We do this by
     * comparing the first and last addresses in the range, discarding the
     * bits that remain the same (this is logically an XOR operation). The
     * MSB of the resulting expression is the most significant moving bit
     * in the range. Then it's a matter of setting every bit in lower
     * positions in order to get the mask of moving bits.
```

> > --- a/xen/include/xen/mm.h
> > +++ b/xen/include/xen/mm.h
> > @@ -31,6 +31,16 @@
> >   *   (i.e. all devices assigned to) a guest share a single DMA address space
> >   *   and, by default, Xen will ensure dfn == pfn.
> >   *
> > + * pdx: Page InDeX
> > + *   Indices into the frame table holding the per-page's book-keeping
> > + *   metadata. A compression scheme is used and there's a possibly non
> 
> s/is/may be/ ?
Ack. The scheme is used even if it can yield no gains, but I do plan on
making it optional, so this comment can preemptively already reflect that.

> 
> Also as said earlier at least on x86 pdx-es are also used as direct map
> indices. I think this wants mentioning irrespective of what Arm does.
This is a common header. If the warping of the directmap due to pdx
compression is mentioned one has to talk about the common aspects of it,
and at least give a heads up that each port is free to apply further warps.
I'll mention the notion of it, but it must be vague in the spirit of
describing common behaviour and not x86 in particular.

> 
> > --- a/xen/include/xen/pdx.h
> > +++ b/xen/include/xen/pdx.h
> > @@ -1,6 +1,67 @@
> >  #ifndef __XEN_PDX_H__
> >  #define __XEN_PDX_H__
> >  
> > +/*
> > + * PDX (Page inDeX)
> > + *
> > + * This file deals with optimisations pertaining frame table indexing,
> 
> Nit: Missing "to"?
Indeed
> 
> > + * A pdx is an index into the frame table. However, having an identity
> > + * relationship between mfn and pdx could waste copious amounts of memory
> > + * in empty frame table entries. There are some techniques to bring memory
> > + * wastage down.
> 
> Like above the direct map wants mentioning here as well, I think.
I can add another paragraph mentioning that warp. Like I mentioned before,
we should be careful not to leave the ARM port (or others) outside the
scope of these definitions as they aren't x86-specific.

> > + * ## PDX compression
> > + *
> > + * This is a technique to avoid wasting memory on machines known to have
> > + * split their machine address space in several big discontinuous and highly
> > + * disjoint chunks.
> > + *
> > + * In its uncompressed form the frame table must have book-keeping metadata
> > + * structures for every page between [0, max_mfn) (whether they are backed
> > + * by RAM or not), and a similar condition exists for the direct map. We
> > + * know some systems, however, that have some sparsity in their address
> > + * space, leading to a lot of wastage in the form of unused frame table
> > + * entries.
> > + *
> > + * This is where compression becomes useful. The idea is to note that if
> > + * you have several big chunks of memory sufficiently far apart you can
> > + * ignore the middle part of the address because it will always contain
> > + * zeroes as long as the base address is sufficiently well aligned and the
> > + * length of the region is much smaller than the base address.
> 
> As per above alignment of the base address doesn't really matter.
Where above? As far as I understand you need enough alignment to cover the
hole or you won't have zeroes to compress. Case in point:

  * region1: [0x0000000000000000 -
              0x00000000FFFFFFFF]

  * region2: [0x0001FFFFFFFFF000 -
              0x00020000FFFFFFFF]

I can agree this configuration is beyond dumb and statistically unlikely to
exist in the wild, but it should (IMO) still be covered by that comment.

> 
> > + * i.e:
> > + *   Consider 2 regions of memory. One starts at 0 while the other starts
> > + *   at offset 2^off_h. Furthermore, let's assume both regions are smaller
> > + *   than 2^off_l. This means that all addresses between [2^off_l, 2^off_h)
> > + *   are invalid and we can assume them to be zero on all valid addresses.
> > + *
> > + *                 off_h     off_l
> > + *                 |         |
> > + *                 V         V
> > + *         --------------------------
> > + *         |HHHHHHH|000000000|LLLLLL| <--- mfn
> > + *         --------------------------
> > + *           ^ |
> > + *           | | (de)compression by adding/removing "useless" zeroes
> > + *           | V
> > + *         ---------------
> > + *         |HHHHHHHLLLLLL| <--- pdx
> > + *         ---------------
> > + *
> > + * This scheme also holds for multiple regions, where HHHHHHH acts as
> > + * the region identifier and LLLLLL fully contains the span of every
> > + * region involved. Consider the following 3 regions.
> > + */
> > +
> >  #ifdef CONFIG_HAS_PDX
> 
> Stray last sentence in the comment?
Oops, yes. I had more, but it was inconsequential and I ended up removing
it. I'll remove that last sentence too.

> 
> > @@ -13,22 +74,81 @@ extern unsigned long pfn_top_mask, ma_top_mask;
> >                           (sizeof(*frame_table) & -sizeof(*frame_table)))
> >  extern unsigned long pdx_group_valid[];
> >  
> > -extern uint64_t pdx_init_mask(u64 base_addr);
> > -extern u64 pdx_region_mask(u64 base, u64 len);
> > +/**
> > + * Calculates a mask covering "moving" bits of all addresses of a region
> > + *
> > + * The i-th bit of the mask must be set if there's 2 different addresses
> > + * in the region that have different j-th bits. where j >= i.
> > + *
> > + * e.g:
> > + *       base=0x1B00000000
> > + *   len+base=0x1B00082000
> > + *
> > + *   ought to return 0x00000FFFFF, which implies that every bit position
> > + *   with a zero in the mask remains unchanged in every address of the
> > + *   region.
> 
> Maybe the example would look "more generic" if the resulting mask didn't
> consist of just 0s and Fs, e.g.
> 
>  *       base=0x1B00000000
>  *   len+base=0x1B00042000
>  *
>  *   ought to return 0x000007FFFF, ...
Good idea, sure.

> 
> > + * @param base Base address of the region
> > + * @param len  Size in octets of the region
> > + * @return Mask of moving bits at the bottom of all the region addresses
> > + */
> > +uint64_t pdx_region_mask(uint64_t base, uint64_t len);
> >  
> > -extern void set_pdx_range(unsigned long smfn, unsigned long emfn);
> > +/**
> > + * Creates a basic pdx mask
> 
> What is "basic" trying to describe here? "The mask to start from" would
> look more to the point to me, plus saying that this is the (up front)
> companion to pdx_region_mask().
I'm honestly on the fence about whether this function should exist at all. One
could always integrate this mask into pdx_region_mask() and have the
callers start with a zero mask. I won't do this in this patch, but I'll try
to get rid of it in the follow-up series to these docs.

> 
> > + * This mask is used to determine non-compressible bits. No address in the
> > + * system shall have compressible bits covered by this initial mask.
> > + *
> > + * It's the larger of:
> > + *   - A mask of MAX_ORDER + PAGESHIFT 1s
> > + *   - base_addr rounded up to the nearest `2^n - 1`
> 
> I'm having trouble with this mathematically: Rounding always means
> going to some proper multiple of some base number (granularity).
I'm not sure about that definition, but regardless I can just remove that
"larger of" and ignore the problem. It's basically telling what the code
does and not adding content.

> This doesn't fit with the value being 2^n-1. Maybe "padded"?
Padding the number would be adding zeroes or ones, but we're mutating it.
IMO, that would add more confusion than it would solve.

> 
> > + * @param base_addr Address of the first maddr in the system
> > + * @return An integer of the form 2^n - 1
> > + */
> > +uint64_t pdx_init_mask(uint64_t base_addr);
> > +
> > +/**
> > + * Mark [smfn, emfn) as allocatable in the frame table
> 
> What does "allocatable" mean here?
> 
> Jan
Accessible, which is probably what it should say instead.

Thanks,

Alejandro
Jan Beulich July 10, 2023, 7:43 a.m. UTC | #3
On 07.07.2023 17:55, Alejandro Vallejo wrote:
> On Thu, Jul 06, 2023 at 11:50:58AM +0200, Jan Beulich wrote:
>> On 22.06.2023 16:02, Alejandro Vallejo wrote:
>>> @@ -57,9 +100,25 @@ uint64_t __init pdx_init_mask(uint64_t base_addr)
>>>                           (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
>>>  }
>>>  
>>> -u64 __init pdx_region_mask(u64 base, u64 len)
>>> +uint64_t __init pdx_region_mask(uint64_t base, uint64_t len)
>>>  {
>>> -    return fill_mask(base ^ (base + len - 1));
>>> +    uint64_t last = base + len - 1;
>>> +    /*
>>> +     * The only bit that matters in base^last is the MSB. There are 2 cases.
>>> +     *
>>> +     * case msb(base) < msb(last):
>>> +     *     then msb(fill_mask(base^last)) == msb(last). This is non
>>> +     *     compressible.
>>> +     * case msb(base) == msb(last):
>>> +     *     This means that there _may_ be a sequence of compressible zeroes
>>> +     *     for all addresses between `base` and `last` iff `base` has enough
>>> +     *     trailing zeroes. That is, it's compressible when
>>
>> Why trailing zeros? [100000f000,10ffffffff] has compressible bits
>> 32-35, but the low bits of base don't matter at all.

This is ...

>>> + * ## PDX compression
>>> + *
>>> + * This is a technique to avoid wasting memory on machines known to have
>>> + * split their machine address space in several big discontinuous and highly
>>> + * disjoint chunks.
>>> + *
>>> + * In its uncompressed form the frame table must have book-keeping metadata
>>> + * structures for every page between [0, max_mfn) (whether they are backed
>>> + * by RAM or not), and a similar condition exists for the direct map. We
>>> + * know some systems, however, that have some sparsity in their address
>>> + * space, leading to a lot of wastage in the form of unused frame table
>>> + * entries.
>>> + *
>>> + * This is where compression becomes useful. The idea is to note that if
>>> + * you have several big chunks of memory sufficiently far apart you can
>>> + * ignore the middle part of the address because it will always contain
>>> + * zeroes as long as the base address is sufficiently well aligned and the
>>> + * length of the region is much smaller than the base address.
>>
>> As per above alignment of the base address doesn't really matter.
> Where above?

... what "above" here meant.

> As far as I understand you need enough alignment to cover the
> hole or you won't have zeroes to compress. Point in case:
> 
>   * region1: [0x0000000000000000 -
>               0x00000000FFFFFFFF]
> 
>   * region2: [0x0001FFFFFFFFF000 -
>               0x00020000FFFFFFFF]
> 
> I can agree this configuration is beyond dumb and statistically unlikely to
> exist in the wild, but it should (IMO) still be covered by that comment.

Right, but this isn't relevant here - in such a case no compression
can occur, yes, but not (just) because of missing alignment. See the
example I gave above (in the earlier reply) for where alignment
clearly doesn't matter for compression to be possible.

Jan
Alejandro Vallejo July 10, 2023, 4:12 p.m. UTC | #4
On Mon, Jul 10, 2023 at 09:43:34AM +0200, Jan Beulich wrote:
> This is ...
> >>> [snip]
> >>> + * This is where compression becomes useful. The idea is to note that if
> >>> + * you have several big chunks of memory sufficiently far apart you can
> >>> + * ignore the middle part of the address because it will always contain
> >>> + * zeroes as long as the base address is sufficiently well aligned and the
> >>> + * length of the region is much smaller than the base address.
> >>
> >> As per above alignment of the base address doesn't really matter.
> > Where above?
> 
> ... what "above" here meant.
> 
> > As far as I understand you need enough alignment to cover the
> > hole or you won't have zeroes to compress. Point in case:
> > 
> >   * region1: [0x0000000000000000 -
> >               0x00000000FFFFFFFF]
> > 
> >   * region2: [0x0001FFFFFFFFF000 -
> >               0x00020000FFFFFFFF]
> > 
> > I can agree this configuration is beyond dumb and statistically unlikely to
> > exist in the wild, but it should (IMO) still be covered by that comment.
> 
> Right, but this isn't relevant here - in such a case no compression
> can occur, yes, but not (just) because of missing alignment. See the
> example I gave above (in the earlier reply) for where alignment
> clearly doesn't matter for compression to be possible.
> 
> Jan
Fair enough. Then I think we can simply drop the last sentence and be done
with it.

So the paragraph becomes:
```
   * This is where compression becomes useful. The idea is to note that if
   * you have several big chunks of memory sufficiently far apart you can
   * ignore the middle part of the address because it will always contain
   * zeroes.
```
The details on when or how compression is possible are implicit in the
following example anyway.

Would that, combined with v3, take care of everything? If not, I'll just
merge further patches with the pdx series I have on hold and send
everything together.

Thanks,

Alejandro
Jan Beulich July 10, 2023, 4:32 p.m. UTC | #5
On 10.07.2023 18:12, Alejandro Vallejo wrote:
> On Mon, Jul 10, 2023 at 09:43:34AM +0200, Jan Beulich wrote:
>> This is ...
>>>>> [snip]
>>>>> + * This is where compression becomes useful. The idea is to note that if
>>>>> + * you have several big chunks of memory sufficiently far apart you can
>>>>> + * ignore the middle part of the address because it will always contain
>>>>> + * zeroes as long as the base address is sufficiently well aligned and the
>>>>> + * length of the region is much smaller than the base address.
>>>>
>>>> As per above alignment of the base address doesn't really matter.
>>> Where above?
>>
>> ... what "above" here meant.
>>
>>> As far as I understand you need enough alignment to cover the
>>> hole or you won't have zeroes to compress. Point in case:
>>>
>>>   * region1: [0x0000000000000000 -
>>>               0x00000000FFFFFFFF]
>>>
>>>   * region2: [0x0001FFFFFFFFF000 -
>>>               0x00020000FFFFFFFF]
>>>
>>> I can agree this configuration is beyond dumb and statistically unlikely to
>>> exist in the wild, but it should (IMO) still be covered by that comment.
>>
>> Right, but this isn't relevant here - in such a case no compression
>> can occur, yes, but not (just) because of missing alignment. See the
>> example I gave above (in the earlier reply) for where alignment
>> clearly doesn't matter for compression to be possible.
>>
>> Jan
> Fair enough. Then I think we can simply drop the last sentence and be done
> with it.
> 
> So the paragraph becomes:
> ```
>    * This is where compression becomes useful. The idea is to note that if
>    * you have several big chunks of memory sufficiently far apart you can
>    * ignore the middle part of the address because it will always contain
>    * zeroes.
> ```
> The details on when or how compression is possible are implicit in the
> following example anyway.
> 
> Would that, combined with v3, take care of everything?

I'm yet to look at v3. Sorry, too many things going on / pending.

Jan

Patch

diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index c91875fabe..27cdb49e17 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -20,13 +20,56 @@ 
 #include <xen/bitops.h>
 #include <xen/nospec.h>
 
-/* Parameters for PFN/MADDR compression. */
+/*
+ * Diagram to make sense of the following variables. The masks and shifts
+ * are done on mfn values in order to convert to/from pdx:
+ *
+ *                      pfn_hole_mask
+ *                      pfn_pdx_hole_shift (mask bitsize)
+ *                      |
+ *                 |---------|
+ *                 |         |
+ *                 V         V
+ *         --------------------------
+ *         |HHHHHHH|000000000|LLLLLL| <--- mfn
+ *         --------------------------
+ *         ^       ^         ^      ^
+ *         |       |         |------|
+ *         |       |             |
+ *         |       |             pfn_pdx_bottom_mask
+ *         |       |
+ *         |-------|
+ *             |
+ *             pfn_top_mask
+ *
+ * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
+ * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
+ * ones.
+ */
+
+/** Maximum (non-inclusive) usable pdx */
 unsigned long __read_mostly max_pdx;
+
+/** Mask for the lower non-compressible bits of an mfn */
 unsigned long __read_mostly pfn_pdx_bottom_mask = ~0UL;
+
+/** Mask for the lower non-compressible bits of an maddr or vaddr */
 unsigned long __read_mostly ma_va_bottom_mask = ~0UL;
+
+/** Mask for the higher non-compressible bits of an mfn */
 unsigned long __read_mostly pfn_top_mask = 0;
+
+/** Mask for the higher non-compressible bits of an maddr or vaddr */
 unsigned long __read_mostly ma_top_mask = 0;
+
+/**
+ * Mask for a pdx compression bit slice.
+ *
+ *  Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
+ */
 unsigned long __read_mostly pfn_hole_mask = 0;
+
+/** Number of bits of the "compressible" bit slice of an mfn */
 unsigned int __read_mostly pfn_pdx_hole_shift = 0;
 
 unsigned long __read_mostly pdx_group_valid[BITS_TO_LONGS(
@@ -42,7 +85,7 @@  bool __mfn_valid(unsigned long mfn)
 }
 
 /* Sets all bits from the most-significant 1-bit down to the LSB */
-static u64 __init fill_mask(u64 mask)
+static uint64_t __init fill_mask(uint64_t mask)
 {
     while (mask & (mask + 1))
         mask |= mask + 1;
@@ -57,9 +100,25 @@  uint64_t __init pdx_init_mask(uint64_t base_addr)
                          (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
 }
 
-u64 __init pdx_region_mask(u64 base, u64 len)
+uint64_t __init pdx_region_mask(uint64_t base, uint64_t len)
 {
-    return fill_mask(base ^ (base + len - 1));
+    uint64_t last = base + len - 1;
+    /*
+     * The only bit that matters in base^last is the MSB. There are 2 cases.
+     *
+     * case msb(base) < msb(last):
+     *     then msb(fill_mask(base^last)) == msb(last). This is non
+     *     compressible.
+     * case msb(base) == msb(last):
+     *     This means that there _may_ be a sequence of compressible zeroes
+     *     for all addresses between `base` and `last` iff `base` has enough
+     *     trailing zeroes. That is, it's compressible when
+     *     msb(fill_mask(base^last)) < msb(last)
+     *
+     * The resulting mask is effectively the moving bits between `base` and
+     * `last`.
+     */
+    return fill_mask(base ^ last);
 }
 
 void set_pdx_range(unsigned long smfn, unsigned long emfn)
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index b0dc3ba9c9..7e4b190357 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -31,6 +31,16 @@ 
  *   (i.e. all devices assigned to) a guest share a single DMA address space
  *   and, by default, Xen will ensure dfn == pfn.
  *
+ * pdx: Page InDeX
+ *   Indices into the frame table holding the per-page's book-keeping
+ *   metadata. A compression scheme is used and there's a possibly non
+ *   identity mapping between valid(mfn) <-> valid(pdx). In particular, the
+ *   mapping happens to be identity when the ranges are not compressible.
+ *   See the comments in pdx.c for an in-depth explanation of that mapping.
+ *
+ * maddr: Machine Address
+ *   The physical address that corresponds to an mfn
+ *
  * WARNING: Some of these terms have changed over time while others have been
  * used inconsistently, meaning that a lot of existing code does not match the
  * definitions above.  New code should use these terms as described here, and
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 9fcfb0ce52..b3ff2aad09 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -1,6 +1,67 @@ 
 #ifndef __XEN_PDX_H__
 #define __XEN_PDX_H__
 
+/*
+ * PDX (Page inDeX)
+ *
+ * This file deals with optimisations pertaining frame table indexing,
+ * A pdx is an index into the frame table. However, having an identity
+ * relationship between mfn and pdx could waste copious amounts of memory
+ * in empty frame table entries. There are some techniques to bring memory
+ * wastage down.
+ *
+ * ## PDX grouping
+ *
+ * The frame table may have some sparsity even on systems where the memory
+ * banks are tightly packed. This is due to system quirks (like the PCI
+ * hole) which might introduce several GiB of unused page frame numbers
+ * that uselessly waste memory in the frame table. PDX grouping addresses
+ * this by keeping a bitmap of the ranges in the frame table containing
+ * invalid entries and not allocating backing memory for them.
+ *
+ * ## PDX compression
+ *
+ * This is a technique to avoid wasting memory on machines known to have
+ * split their machine address space in several big discontinuous and highly
+ * disjoint chunks.
+ *
+ * In its uncompressed form the frame table must have book-keeping metadata
+ * structures for every page in [0, max_mfn) (whether they are backed
+ * by RAM or not), and a similar condition exists for the direct map. We
+ * know some systems, however, that have some sparsity in their address
+ * space, leading to a lot of wastage in the form of unused frame table
+ * entries.
+ *
+ * This is where compression becomes useful. The idea is to note that if
+ * you have several big chunks of memory sufficiently far apart you can
+ * ignore the middle part of the address because it will always contain
+ * zeroes as long as the base address is sufficiently well aligned and the
+ * length of the region is much smaller than the base address.
+ *
+ * i.e:
+ *   Consider 2 regions of memory. One starts at 0 while the other starts
+ *   at offset 2^off_h. Furthermore, let's assume both regions are smaller
+ *   than 2^off_l. This means that all addresses between [2^off_l, 2^off_h)
+ *   are invalid and so bits [off_l, off_h) are zero in every valid address.
+ *
+ *                 off_h     off_l
+ *                 |         |
+ *                 V         V
+ *         --------------------------
+ *         |HHHHHHH|000000000|LLLLLL| <--- mfn
+ *         --------------------------
+ *           ^ |
+ *           | | (de)compression by adding/removing "useless" zeroes
+ *           | V
+ *         ---------------
+ *         |HHHHHHHLLLLLL| <--- pdx
+ *         ---------------
+ *
+ * This scheme also holds for multiple regions, where HHHHHHH acts as
+ * the region identifier and LLLLLL fully contains the span of every
+ * region involved.
+ */
+
 #ifdef CONFIG_HAS_PDX
 
 extern unsigned long max_pdx;
@@ -13,22 +74,81 @@  extern unsigned long pfn_top_mask, ma_top_mask;
                          (sizeof(*frame_table) & -sizeof(*frame_table)))
 extern unsigned long pdx_group_valid[];
 
-extern uint64_t pdx_init_mask(u64 base_addr);
-extern u64 pdx_region_mask(u64 base, u64 len);
+/**
+ * Calculates a mask covering "moving" bits of all addresses of a region
+ *
+ * The i-th bit of the mask is set iff there are two addresses in the
+ * region whose j-th bits differ, for some j >= i.
+ *
+ * e.g:
+ *       base=0x1B00000000
+ *   len+base=0x1B00082000
+ *
+ *   ought to return 0x00000FFFFF, which implies that every bit position
+ *   with a zero in the mask remains unchanged in every address of the
+ *   region.
+ *
+ * @param base Base address of the region
+ * @param len  Size in octets of the region
+ * @return Mask of moving bits at the bottom of all the region addresses
+ */
+uint64_t pdx_region_mask(uint64_t base, uint64_t len);
 
-extern void set_pdx_range(unsigned long smfn, unsigned long emfn);
+/**
+ * Creates a basic pdx mask
+ *
+ * This mask is used to determine non-compressible bits. No address in the
+ * system shall have compressible bits covered by this initial mask.
+ *
+ * It's the larger of:
+ *   - A mask of MAX_ORDER + PAGE_SHIFT 1s
+ *   - base_addr rounded up to the nearest `2^n - 1`
+ *
+ * @param base_addr Address of the first maddr in the system
+ * @return An integer of the form 2^n - 1
+ */
+uint64_t pdx_init_mask(uint64_t base_addr);
+
+/**
+ * Mark [smfn, emfn) as allocatable in the frame table
+ *
+ * @param smfn Start mfn
+ * @param emfn End mfn
+ */
+void set_pdx_range(unsigned long smfn, unsigned long emfn);
 
 #define page_to_pdx(pg)  ((pg) - frame_table)
 #define pdx_to_page(pdx) gcc11_wrap(frame_table + (pdx))
 
+/**
+ * Invoked to determine if an mfn has an associated valid frame table entry
+ *
+ * In order for the mfn to be considered valid it must pass bounds,
+ * grouping and compression sanity checks.
+ *
+ * @param mfn To-be-checked mfn
+ * @return True iff all checks pass
+ */
 bool __mfn_valid(unsigned long mfn);
 
+/**
+ * Map pfn to its corresponding pdx
+ *
+ * @param pfn Frame number
+ * @return Obtained pdx after compressing the pfn
+ */
 static inline unsigned long pfn_to_pdx(unsigned long pfn)
 {
     return (pfn & pfn_pdx_bottom_mask) |
            ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
 }
 
+/**
+ * Map a pdx to its corresponding pfn
+ *
+ * @param pdx Page index
+ * @return Obtained pfn after decompressing the pdx
+ */
 static inline unsigned long pdx_to_pfn(unsigned long pdx)
 {
     return (pdx & pfn_pdx_bottom_mask) |
@@ -38,7 +158,16 @@  static inline unsigned long pdx_to_pfn(unsigned long pdx)
 #define mfn_to_pdx(mfn) pfn_to_pdx(mfn_x(mfn))
 #define pdx_to_mfn(pdx) _mfn(pdx_to_pfn(pdx))
 
-extern void pfn_pdx_hole_setup(unsigned long);
+/**
+ * Initializes global variables with information about the compressible
+ * range of the current memory regions.
+ *
+ * @param mask This mask is the OR of the pdx_region_mask() of every
+ *             region in the system and of all their base addresses. The
+ *             result is a mask where every zero in a bit position marks
+ *             a potentially compressible bit.
+ */
+void pfn_pdx_hole_setup(unsigned long mask);
 
 #endif /* HAS_PDX */
 #endif /* __XEN_PDX_H__ */