diff mbox series

xen/page_alloc: Simplify domain_adjust_tot_pages

Message ID 20250224132724.9074-1-alejandro.vallejo@cloud.com (mailing list archive)
State New
Headers show
Series xen/page_alloc: Simplify domain_adjust_tot_pages | expand

Commit Message

Alejandro Vallejo Feb. 24, 2025, 1:27 p.m. UTC
The logic has too many levels of indirection and it's very hard to
understand it its current form. Split it between the corner case where
the adjustment is bigger than the current claim and the rest to avoid 5
auxiliary variables.

While at it, fix incorrect field name in nearby comment.

Not a functional change (although by its nature it might not seem
immediately obvious at first).

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
---
 xen/common/page_alloc.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

Comments

Alejandro Vallejo Feb. 24, 2025, 2:49 p.m. UTC | #1
Open question to whoever reviews this...

On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
>      spin_lock(&heap_lock);
> -    /* adjust domain outstanding pages; may not go negative */
> -    dom_before = d->outstanding_pages;
> -    dom_after = dom_before - pages;
> -    BUG_ON(dom_before < 0);
> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> -    d->outstanding_pages = dom_claimed;
> -    /* flag accounting bug if system outstanding_claims would go negative */
> -    sys_before = outstanding_claims;
> -    sys_after = sys_before - (dom_before - dom_claimed);
> -    BUG_ON(sys_after < 0);
> -    outstanding_claims = sys_after;
> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> +    if ( pages > 0 && d->outstanding_pages < pages )
> +    {
> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> +        outstanding_claims -= d->outstanding_pages;
> +        d->outstanding_pages = 0;

While this matches the previous behaviour, do we _really_ want it? It's weird,
quirky, and it hard to extend to NUMA-aware claims (which is something in
midway through).

Wouldn't it make sense to fail the allocation (earlier) if the claim has run
out? Do we even expect this to ever happen this late in the allocation call
chain?

> +    } else {
> +        outstanding_claims -= pages;
> +        d->outstanding_pages -= pages;
> +    }
>      spin_unlock(&heap_lock);
>  
>  out:

Cheers,
Alejandro
Roger Pau Monné Feb. 26, 2025, 1:56 p.m. UTC | #2
On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> The logic has too many levels of indirection and it's very hard to
> understand it its current form. Split it between the corner case where
> the adjustment is bigger than the current claim and the rest to avoid 5
> auxiliary variables.
> 
> While at it, fix incorrect field name in nearby comment.
> 
> Not a functional change (although by its nature it might not seem
> immediately obvious at first).
> 
> Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
> ---
>  xen/common/page_alloc.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 1bf070c8c5df..4306753f0bd2 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>  
>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>  {
> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> -
>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>      d->tot_pages += pages;
>  
>      /*
> -     * can test d->claimed_pages race-free because it can only change
> +     * can test d->outstanding_pages race-free because it can only change
>       * if d->page_alloc_lock and heap_lock are both held, see also
>       * domain_set_outstanding_pages below
>       */
> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>          goto out;

I think you can probably short-circuit the logic below if pages == 0?
(and avoid taking the heap_lock)

>  
>      spin_lock(&heap_lock);
> -    /* adjust domain outstanding pages; may not go negative */
> -    dom_before = d->outstanding_pages;
> -    dom_after = dom_before - pages;
> -    BUG_ON(dom_before < 0);
> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> -    d->outstanding_pages = dom_claimed;
> -    /* flag accounting bug if system outstanding_claims would go negative */
> -    sys_before = outstanding_claims;
> -    sys_after = sys_before - (dom_before - dom_claimed);
> -    BUG_ON(sys_after < 0);
> -    outstanding_claims = sys_after;
> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> +    if ( pages > 0 && d->outstanding_pages < pages )
> +    {
> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> +        outstanding_claims -= d->outstanding_pages;
> +        d->outstanding_pages = 0;
> +    } else {
> +        outstanding_claims -= pages;
> +        d->outstanding_pages -= pages;

I wonder if it's intentional for a pages < 0 value to modify
outstanding_claims and d->outstanding_pages, I think those values
should only be set from domain_set_outstanding_pages().
domain_adjust_tot_pages() should only decrease the value, but never
increase either outstanding_claims or d->outstanding_pages.

At best the behavior is inconsistent, because once
d->outstanding_pages reaches 0 there will be no further modification
from domain_adjust_tot_pages().

I think it would be best if outstanding_claims and
d->outstanding_pages was only modified if pages > 0.

Thanks, Roger.
Jan Beulich Feb. 26, 2025, 2:02 p.m. UTC | #3
On 24.02.2025 14:27, Alejandro Vallejo wrote:
> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>          goto out;
>  
>      spin_lock(&heap_lock);
> -    /* adjust domain outstanding pages; may not go negative */
> -    dom_before = d->outstanding_pages;
> -    dom_after = dom_before - pages;
> -    BUG_ON(dom_before < 0);
> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> -    d->outstanding_pages = dom_claimed;
> -    /* flag accounting bug if system outstanding_claims would go negative */
> -    sys_before = outstanding_claims;
> -    sys_after = sys_before - (dom_before - dom_claimed);
> -    BUG_ON(sys_after < 0);
> -    outstanding_claims = sys_after;
> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> +    if ( pages > 0 && d->outstanding_pages < pages )

The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity,
after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite
right.

> +    {
> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> +        outstanding_claims -= d->outstanding_pages;
> +        d->outstanding_pages = 0;
> +    } else {

Nit: Braces on their own lines please.

In any event - yes, this reads quite a bit easier after the adjustment.

With the adjustments (happy to make while committing, so long as you agree)
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

> +        outstanding_claims -= pages;
> +        d->outstanding_pages -= pages;
> +    }
>      spin_unlock(&heap_lock);
>  
>  out:
Jan Beulich Feb. 26, 2025, 2:05 p.m. UTC | #4
On 24.02.2025 15:49, Alejandro Vallejo wrote:
> Open question to whoever reviews this...
> 
> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
>>      spin_lock(&heap_lock);
>> -    /* adjust domain outstanding pages; may not go negative */
>> -    dom_before = d->outstanding_pages;
>> -    dom_after = dom_before - pages;
>> -    BUG_ON(dom_before < 0);
>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>> -    d->outstanding_pages = dom_claimed;
>> -    /* flag accounting bug if system outstanding_claims would go negative */
>> -    sys_before = outstanding_claims;
>> -    sys_after = sys_before - (dom_before - dom_claimed);
>> -    BUG_ON(sys_after < 0);
>> -    outstanding_claims = sys_after;
>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>> +    if ( pages > 0 && d->outstanding_pages < pages )
>> +    {
>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>> +        outstanding_claims -= d->outstanding_pages;
>> +        d->outstanding_pages = 0;
> 
> While this matches the previous behaviour, do we _really_ want it? It's weird,
> quirky, and it hard to extend to NUMA-aware claims (which is something in
> midway through).
> 
> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> out? Do we even expect this to ever happen this late in the allocation call
> chain?

This goes back to what a "claim" means. Even without any claim, a domain may
allocate memory. So a claim having run out doesn't imply allocation has to
fail.

NUMA-aware claims require more than an adjustment just here, I expect. Tracking
of claims (certainly the global, maybe also the per-domain value) would likely
need to become per-node, for example.

Jan
Jan Beulich Feb. 26, 2025, 2:08 p.m. UTC | #5
On 26.02.2025 14:56, Roger Pau Monné wrote:
> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>  
>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>  {
>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>> -
>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>      d->tot_pages += pages;
>>  
>>      /*
>> -     * can test d->claimed_pages race-free because it can only change
>> +     * can test d->outstanding_pages race-free because it can only change
>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>       * domain_set_outstanding_pages below
>>       */
>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>          goto out;
> 
> I think you can probably short-circuit the logic below if pages == 0?
> (and avoid taking the heap_lock)

Are there callers passing in 0?

>>      spin_lock(&heap_lock);
>> -    /* adjust domain outstanding pages; may not go negative */
>> -    dom_before = d->outstanding_pages;
>> -    dom_after = dom_before - pages;
>> -    BUG_ON(dom_before < 0);
>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>> -    d->outstanding_pages = dom_claimed;
>> -    /* flag accounting bug if system outstanding_claims would go negative */
>> -    sys_before = outstanding_claims;
>> -    sys_after = sys_before - (dom_before - dom_claimed);
>> -    BUG_ON(sys_after < 0);
>> -    outstanding_claims = sys_after;
>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>> +    if ( pages > 0 && d->outstanding_pages < pages )
>> +    {
>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>> +        outstanding_claims -= d->outstanding_pages;
>> +        d->outstanding_pages = 0;
>> +    } else {
>> +        outstanding_claims -= pages;
>> +        d->outstanding_pages -= pages;
> 
> I wonder if it's intentional for a pages < 0 value to modify
> outstanding_claims and d->outstanding_pages, I think those values
> should only be set from domain_set_outstanding_pages().
> domain_adjust_tot_pages() should only decrease the value, but never
> increase either outstanding_claims or d->outstanding_pages.
> 
> At best the behavior is inconsistent, because once
> d->outstanding_pages reaches 0 there will be no further modification
> from domain_adjust_tot_pages().

Right, at that point the claim has run out. While freeing pages with an
active claim means that the claim gets bigger (which naturally needs
reflecting in the global).

Jan
Roger Pau Monné Feb. 26, 2025, 2:18 p.m. UTC | #6
On Mon, Feb 24, 2025 at 02:49:48PM +0000, Alejandro Vallejo wrote:
> Open question to whoever reviews this...
> 
> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >      spin_lock(&heap_lock);
> > -    /* adjust domain outstanding pages; may not go negative */
> > -    dom_before = d->outstanding_pages;
> > -    dom_after = dom_before - pages;
> > -    BUG_ON(dom_before < 0);
> > -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> > -    d->outstanding_pages = dom_claimed;
> > -    /* flag accounting bug if system outstanding_claims would go negative */
> > -    sys_before = outstanding_claims;
> > -    sys_after = sys_before - (dom_before - dom_claimed);
> > -    BUG_ON(sys_after < 0);
> > -    outstanding_claims = sys_after;
> > +    BUG_ON(outstanding_claims < d->outstanding_pages);
> > +    if ( pages > 0 && d->outstanding_pages < pages )
> > +    {
> > +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> > +        outstanding_claims -= d->outstanding_pages;
> > +        d->outstanding_pages = 0;
> 
> While this matches the previous behaviour, do we _really_ want it? It's weird,
> quirky, and it hard to extend to NUMA-aware claims (which is something in
> midway through).
> 
> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> out? Do we even expect this to ever happen this late in the allocation call
> chain?

I'm unsure.  This is the case where more memory than initially claimed
has been allocated, but by the time domain_adjust_tot_pages() gets
called the memory has already been allocated, so it's kind of
unhelpful to fail by then.

I think any caller that requests more memory than what has been
initially claimed for the domain should be prepared to deal with such
allocation failing.  This quirky handling is very likely a workaround
for the miscellaneous differences between the memory accounted by the
toolstack for a guest vs the memory really used by such guest.  I bet
if you limit a guest to strictly only allocate up to
d->outstanding_pages domain creation will fail.

In general the toolstack memory calculations are not fully accurate,
see for example how vmx_alloc_vlapic_mapping() allocates a domheap
page which very likely the toolstack won't have accounted for.  There
are likely other examples that would possibly break the accounting
done by the toolstack.

Thanks, Roger.
Roger Pau Monné Feb. 26, 2025, 2:28 p.m. UTC | #7
On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> On 26.02.2025 14:56, Roger Pau Monné wrote:
> > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> >> --- a/xen/common/page_alloc.c
> >> +++ b/xen/common/page_alloc.c
> >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> >>  
> >>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>  {
> >> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> >> -
> >>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> >>      d->tot_pages += pages;
> >>  
> >>      /*
> >> -     * can test d->claimed_pages race-free because it can only change
> >> +     * can test d->outstanding_pages race-free because it can only change
> >>       * if d->page_alloc_lock and heap_lock are both held, see also
> >>       * domain_set_outstanding_pages below
> >>       */
> >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>          goto out;
> > 
> > I think you can probably short-circuit the logic below if pages == 0?
> > (and avoid taking the heap_lock)
> 
> Are there callers passing in 0?

Not sure, but if there are no callers expected we might add an ASSERT
to that effect then.

> >>      spin_lock(&heap_lock);
> >> -    /* adjust domain outstanding pages; may not go negative */
> >> -    dom_before = d->outstanding_pages;
> >> -    dom_after = dom_before - pages;
> >> -    BUG_ON(dom_before < 0);
> >> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >> -    d->outstanding_pages = dom_claimed;
> >> -    /* flag accounting bug if system outstanding_claims would go negative */
> >> -    sys_before = outstanding_claims;
> >> -    sys_after = sys_before - (dom_before - dom_claimed);
> >> -    BUG_ON(sys_after < 0);
> >> -    outstanding_claims = sys_after;
> >> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >> +    if ( pages > 0 && d->outstanding_pages < pages )
> >> +    {
> >> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >> +        outstanding_claims -= d->outstanding_pages;
> >> +        d->outstanding_pages = 0;
> >> +    } else {
> >> +        outstanding_claims -= pages;
> >> +        d->outstanding_pages -= pages;
> > 
> > I wonder if it's intentional for a pages < 0 value to modify
> > outstanding_claims and d->outstanding_pages, I think those values
> > should only be set from domain_set_outstanding_pages().
> > domain_adjust_tot_pages() should only decrease the value, but never
> > increase either outstanding_claims or d->outstanding_pages.
> > 
> > At best the behavior is inconsistent, because once
> > d->outstanding_pages reaches 0 there will be no further modification
> > from domain_adjust_tot_pages().
> 
> Right, at that point the claim has run out. While freeing pages with an
> active claim means that the claim gets bigger (which naturally needs
> reflecting in the global).

domain_adjust_tot_pages() is not exclusively called when freeing
pages, see steal_page() for example.

When called from steal_page() it's wrong to increase the claim, as
it assumes that the page removed from d->tot_pages is freed, but
that's not the case.  The domain might end up in a situation where
the claim is bigger than the available amount of memory.

Thanks, Roger.
Jan Beulich Feb. 26, 2025, 2:36 p.m. UTC | #8
On 26.02.2025 15:28, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>> --- a/xen/common/page_alloc.c
>>>> +++ b/xen/common/page_alloc.c
>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>  
>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>  {
>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>> -
>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>      d->tot_pages += pages;
>>>>  
>>>>      /*
>>>> -     * can test d->claimed_pages race-free because it can only change
>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>       * domain_set_outstanding_pages below
>>>>       */
>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>          goto out;
>>>
>>> I think you can probably short-circuit the logic below if pages == 0?
>>> (and avoid taking the heap_lock)
>>
>> Are there callers passing in 0?
> 
> Not sure, but if there are no callers expected we might add an ASSERT
> to that effect then.
> 
>>>>      spin_lock(&heap_lock);
>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>> -    dom_before = d->outstanding_pages;
>>>> -    dom_after = dom_before - pages;
>>>> -    BUG_ON(dom_before < 0);
>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>> -    d->outstanding_pages = dom_claimed;
>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>> -    sys_before = outstanding_claims;
>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>> -    BUG_ON(sys_after < 0);
>>>> -    outstanding_claims = sys_after;
>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>> +    {
>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>> +        outstanding_claims -= d->outstanding_pages;
>>>> +        d->outstanding_pages = 0;
>>>> +    } else {
>>>> +        outstanding_claims -= pages;
>>>> +        d->outstanding_pages -= pages;
>>>
>>> I wonder if it's intentional for a pages < 0 value to modify
>>> outstanding_claims and d->outstanding_pages, I think those values
>>> should only be set from domain_set_outstanding_pages().
>>> domain_adjust_tot_pages() should only decrease the value, but never
>>> increase either outstanding_claims or d->outstanding_pages.
>>>
>>> At best the behavior is inconsistent, because once
>>> d->outstanding_pages reaches 0 there will be no further modification
>>> from domain_adjust_tot_pages().
>>
>> Right, at that point the claim has run out. While freeing pages with an
>> active claim means that the claim gets bigger (which naturally needs
>> reflecting in the global).
> 
> domain_adjust_tot_pages() is not exclusively called when freeing
> pages, see steal_page() for example.

Or also when pages were allocated. steal_page() ...

> When called from steal_page() it's wrong to increase the claim, as
> it assumes that the page removed from d->tot_pages is freed, but
> that's not the case.  The domain might end up in a situation where
> the claim is bigger than the available amount of memory.

... is a case that likely wasn't considered when the feature was added.

I never really liked this; I'd be quite happy to see it ripped out, as
long as we'd be reasonably certain it isn't in active use by people.

Jan
Roger Pau Monné Feb. 26, 2025, 4:04 p.m. UTC | #9
On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
> On 26.02.2025 15:28, Roger Pau Monné wrote:
> > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> >> On 26.02.2025 14:56, Roger Pau Monné wrote:
> >>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> >>>> --- a/xen/common/page_alloc.c
> >>>> +++ b/xen/common/page_alloc.c
> >>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> >>>>  
> >>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>  {
> >>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> >>>> -
> >>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> >>>>      d->tot_pages += pages;
> >>>>  
> >>>>      /*
> >>>> -     * can test d->claimed_pages race-free because it can only change
> >>>> +     * can test d->outstanding_pages race-free because it can only change
> >>>>       * if d->page_alloc_lock and heap_lock are both held, see also
> >>>>       * domain_set_outstanding_pages below
> >>>>       */
> >>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>          goto out;
> >>>
> >>> I think you can probably short-circuit the logic below if pages == 0?
> >>> (and avoid taking the heap_lock)
> >>
> >> Are there callers passing in 0?
> > 
> > Not sure, but if there are no callers expected we might add an ASSERT
> > to that effect then.
> > 
> >>>>      spin_lock(&heap_lock);
> >>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>> -    dom_before = d->outstanding_pages;
> >>>> -    dom_after = dom_before - pages;
> >>>> -    BUG_ON(dom_before < 0);
> >>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>> -    d->outstanding_pages = dom_claimed;
> >>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>> -    sys_before = outstanding_claims;
> >>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>> -    BUG_ON(sys_after < 0);
> >>>> -    outstanding_claims = sys_after;
> >>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>> +    {
> >>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>> +        outstanding_claims -= d->outstanding_pages;
> >>>> +        d->outstanding_pages = 0;
> >>>> +    } else {
> >>>> +        outstanding_claims -= pages;
> >>>> +        d->outstanding_pages -= pages;
> >>>
> >>> I wonder if it's intentional for a pages < 0 value to modify
> >>> outstanding_claims and d->outstanding_pages, I think those values
> >>> should only be set from domain_set_outstanding_pages().
> >>> domain_adjust_tot_pages() should only decrease the value, but never
> >>> increase either outstanding_claims or d->outstanding_pages.
> >>>
> >>> At best the behavior is inconsistent, because once
> >>> d->outstanding_pages reaches 0 there will be no further modification
> >>> from domain_adjust_tot_pages().
> >>
> >> Right, at that point the claim has run out. While freeing pages with an
> >> active claim means that the claim gets bigger (which naturally needs
> >> reflecting in the global).
> > 
> > domain_adjust_tot_pages() is not exclusively called when freeing
> > pages, see steal_page() for example.
> 
> Or also when pages were allocated. steal_page() ...
> 
> > When called from steal_page() it's wrong to increase the claim, as
> > it assumes that the page removed from d->tot_pages is freed, but
> > that's not the case.  The domain might end up in a situation where
> > the claim is bigger than the available amount of memory.
> 
> ... is a case that likely wasn't considered when the feature was added.
> 
> I never really liked this; I'd be quite happy to see it ripped out, as
> long as we'd be reasonably certain it isn't in active use by people.

What do you mean with 'it' in the above sentence, the whole claim
stuff?  Or just getting rid of allowing the claim to increase as a
result of pages being removed from a domain?

Thanks, Roger.
Jan Beulich Feb. 26, 2025, 4:06 p.m. UTC | #10
On 26.02.2025 17:04, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>> --- a/xen/common/page_alloc.c
>>>>>> +++ b/xen/common/page_alloc.c
>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>  
>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>  {
>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>> -
>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>      d->tot_pages += pages;
>>>>>>  
>>>>>>      /*
>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>       * domain_set_outstanding_pages below
>>>>>>       */
>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>          goto out;
>>>>>
>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>> (and avoid taking the heap_lock)
>>>>
>>>> Are there callers passing in 0?
>>>
>>> Not sure, but if there are no callers expected we might add an ASSERT
>>> to that effect then.
>>>
>>>>>>      spin_lock(&heap_lock);
>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>> -    dom_before = d->outstanding_pages;
>>>>>> -    dom_after = dom_before - pages;
>>>>>> -    BUG_ON(dom_before < 0);
>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>> -    sys_before = outstanding_claims;
>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>> -    BUG_ON(sys_after < 0);
>>>>>> -    outstanding_claims = sys_after;
>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>> +    {
>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>> +        d->outstanding_pages = 0;
>>>>>> +    } else {
>>>>>> +        outstanding_claims -= pages;
>>>>>> +        d->outstanding_pages -= pages;
>>>>>
>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>> should only be set from domain_set_outstanding_pages().
>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>
>>>>> At best the behavior is inconsistent, because once
>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>> from domain_adjust_tot_pages().
>>>>
>>>> Right, at that point the claim has run out. While freeing pages with an
>>>> active claim means that the claim gets bigger (which naturally needs
>>>> reflecting in the global).
>>>
>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>> pages, see steal_page() for example.
>>
>> Or also when pages were allocated. steal_page() ...
>>
>>> When called from steal_page() it's wrong to increase the claim, as
>>> it assumes that the page removed from d->tot_pages is freed, but
>>> that's not the case.  The domain might end up in a situation where
>>> the claim is bigger than the available amount of memory.
>>
>> ... is a case that likely wasn't considered when the feature was added.
>>
>> I never really liked this; I'd be quite happy to see it ripped out, as
>> long as we'd be reasonably certain it isn't in active use by people.
> 
> What do you mean with 'it' in the above sentence, the whole claim
> stuff?

Yes.

>  Or just getting rid of allowing the claim to increase as a
> result of pages being removed from a domain?

No.

Jan
Andrew Cooper Feb. 26, 2025, 4:34 p.m. UTC | #11
On 26/02/2025 4:06 pm, Jan Beulich wrote:
> On 26.02.2025 17:04, Roger Pau Monné wrote:
>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>>> --- a/xen/common/page_alloc.c
>>>>>>> +++ b/xen/common/page_alloc.c
>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>>  
>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>  {
>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>>> -
>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>>      d->tot_pages += pages;
>>>>>>>  
>>>>>>>      /*
>>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>>       * domain_set_outstanding_pages below
>>>>>>>       */
>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>          goto out;
>>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>>> (and avoid taking the heap_lock)
>>>>> Are there callers passing in 0?
>>>> Not sure, but if there are no callers expected we might add an ASSERT
>>>> to that effect then.
>>>>
>>>>>>>      spin_lock(&heap_lock);
>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>>> -    dom_before = d->outstanding_pages;
>>>>>>> -    dom_after = dom_before - pages;
>>>>>>> -    BUG_ON(dom_before < 0);
>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>>> -    sys_before = outstanding_claims;
>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>>> -    BUG_ON(sys_after < 0);
>>>>>>> -    outstanding_claims = sys_after;
>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>>> +    {
>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>>> +        d->outstanding_pages = 0;
>>>>>>> +    } else {
>>>>>>> +        outstanding_claims -= pages;
>>>>>>> +        d->outstanding_pages -= pages;
>>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>>> should only be set from domain_set_outstanding_pages().
>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>>
>>>>>> At best the behavior is inconsistent, because once
>>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>>> from domain_adjust_tot_pages().
>>>>> Right, at that point the claim has run out. While freeing pages with an
>>>>> active claim means that the claim gets bigger (which naturally needs
>>>>> reflecting in the global).
>>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>>> pages, see steal_page() for example.
>>> Or also when pages were allocated. steal_page() ...
>>>
>>>> When called from steal_page() it's wrong to increase the claim, as
>>>> it assumes that the page removed from d->tot_pages is freed, but
>>>> that's not the case.  The domain might end up in a situation where
>>>> the claim is bigger than the available amount of memory.
>>> ... is a case that likely wasn't considered when the feature was added.
>>>
>>> I never really liked this; I'd be quite happy to see it ripped out, as
>>> long as we'd be reasonably certain it isn't in active use by people.
>> What do you mean with 'it' in the above sentence, the whole claim
>> stuff?
> Yes.
>
>>  Or just getting rid of allowing the claim to increase as a
>> result of pages being removed from a domain?
> No.

Alejandro and I discussed this earlier in the week.

The claim infrastructure stuff is critical for a toolstack capable of
doing things in parallel.

However, it is also nonsensical for there to be a remaining claim by the
time domain construction is done.

If creation_finished were a concrete thing, rather than a bodge hacked
into domain_unpause_by_systemcontroller(), it ought to be made to fail
if there were an outstanding claim.  I suggested that we follow through
on a previous suggestion of making it a real hypercall (which is needed
by the encrypted VM crowd too).

~Andrew
Jan Beulich Feb. 26, 2025, 4:42 p.m. UTC | #12
On 26.02.2025 17:34, Andrew Cooper wrote:
> On 26/02/2025 4:06 pm, Jan Beulich wrote:
>> On 26.02.2025 17:04, Roger Pau Monné wrote:
>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>>>> --- a/xen/common/page_alloc.c
>>>>>>>> +++ b/xen/common/page_alloc.c
>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>>>  
>>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>  {
>>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>>>> -
>>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>>>      d->tot_pages += pages;
>>>>>>>>  
>>>>>>>>      /*
>>>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>>>       * domain_set_outstanding_pages below
>>>>>>>>       */
>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>          goto out;
>>>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>>>> (and avoid taking the heap_lock)
>>>>>> Are there callers passing in 0?
>>>>> Not sure, but if there are no callers expected we might add an ASSERT
>>>>> to that effect then.
>>>>>
>>>>>>>>      spin_lock(&heap_lock);
>>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>>>> -    dom_before = d->outstanding_pages;
>>>>>>>> -    dom_after = dom_before - pages;
>>>>>>>> -    BUG_ON(dom_before < 0);
>>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>>>> -    sys_before = outstanding_claims;
>>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>>>> -    BUG_ON(sys_after < 0);
>>>>>>>> -    outstanding_claims = sys_after;
>>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>>>> +    {
>>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>>>> +        d->outstanding_pages = 0;
>>>>>>>> +    } else {
>>>>>>>> +        outstanding_claims -= pages;
>>>>>>>> +        d->outstanding_pages -= pages;
>>>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>>>> should only be set from domain_set_outstanding_pages().
>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>>>
>>>>>>> At best the behavior is inconsistent, because once
>>>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>>>> from domain_adjust_tot_pages().
>>>>>> Right, at that point the claim has run out. While freeing pages with an
>>>>>> active claim means that the claim gets bigger (which naturally needs
>>>>>> reflecting in the global).
>>>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>>>> pages, see steal_page() for example.
>>>> Or also when pages were allocated. steal_page() ...
>>>>
>>>>> When called from steal_page() it's wrong to increase the claim, as
>>>>> it assumes that the page removed from d->tot_pages is freed, but
>>>>> that's not the case.  The domain might end up in a situation where
>>>>> the claim is bigger than the available amount of memory.
>>>> ... is a case that likely wasn't considered when the feature was added.
>>>>
>>>> I never really liked this; I'd be quite happy to see it ripped out, as
>>>> long as we'd be reasonably certain it isn't in active use by people.
>>> What do you mean with 'it' in the above sentence, the whole claim
>>> stuff?
>> Yes.
>>
>>>  Or just getting rid of allowing the claim to increase as a
>>> result of pages being removed from a domain?
>> No.
> 
> Alejandro and I discussed this earlier in the week.
> 
> The claim infrastructure stuff is critical for a toolstack capable of
> doing things in parallel.
> 
> However, it is also nonsensical for there to be a remaining claim by the
> time domain construction is done.

I'm not entirely sure about this. Iirc it was the tmem work where this was
added, and then pretty certainly it was needed also for already running
domains.

> If creation_finished were a concrete thing, rather than a bodge hacked
> into domain_unpause_by_systemcontroller(), it ought to be made to fail
> if there were an outstanding claim.  I suggested that we follow through
> on a previous suggestion of making it a real hypercall (which is needed
> by the encrypted VM crowd too).

Rather than failing we could simply zap the leftover?

Jan
Andrew Cooper Feb. 26, 2025, 4:51 p.m. UTC | #13
On 26/02/2025 4:42 pm, Jan Beulich wrote:
> On 26.02.2025 17:34, Andrew Cooper wrote:
>> On 26/02/2025 4:06 pm, Jan Beulich wrote:
>>> On 26.02.2025 17:04, Roger Pau Monné wrote:
>>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>>>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>>>>> --- a/xen/common/page_alloc.c
>>>>>>>>> +++ b/xen/common/page_alloc.c
>>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>>>>  
>>>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>>  {
>>>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>>>>> -
>>>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>>>>      d->tot_pages += pages;
>>>>>>>>>  
>>>>>>>>>      /*
>>>>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>>>>       * domain_set_outstanding_pages below
>>>>>>>>>       */
>>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>>          goto out;
>>>>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>>>>> (and avoid taking the heap_lock)
>>>>>>> Are there callers passing in 0?
>>>>>> Not sure, but if there are no callers expected we might add an ASSERT
>>>>>> to that effect then.
>>>>>>
>>>>>>>>>      spin_lock(&heap_lock);
>>>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>>>>> -    dom_before = d->outstanding_pages;
>>>>>>>>> -    dom_after = dom_before - pages;
>>>>>>>>> -    BUG_ON(dom_before < 0);
>>>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>>>>> -    sys_before = outstanding_claims;
>>>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>>>>> -    BUG_ON(sys_after < 0);
>>>>>>>>> -    outstanding_claims = sys_after;
>>>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>>>>> +    {
>>>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>>>>> +        d->outstanding_pages = 0;
>>>>>>>>> +    } else {
>>>>>>>>> +        outstanding_claims -= pages;
>>>>>>>>> +        d->outstanding_pages -= pages;
>>>>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>>>>> should only be set from domain_set_outstanding_pages().
>>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>>>>
>>>>>>>> At best the behavior is inconsistent, because once
>>>>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>>>>> from domain_adjust_tot_pages().
>>>>>>> Right, at that point the claim has run out. While freeing pages with an
>>>>>>> active claim means that the claim gets bigger (which naturally needs
>>>>>>> reflecting in the global).
>>>>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>>>>> pages, see steal_page() for example.
>>>>> Or also when pages were allocated. steal_page() ...
>>>>>
>>>>>> When called from steal_page() it's wrong to increase the claim, as
>>>>>> it assumes that the page removed from d->tot_pages is freed, but
>>>>>> that's not the case.  The domain might end up in a situation where
>>>>>> the claim is bigger than the available amount of memory.
>>>>> ... is a case that likely wasn't considered when the feature was added.
>>>>>
>>>>> I never really liked this; I'd be quite happy to see it ripped out, as
>>>>> long as we'd be reasonably certain it isn't in active use by people.
>>>> What do you mean with 'it' in the above sentence, the whole claim
>>>> stuff?
>>> Yes.
>>>
>>>>  Or just getting rid of allowing the claim to increase as a
>>>> result of pages being removed from a domain?
>>> No.
>> Alejandro and I discussed this earlier in the week.
>>
>> The claim infrastructure stuff is critical for a toolstack capable of
>> doing things in parallel.
>>
>> However, it is also nonsensical for there to be a remaining claim by the
>> time domain construction is done.
> I'm not entirely sure about this. Iirc it was the tmem work where this was
> added, and then pretty certainly it was needed also for already running
> domains.

It wasn't TMEM.  It was generally large-memory VMs.

The problem is if you've got 2x 2T VMs booting on a 3T system.

Previously, you'd start building both of them, and minutes later they
both fail because Xen was fully out of memory.

Claim was introduced to atomically reserve the memory you were intending
to build a domain with.

For XenServer, we're working on NUMA fixes, and something that's
important there is to be able to reserve memory on a specific NUMA node
(hence why this is all getting looked at).
>> If creation_finished were a concrete thing, rather than a bodge hacked
>> into domain_unpause_by_systemcontroller(), it ought to be made to fail
>> if there were an outstanding claim.  I suggested that we follow through
>> on a previous suggestion of making it a real hypercall (which is needed
>> by the encrypted VM crowd too).
> Rather than failing we could simply zap the leftover?

Hmm.  Perhaps.

I'd be slightly wary about zapping a claim, but there should only be an
outstanding claim if the toolstack did something wrong.

OTOH, we absolutely definitely do need a real hypercall here at some
point soon.

~Andrew
Alejandro Vallejo Feb. 27, 2025, 2:36 p.m. UTC | #14
On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
> On 24.02.2025 15:49, Alejandro Vallejo wrote:
> > Open question to whoever reviews this...
> > 
> > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >>      spin_lock(&heap_lock);
> >> -    /* adjust domain outstanding pages; may not go negative */
> >> -    dom_before = d->outstanding_pages;
> >> -    dom_after = dom_before - pages;
> >> -    BUG_ON(dom_before < 0);
> >> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >> -    d->outstanding_pages = dom_claimed;
> >> -    /* flag accounting bug if system outstanding_claims would go negative */
> >> -    sys_before = outstanding_claims;
> >> -    sys_after = sys_before - (dom_before - dom_claimed);
> >> -    BUG_ON(sys_after < 0);
> >> -    outstanding_claims = sys_after;
> >> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >> +    if ( pages > 0 && d->outstanding_pages < pages )
> >> +    {
> >> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >> +        outstanding_claims -= d->outstanding_pages;
> >> +        d->outstanding_pages = 0;
> > 
> > While this matches the previous behaviour, do we _really_ want it? It's weird,
> > quirky, and it hard to extend to NUMA-aware claims (which is something in
> > midway through).
> > 
> > Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> > out? Do we even expect this to ever happen this late in the allocation call
> > chain?
>
> This goes back to what a "claim" means. Even without any claim, a domain may
> allocate memory. So a claim having run out doesn't imply allocation has to
> fail.

Hmmm... but that violates the purpose of the claim infra as far as I understand
it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
ability of another domain to start up, even if it succeeded in its own claim.

We might also break the invariant that total claims are strictly >=
total_avail_pages.

I'm somewhat puzzled at the "why" of having separate concepts for max_mem and
claims. I guess it simply grew the way it did. Reinstating sanity would
probably involve making max_mem effectively the claim, but that's a ton of
work I really would rather not do for now.

>
> NUMA-aware claims require more than an adjustment just here, I expect. Tracking
> of claims (certainly the global, maybe also the per-domain value) would likely
> need to become per-node, for example.

A fair amount more, yes. I'm preparing a series on the side to address per-node
claims and it's far more invasive on page_alloc.c. This function was just
sufficiently impossible to read that I felt the urge to send it ahead of time
for my own mental health.

>
> Jan

Cheers,
Alejandro
Alejandro Vallejo Feb. 27, 2025, 2:39 p.m. UTC | #15
On Wed Feb 26, 2025 at 4:51 PM GMT, Andrew Cooper wrote:
> On 26/02/2025 4:42 pm, Jan Beulich wrote:
> > On 26.02.2025 17:34, Andrew Cooper wrote:
> >> On 26/02/2025 4:06 pm, Jan Beulich wrote:
> >>> On 26.02.2025 17:04, Roger Pau Monné wrote:
> >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
> >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
> >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
> >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> >>>>>>>>> --- a/xen/common/page_alloc.c
> >>>>>>>>> +++ b/xen/common/page_alloc.c
> >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> >>>>>>>>>  
> >>>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>>>>>>  {
> >>>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> >>>>>>>>> -
> >>>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> >>>>>>>>>      d->tot_pages += pages;
> >>>>>>>>>  
> >>>>>>>>>      /*
> >>>>>>>>> -     * can test d->claimed_pages race-free because it can only change
> >>>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
> >>>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
> >>>>>>>>>       * domain_set_outstanding_pages below
> >>>>>>>>>       */
> >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>>>>>>          goto out;
> >>>>>>>> I think you can probably short-circuit the logic below if pages == 0?
> >>>>>>>> (and avoid taking the heap_lock)
> >>>>>>> Are there callers passing in 0?
> >>>>>> Not sure, but if there are no callers expected we might add an ASSERT
> >>>>>> to that effect then.
> >>>>>>
> >>>>>>>>>      spin_lock(&heap_lock);
> >>>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>>>>>>> -    dom_before = d->outstanding_pages;
> >>>>>>>>> -    dom_after = dom_before - pages;
> >>>>>>>>> -    BUG_ON(dom_before < 0);
> >>>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>>>>>>> -    d->outstanding_pages = dom_claimed;
> >>>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>>>>>>> -    sys_before = outstanding_claims;
> >>>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>>>>>>> -    BUG_ON(sys_after < 0);
> >>>>>>>>> -    outstanding_claims = sys_after;
> >>>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>>>>>>> +    {
> >>>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>>>>>>> +        outstanding_claims -= d->outstanding_pages;
> >>>>>>>>> +        d->outstanding_pages = 0;
> >>>>>>>>> +    } else {
> >>>>>>>>> +        outstanding_claims -= pages;
> >>>>>>>>> +        d->outstanding_pages -= pages;
> >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify
> >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values
> >>>>>>>> should only be set from domain_set_outstanding_pages().
> >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
> >>>>>>>> increase either outstanding_claims or d->outstanding_pages.
> >>>>>>>>
> >>>>>>>> At best the behavior is inconsistent, because once
> >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification
> >>>>>>>> from domain_adjust_tot_pages().
> >>>>>>> Right, at that point the claim has run out. While freeing pages with an
> >>>>>>> active claim means that the claim gets bigger (which naturally needs
> >>>>>>> reflecting in the global).
> >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing
> >>>>>> pages, see steal_page() for example.
> >>>>> Or also when pages were allocated. steal_page() ...
> >>>>>
> >>>>>> When called from steal_page() it's wrong to increase the claim, as
> >>>>>> it assumes that the page removed from d->tot_pages is freed, but
> >>>>>> that's not the case.  The domain might end up in a situation where
> >>>>>> the claim is bigger than the available amount of memory.
> >>>>> ... is a case that likely wasn't considered when the feature was added.
> >>>>>
> >>>>> I never really liked this; I'd be quite happy to see it ripped out, as
> >>>>> long as we'd be reasonably certain it isn't in active use by people.
> >>>> What do you mean with 'it' in the above sentence, the whole claim
> >>>> stuff?
> >>> Yes.
> >>>
> >>>>  Or just getting rid of allowing the claim to increase as a
> >>>> result of pages being removed from a domain?
> >>> No.
> >> Alejandro and I discussed this earlier in the week.
> >>
> >> The claim infrastructure stuff is critical for a toolstack capable of
> >> doing things in parallel.
> >>
> >> However, it is also nonsensical for there to be a remaining claim by the
> >> time domain construction is done.
> > I'm not entirely sure about this. Iirc it was the tmem work where this was
> > added, and then pretty certainly it was needed also for already running
> > domains.
>
> It wasn't TMEM.  It was generally large-memory VMs.
>
> The problem is if you've got 2x 2T VMs booting on a 3T system.
>
> Previously, you'd start building both of them, and minutes later they
> both fail because Xen was fully out of memory.
>
> Claim was introduced to atomically reserve the memory you were intending
> to build a domain with.
>
> For XenServer, we're working on NUMA fixes, and something that's
> important there is to be able to reserve memory on a specific NUMA node
> (hence why this is all getting looked at).
> >> If creation_finished were a concrete thing, rather than a bodge hacked
> >> into domain_unpause_by_systemcontroller(), it ought to be made to fail
> >> if there were an outstanding claim.  I suggested that we follow through
> >> on a previous suggestion of making it a real hypercall (which is needed
> >> by the encrypted VM crowd too).
> > Rather than failing we could simply zap the leftover?
>
> Hmm.  Perhaps.
>
> I'd be slightly wary about zapping a claim, but there should only be an
> outstanding claim if the toolstack did something wrong.
>
> OTOH, we absolutely definitely do need a real hypercall here at some
> point soon.
>
> ~Andrew

It should probably be removed at the end. Otherwise we're effectively leaking
all claimed but non-used memory for the lifetime of the domain.

Cheers,
Alejandro
Alejandro Vallejo Feb. 27, 2025, 2:50 p.m. UTC | #16
On Wed Feb 26, 2025 at 2:28 PM GMT, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> > On 26.02.2025 14:56, Roger Pau Monné wrote:
> > > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> > >> --- a/xen/common/page_alloc.c
> > >> +++ b/xen/common/page_alloc.c
> > >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> > >>  
> > >>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> > >>  {
> > >> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> > >> -
> > >>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> > >>      d->tot_pages += pages;
> > >>  
> > >>      /*
> > >> -     * can test d->claimed_pages race-free because it can only change
> > >> +     * can test d->outstanding_pages race-free because it can only change
> > >>       * if d->page_alloc_lock and heap_lock are both held, see also
> > >>       * domain_set_outstanding_pages below
> > >>       */
> > >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> > >>          goto out;
> > > 
> > > I think you can probably short-circuit the logic below if pages == 0?
> > > (and avoid taking the heap_lock)
> > 
> > Are there callers passing in 0?
>
> Not sure, but if there are no callers expected we might add an ASSERT
> to that effect then.
>
> > >>      spin_lock(&heap_lock);
> > >> -    /* adjust domain outstanding pages; may not go negative */
> > >> -    dom_before = d->outstanding_pages;
> > >> -    dom_after = dom_before - pages;
> > >> -    BUG_ON(dom_before < 0);
> > >> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> > >> -    d->outstanding_pages = dom_claimed;
> > >> -    /* flag accounting bug if system outstanding_claims would go negative */
> > >> -    sys_before = outstanding_claims;
> > >> -    sys_after = sys_before - (dom_before - dom_claimed);
> > >> -    BUG_ON(sys_after < 0);
> > >> -    outstanding_claims = sys_after;
> > >> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> > >> +    if ( pages > 0 && d->outstanding_pages < pages )
> > >> +    {
> > >> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> > >> +        outstanding_claims -= d->outstanding_pages;
> > >> +        d->outstanding_pages = 0;
> > >> +    } else {
> > >> +        outstanding_claims -= pages;
> > >> +        d->outstanding_pages -= pages;
> > > 
> > > I wonder if it's intentional for a pages < 0 value to modify
> > > outstanding_claims and d->outstanding_pages, I think those values
> > > should only be set from domain_set_outstanding_pages().
> > > domain_adjust_tot_pages() should only decrease the value, but never
> > > increase either outstanding_claims or d->outstanding_pages.
> > > 
> > > At best the behavior is inconsistent, because once
> > > d->outstanding_pages reaches 0 there will be no further modification
> > > from domain_adjust_tot_pages().
> > 
> > Right, at that point the claim has run out. While freeing pages with an
> > active claim means that the claim gets bigger (which naturally needs
> > reflecting in the global).
>
> domain_adjust_tot_pages() is not exclusively called when freeing
> pages, see steal_page() for example.
>
> When called from steal_page() it's wrong to increase the claim, as
> it assumes that the page removed from d->tot_pages is freed, but
> that's not the case.  The domain might end up in a situation where
> the claim is bigger than the available amount of memory.
>
> Thanks, Roger.

This is what I meant by my initial reply questioning the logic itself.

It's all very dubious with memory_exchange and makes very little sense on the
tentative code I have for per-node claims.

I'd be quite happy to put an early exit before the spin_lock on pages <= 0.
That also covers your initial comment and prevents claims from growing after a
domain started running if it didn't happen to consume all of them.

Is anyone opposed?

Cheers,
Alejandro
Alejandro Vallejo Feb. 27, 2025, 2:59 p.m. UTC | #17
On Wed Feb 26, 2025 at 2:02 PM GMT, Jan Beulich wrote:
> On 24.02.2025 14:27, Alejandro Vallejo wrote:
> > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >          goto out;
> >  
> >      spin_lock(&heap_lock);
> > -    /* adjust domain outstanding pages; may not go negative */
> > -    dom_before = d->outstanding_pages;
> > -    dom_after = dom_before - pages;
> > -    BUG_ON(dom_before < 0);
> > -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> > -    d->outstanding_pages = dom_claimed;
> > -    /* flag accounting bug if system outstanding_claims would go negative */
> > -    sys_before = outstanding_claims;
> > -    sys_after = sys_before - (dom_before - dom_claimed);
> > -    BUG_ON(sys_after < 0);
> > -    outstanding_claims = sys_after;
> > +    BUG_ON(outstanding_claims < d->outstanding_pages);
> > +    if ( pages > 0 && d->outstanding_pages < pages )
>
> The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity,
> after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite
> right.

d->outstanding pages is unsigned, but pages isn't.

It was originally like that, but I then got concerned about 32bit machines,
where you'd be comparing a signed and an unsigned integer of the same
not-very-large width. That seems like dangerous terrains if the unsigned number
grows large enough.

TL;DR: It's there for clarity and paranoia. Even if the overflowing into bit 31
would be rare in such a system.

>
> > +    {
> > +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> > +        outstanding_claims -= d->outstanding_pages;
> > +        d->outstanding_pages = 0;
> > +    } else {
>
> Nit: Braces on their own lines please.

Ack.

>
> In any event - yes, this reads quite a bit easier after the adjustment.
>
> With the adjustments (happy to make while committing, so long as you agree)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks. I'd probably like to hold off and send a v2 if you're fine with the
adjustment I answered Roger with (returning ealy on pages <= 0, so claims are
never increased on free).

>
> Jan
>

Cheers,
Alejandro
diff mbox series

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 1bf070c8c5df..4306753f0bd2 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -490,13 +490,11 @@  static long outstanding_claims; /* total outstanding claims by all domains */
 
 unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
 {
-    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
-
     ASSERT(rspin_is_locked(&d->page_alloc_lock));
     d->tot_pages += pages;
 
     /*
-     * can test d->claimed_pages race-free because it can only change
+     * can test d->outstanding_pages race-free because it can only change
      * if d->page_alloc_lock and heap_lock are both held, see also
      * domain_set_outstanding_pages below
      */
@@ -504,17 +502,16 @@  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
         goto out;
 
     spin_lock(&heap_lock);
-    /* adjust domain outstanding pages; may not go negative */
-    dom_before = d->outstanding_pages;
-    dom_after = dom_before - pages;
-    BUG_ON(dom_before < 0);
-    dom_claimed = dom_after < 0 ? 0 : dom_after;
-    d->outstanding_pages = dom_claimed;
-    /* flag accounting bug if system outstanding_claims would go negative */
-    sys_before = outstanding_claims;
-    sys_after = sys_before - (dom_before - dom_claimed);
-    BUG_ON(sys_after < 0);
-    outstanding_claims = sys_after;
+    BUG_ON(outstanding_claims < d->outstanding_pages);
+    if ( pages > 0 && d->outstanding_pages < pages )
+    {
+        /* `pages` exceeds the domain's outstanding count. Zero it out. */
+        outstanding_claims -= d->outstanding_pages;
+        d->outstanding_pages = 0;
+    } else {
+        outstanding_claims -= pages;
+        d->outstanding_pages -= pages;
+    }
     spin_unlock(&heap_lock);
 
 out: