Message ID | 20250224132724.9074-1-alejandro.vallejo@cloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | xen/page_alloc: Simplify domain_adjust_tot_pages | expand |
Open question to whoever reviews this... On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > spin_lock(&heap_lock); > - /* adjust domain outstanding pages; may not go negative */ > - dom_before = d->outstanding_pages; > - dom_after = dom_before - pages; > - BUG_ON(dom_before < 0); > - dom_claimed = dom_after < 0 ? 0 : dom_after; > - d->outstanding_pages = dom_claimed; > - /* flag accounting bug if system outstanding_claims would go negative */ > - sys_before = outstanding_claims; > - sys_after = sys_before - (dom_before - dom_claimed); > - BUG_ON(sys_after < 0); > - outstanding_claims = sys_after; > + BUG_ON(outstanding_claims < d->outstanding_pages); > + if ( pages > 0 && d->outstanding_pages < pages ) > + { > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > + outstanding_claims -= d->outstanding_pages; > + d->outstanding_pages = 0; While this matches the previous behaviour, do we _really_ want it? It's weird, quirky, and it hard to extend to NUMA-aware claims (which is something in midway through). Wouldn't it make sense to fail the allocation (earlier) if the claim has run out? Do we even expect this to ever happen this late in the allocation call chain? > + } else { > + outstanding_claims -= pages; > + d->outstanding_pages -= pages; > + } > spin_unlock(&heap_lock); > > out: Cheers, Alejandro
On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > The logic has too many levels of indirection and it's very hard to > understand it its current form. Split it between the corner case where > the adjustment is bigger than the current claim and the rest to avoid 5 > auxiliary variables. > > While at it, fix incorrect field name in nearby comment. > > Not a functional change (although by its nature it might not seem > immediately obvious at first). > > Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> > --- > xen/common/page_alloc.c | 25 +++++++++++-------------- > 1 file changed, 11 insertions(+), 14 deletions(-) > > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c > index 1bf070c8c5df..4306753f0bd2 100644 > --- a/xen/common/page_alloc.c > +++ b/xen/common/page_alloc.c > @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > > unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > { > - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > - > ASSERT(rspin_is_locked(&d->page_alloc_lock)); > d->tot_pages += pages; > > /* > - * can test d->claimed_pages race-free because it can only change > + * can test d->outstanding_pages race-free because it can only change > * if d->page_alloc_lock and heap_lock are both held, see also > * domain_set_outstanding_pages below > */ > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > goto out; I think you can probably short-circuit the logic below if pages == 0? (and avoid taking the heap_lock) > > spin_lock(&heap_lock); > - /* adjust domain outstanding pages; may not go negative */ > - dom_before = d->outstanding_pages; > - dom_after = dom_before - pages; > - BUG_ON(dom_before < 0); > - dom_claimed = dom_after < 0 ? 0 : dom_after; > - d->outstanding_pages = dom_claimed; > - /* flag accounting bug if system outstanding_claims would go negative */ > - sys_before = outstanding_claims; > - sys_after = sys_before - (dom_before - dom_claimed); > - BUG_ON(sys_after < 0); > - outstanding_claims = sys_after; > + BUG_ON(outstanding_claims < d->outstanding_pages); > + if ( pages > 0 && d->outstanding_pages < pages ) > + { > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > + outstanding_claims -= d->outstanding_pages; > + d->outstanding_pages = 0; > + } else { > + outstanding_claims -= pages; > + d->outstanding_pages -= pages; I wonder if it's intentional for a pages < 0 value to modify outstanding_claims and d->outstanding_pages, I think those values should only be set from domain_set_outstanding_pages(). domain_adjust_tot_pages() should only decrease the value, but never increase either outstanding_claims or d->outstanding_pages. At best the behavior is inconsistent, because once d->outstanding_pages reaches 0 there will be no further modification from domain_adjust_tot_pages(). I think it would be best if outstanding_claims and d->outstanding_pages was only modified if pages > 0. Thanks, Roger.
On 24.02.2025 14:27, Alejandro Vallejo wrote: > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > goto out; > > spin_lock(&heap_lock); > - /* adjust domain outstanding pages; may not go negative */ > - dom_before = d->outstanding_pages; > - dom_after = dom_before - pages; > - BUG_ON(dom_before < 0); > - dom_claimed = dom_after < 0 ? 0 : dom_after; > - d->outstanding_pages = dom_claimed; > - /* flag accounting bug if system outstanding_claims would go negative */ > - sys_before = outstanding_claims; > - sys_after = sys_before - (dom_before - dom_claimed); > - BUG_ON(sys_after < 0); > - outstanding_claims = sys_after; > + BUG_ON(outstanding_claims < d->outstanding_pages); > + if ( pages > 0 && d->outstanding_pages < pages ) The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity, after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite right. > + { > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > + outstanding_claims -= d->outstanding_pages; > + d->outstanding_pages = 0; > + } else { Nit: Braces on their own lines please. In any event - yes, this reads quite a bit easier after the adjustment. With the adjustments (happy to make while committing, so long as you agree) Reviewed-by: Jan Beulich <jbeulich@suse.com> Jan > + outstanding_claims -= pages; > + d->outstanding_pages -= pages; > + } > spin_unlock(&heap_lock); > > out:
On 24.02.2025 15:49, Alejandro Vallejo wrote: > Open question to whoever reviews this... > > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: >> spin_lock(&heap_lock); >> - /* adjust domain outstanding pages; may not go negative */ >> - dom_before = d->outstanding_pages; >> - dom_after = dom_before - pages; >> - BUG_ON(dom_before < 0); >> - dom_claimed = dom_after < 0 ? 0 : dom_after; >> - d->outstanding_pages = dom_claimed; >> - /* flag accounting bug if system outstanding_claims would go negative */ >> - sys_before = outstanding_claims; >> - sys_after = sys_before - (dom_before - dom_claimed); >> - BUG_ON(sys_after < 0); >> - outstanding_claims = sys_after; >> + BUG_ON(outstanding_claims < d->outstanding_pages); >> + if ( pages > 0 && d->outstanding_pages < pages ) >> + { >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >> + outstanding_claims -= d->outstanding_pages; >> + d->outstanding_pages = 0; > > While this matches the previous behaviour, do we _really_ want it? It's weird, > quirky, and it hard to extend to NUMA-aware claims (which is something in > midway through). > > Wouldn't it make sense to fail the allocation (earlier) if the claim has run > out? Do we even expect this to ever happen this late in the allocation call > chain? This goes back to what a "claim" means. Even without any claim, a domain may allocate memory. So a claim having run out doesn't imply allocation has to fail. NUMA-aware claims require more than an adjustment just here, I expect. Tracking of claims (certainly the global, maybe also the per-domain value) would likely need to become per-node, for example. Jan
On 26.02.2025 14:56, Roger Pau Monné wrote: > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >> --- a/xen/common/page_alloc.c >> +++ b/xen/common/page_alloc.c >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >> >> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >> { >> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >> - >> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >> d->tot_pages += pages; >> >> /* >> - * can test d->claimed_pages race-free because it can only change >> + * can test d->outstanding_pages race-free because it can only change >> * if d->page_alloc_lock and heap_lock are both held, see also >> * domain_set_outstanding_pages below >> */ >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >> goto out; > > I think you can probably short-circuit the logic below if pages == 0? > (and avoid taking the heap_lock) Are there callers passing in 0? >> spin_lock(&heap_lock); >> - /* adjust domain outstanding pages; may not go negative */ >> - dom_before = d->outstanding_pages; >> - dom_after = dom_before - pages; >> - BUG_ON(dom_before < 0); >> - dom_claimed = dom_after < 0 ? 0 : dom_after; >> - d->outstanding_pages = dom_claimed; >> - /* flag accounting bug if system outstanding_claims would go negative */ >> - sys_before = outstanding_claims; >> - sys_after = sys_before - (dom_before - dom_claimed); >> - BUG_ON(sys_after < 0); >> - outstanding_claims = sys_after; >> + BUG_ON(outstanding_claims < d->outstanding_pages); >> + if ( pages > 0 && d->outstanding_pages < pages ) >> + { >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >> + outstanding_claims -= d->outstanding_pages; >> + d->outstanding_pages = 0; >> + } else { >> + outstanding_claims -= pages; >> + d->outstanding_pages -= pages; > > I wonder if it's intentional for a pages < 0 value to modify > outstanding_claims and d->outstanding_pages, I think those values > should only be set from domain_set_outstanding_pages(). > domain_adjust_tot_pages() should only decrease the value, but never > increase either outstanding_claims or d->outstanding_pages. > > At best the behavior is inconsistent, because once > d->outstanding_pages reaches 0 there will be no further modification > from domain_adjust_tot_pages(). Right, at that point the claim has run out. While freeing pages with an active claim means that the claim gets bigger (which naturally needs reflecting in the global). Jan
On Mon, Feb 24, 2025 at 02:49:48PM +0000, Alejandro Vallejo wrote: > Open question to whoever reviews this... > > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > > spin_lock(&heap_lock); > > - /* adjust domain outstanding pages; may not go negative */ > > - dom_before = d->outstanding_pages; > > - dom_after = dom_before - pages; > > - BUG_ON(dom_before < 0); > > - dom_claimed = dom_after < 0 ? 0 : dom_after; > > - d->outstanding_pages = dom_claimed; > > - /* flag accounting bug if system outstanding_claims would go negative */ > > - sys_before = outstanding_claims; > > - sys_after = sys_before - (dom_before - dom_claimed); > > - BUG_ON(sys_after < 0); > > - outstanding_claims = sys_after; > > + BUG_ON(outstanding_claims < d->outstanding_pages); > > + if ( pages > 0 && d->outstanding_pages < pages ) > > + { > > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > > + outstanding_claims -= d->outstanding_pages; > > + d->outstanding_pages = 0; > > While this matches the previous behaviour, do we _really_ want it? It's weird, > quirky, and it hard to extend to NUMA-aware claims (which is something in > midway through). > > Wouldn't it make sense to fail the allocation (earlier) if the claim has run > out? Do we even expect this to ever happen this late in the allocation call > chain? I'm unsure. This is the case where more memory than initially claimed has been allocated, but by the time domain_adjust_tot_pages() gets called the memory has already been allocated, so it's kind of unhelpful to fail by then. I think any caller that requests more memory than what has been initially claimed for the domain should be prepared to deal with such allocation failing. This quirky handling is very likely a workaround for the miscellaneous differences between the memory accounted by the toolstack for a guest vs the memory really used by such guest. I bet if you limit a guest to strictly only allocate up to d->outstanding_pages domain creation will fail. In general the toolstack memory calculations are not fully accurate, see for example how vmx_alloc_vlapic_mapping() allocates a domheap page which very likely the toolstack won't have accounted for. There are likely other examples that would possibly break the accounting done by the toolstack. Thanks, Roger.
On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > On 26.02.2025 14:56, Roger Pau Monné wrote: > > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >> --- a/xen/common/page_alloc.c > >> +++ b/xen/common/page_alloc.c > >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > >> > >> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >> { > >> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >> - > >> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >> d->tot_pages += pages; > >> > >> /* > >> - * can test d->claimed_pages race-free because it can only change > >> + * can test d->outstanding_pages race-free because it can only change > >> * if d->page_alloc_lock and heap_lock are both held, see also > >> * domain_set_outstanding_pages below > >> */ > >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >> goto out; > > > > I think you can probably short-circuit the logic below if pages == 0? > > (and avoid taking the heap_lock) > > Are there callers passing in 0? Not sure, but if there are no callers expected we might add an ASSERT to that effect then. > >> spin_lock(&heap_lock); > >> - /* adjust domain outstanding pages; may not go negative */ > >> - dom_before = d->outstanding_pages; > >> - dom_after = dom_before - pages; > >> - BUG_ON(dom_before < 0); > >> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >> - d->outstanding_pages = dom_claimed; > >> - /* flag accounting bug if system outstanding_claims would go negative */ > >> - sys_before = outstanding_claims; > >> - sys_after = sys_before - (dom_before - dom_claimed); > >> - BUG_ON(sys_after < 0); > >> - outstanding_claims = sys_after; > >> + BUG_ON(outstanding_claims < d->outstanding_pages); > >> + if ( pages > 0 && d->outstanding_pages < pages ) > >> + { > >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >> + outstanding_claims -= d->outstanding_pages; > >> + d->outstanding_pages = 0; > >> + } else { > >> + outstanding_claims -= pages; > >> + d->outstanding_pages -= pages; > > > > I wonder if it's intentional for a pages < 0 value to modify > > outstanding_claims and d->outstanding_pages, I think those values > > should only be set from domain_set_outstanding_pages(). > > domain_adjust_tot_pages() should only decrease the value, but never > > increase either outstanding_claims or d->outstanding_pages. > > > > At best the behavior is inconsistent, because once > > d->outstanding_pages reaches 0 there will be no further modification > > from domain_adjust_tot_pages(). > > Right, at that point the claim has run out. While freeing pages with an > active claim means that the claim gets bigger (which naturally needs > reflecting in the global). domain_adjust_tot_pages() is not exclusively called when freeing pages, see steal_page() for example. When called from steal_page() it's wrong to increase the claim, as it assumes that the page removed from d->tot_pages is freed, but that's not the case. The domain might end up in a situation where the claim is bigger than the available amount of memory. Thanks, Roger.
On 26.02.2025 15:28, Roger Pau Monné wrote: > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >> On 26.02.2025 14:56, Roger Pau Monné wrote: >>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>> --- a/xen/common/page_alloc.c >>>> +++ b/xen/common/page_alloc.c >>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>> >>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>> { >>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>> - >>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>> d->tot_pages += pages; >>>> >>>> /* >>>> - * can test d->claimed_pages race-free because it can only change >>>> + * can test d->outstanding_pages race-free because it can only change >>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>> * domain_set_outstanding_pages below >>>> */ >>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>> goto out; >>> >>> I think you can probably short-circuit the logic below if pages == 0? >>> (and avoid taking the heap_lock) >> >> Are there callers passing in 0? > > Not sure, but if there are no callers expected we might add an ASSERT > to that effect then. > >>>> spin_lock(&heap_lock); >>>> - /* adjust domain outstanding pages; may not go negative */ >>>> - dom_before = d->outstanding_pages; >>>> - dom_after = dom_before - pages; >>>> - BUG_ON(dom_before < 0); >>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>> - d->outstanding_pages = dom_claimed; >>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>> - sys_before = outstanding_claims; >>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>> - BUG_ON(sys_after < 0); >>>> - outstanding_claims = sys_after; >>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>> + { >>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>> + outstanding_claims -= d->outstanding_pages; >>>> + d->outstanding_pages = 0; >>>> + } else { >>>> + outstanding_claims -= pages; >>>> + d->outstanding_pages -= pages; >>> >>> I wonder if it's intentional for a pages < 0 value to modify >>> outstanding_claims and d->outstanding_pages, I think those values >>> should only be set from domain_set_outstanding_pages(). >>> domain_adjust_tot_pages() should only decrease the value, but never >>> increase either outstanding_claims or d->outstanding_pages. >>> >>> At best the behavior is inconsistent, because once >>> d->outstanding_pages reaches 0 there will be no further modification >>> from domain_adjust_tot_pages(). >> >> Right, at that point the claim has run out. While freeing pages with an >> active claim means that the claim gets bigger (which naturally needs >> reflecting in the global). > > domain_adjust_tot_pages() is not exclusively called when freeing > pages, see steal_page() for example. Or also when pages were allocated. steal_page() ... > When called from steal_page() it's wrong to increase the claim, as > it assumes that the page removed from d->tot_pages is freed, but > that's not the case. The domain might end up in a situation where > the claim is bigger than the available amount of memory. ... is a case that likely wasn't considered when the feature was added. I never really liked this; I'd be quite happy to see it ripped out, as long as we'd be reasonably certain it isn't in active use by people. Jan
On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: > On 26.02.2025 15:28, Roger Pau Monné wrote: > > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > >> On 26.02.2025 14:56, Roger Pau Monné wrote: > >>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >>>> --- a/xen/common/page_alloc.c > >>>> +++ b/xen/common/page_alloc.c > >>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > >>>> > >>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>> { > >>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >>>> - > >>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >>>> d->tot_pages += pages; > >>>> > >>>> /* > >>>> - * can test d->claimed_pages race-free because it can only change > >>>> + * can test d->outstanding_pages race-free because it can only change > >>>> * if d->page_alloc_lock and heap_lock are both held, see also > >>>> * domain_set_outstanding_pages below > >>>> */ > >>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>> goto out; > >>> > >>> I think you can probably short-circuit the logic below if pages == 0? > >>> (and avoid taking the heap_lock) > >> > >> Are there callers passing in 0? > > > > Not sure, but if there are no callers expected we might add an ASSERT > > to that effect then. > > > >>>> spin_lock(&heap_lock); > >>>> - /* adjust domain outstanding pages; may not go negative */ > >>>> - dom_before = d->outstanding_pages; > >>>> - dom_after = dom_before - pages; > >>>> - BUG_ON(dom_before < 0); > >>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>> - d->outstanding_pages = dom_claimed; > >>>> - /* flag accounting bug if system outstanding_claims would go negative */ > >>>> - sys_before = outstanding_claims; > >>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>> - BUG_ON(sys_after < 0); > >>>> - outstanding_claims = sys_after; > >>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>> + { > >>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >>>> + outstanding_claims -= d->outstanding_pages; > >>>> + d->outstanding_pages = 0; > >>>> + } else { > >>>> + outstanding_claims -= pages; > >>>> + d->outstanding_pages -= pages; > >>> > >>> I wonder if it's intentional for a pages < 0 value to modify > >>> outstanding_claims and d->outstanding_pages, I think those values > >>> should only be set from domain_set_outstanding_pages(). > >>> domain_adjust_tot_pages() should only decrease the value, but never > >>> increase either outstanding_claims or d->outstanding_pages. > >>> > >>> At best the behavior is inconsistent, because once > >>> d->outstanding_pages reaches 0 there will be no further modification > >>> from domain_adjust_tot_pages(). > >> > >> Right, at that point the claim has run out. While freeing pages with an > >> active claim means that the claim gets bigger (which naturally needs > >> reflecting in the global). > > > > domain_adjust_tot_pages() is not exclusively called when freeing > > pages, see steal_page() for example. > > Or also when pages were allocated. steal_page() ... > > > When called from steal_page() it's wrong to increase the claim, as > > it assumes that the page removed from d->tot_pages is freed, but > > that's not the case. The domain might end up in a situation where > > the claim is bigger than the available amount of memory. > > ... is a case that likely wasn't considered when the feature was added. > > I never really liked this; I'd be quite happy to see it ripped out, as > long as we'd be reasonably certain it isn't in active use by people. What do you mean with 'it' in the above sentence, the whole claim stuff? Or just getting rid of allowing the claim to increase as a result of pages being removed from a domain? Thanks, Roger.
On 26.02.2025 17:04, Roger Pau Monné wrote: > On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >> On 26.02.2025 15:28, Roger Pau Monné wrote: >>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>> --- a/xen/common/page_alloc.c >>>>>> +++ b/xen/common/page_alloc.c >>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>> >>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>> { >>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>> - >>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>> d->tot_pages += pages; >>>>>> >>>>>> /* >>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>> * domain_set_outstanding_pages below >>>>>> */ >>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>> goto out; >>>>> >>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>> (and avoid taking the heap_lock) >>>> >>>> Are there callers passing in 0? >>> >>> Not sure, but if there are no callers expected we might add an ASSERT >>> to that effect then. >>> >>>>>> spin_lock(&heap_lock); >>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>> - dom_before = d->outstanding_pages; >>>>>> - dom_after = dom_before - pages; >>>>>> - BUG_ON(dom_before < 0); >>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>> - d->outstanding_pages = dom_claimed; >>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>> - sys_before = outstanding_claims; >>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>> - BUG_ON(sys_after < 0); >>>>>> - outstanding_claims = sys_after; >>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>> + { >>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>> + d->outstanding_pages = 0; >>>>>> + } else { >>>>>> + outstanding_claims -= pages; >>>>>> + d->outstanding_pages -= pages; >>>>> >>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>> should only be set from domain_set_outstanding_pages(). >>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>> increase either outstanding_claims or d->outstanding_pages. >>>>> >>>>> At best the behavior is inconsistent, because once >>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>> from domain_adjust_tot_pages(). >>>> >>>> Right, at that point the claim has run out. While freeing pages with an >>>> active claim means that the claim gets bigger (which naturally needs >>>> reflecting in the global). >>> >>> domain_adjust_tot_pages() is not exclusively called when freeing >>> pages, see steal_page() for example. >> >> Or also when pages were allocated. steal_page() ... >> >>> When called from steal_page() it's wrong to increase the claim, as >>> it assumes that the page removed from d->tot_pages is freed, but >>> that's not the case. The domain might end up in a situation where >>> the claim is bigger than the available amount of memory. >> >> ... is a case that likely wasn't considered when the feature was added. >> >> I never really liked this; I'd be quite happy to see it ripped out, as >> long as we'd be reasonably certain it isn't in active use by people. > > What do you mean with 'it' in the above sentence, the whole claim > stuff? Yes. > Or just getting rid of allowing the claim to increase as a > result of pages being removed from a domain? No. Jan
On 26/02/2025 4:06 pm, Jan Beulich wrote: > On 26.02.2025 17:04, Roger Pau Monné wrote: >> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >>> On 26.02.2025 15:28, Roger Pau Monné wrote: >>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>>> --- a/xen/common/page_alloc.c >>>>>>> +++ b/xen/common/page_alloc.c >>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>>> >>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>> { >>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>>> - >>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>>> d->tot_pages += pages; >>>>>>> >>>>>>> /* >>>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>>> * domain_set_outstanding_pages below >>>>>>> */ >>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>> goto out; >>>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>>> (and avoid taking the heap_lock) >>>>> Are there callers passing in 0? >>>> Not sure, but if there are no callers expected we might add an ASSERT >>>> to that effect then. >>>> >>>>>>> spin_lock(&heap_lock); >>>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>>> - dom_before = d->outstanding_pages; >>>>>>> - dom_after = dom_before - pages; >>>>>>> - BUG_ON(dom_before < 0); >>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>>> - d->outstanding_pages = dom_claimed; >>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>>> - sys_before = outstanding_claims; >>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>>> - BUG_ON(sys_after < 0); >>>>>>> - outstanding_claims = sys_after; >>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>>> + { >>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>>> + d->outstanding_pages = 0; >>>>>>> + } else { >>>>>>> + outstanding_claims -= pages; >>>>>>> + d->outstanding_pages -= pages; >>>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>>> should only be set from domain_set_outstanding_pages(). >>>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>>> increase either outstanding_claims or d->outstanding_pages. >>>>>> >>>>>> At best the behavior is inconsistent, because once >>>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>>> from domain_adjust_tot_pages(). >>>>> Right, at that point the claim has run out. While freeing pages with an >>>>> active claim means that the claim gets bigger (which naturally needs >>>>> reflecting in the global). >>>> domain_adjust_tot_pages() is not exclusively called when freeing >>>> pages, see steal_page() for example. >>> Or also when pages were allocated. steal_page() ... >>> >>>> When called from steal_page() it's wrong to increase the claim, as >>>> it assumes that the page removed from d->tot_pages is freed, but >>>> that's not the case. The domain might end up in a situation where >>>> the claim is bigger than the available amount of memory. >>> ... is a case that likely wasn't considered when the feature was added. >>> >>> I never really liked this; I'd be quite happy to see it ripped out, as >>> long as we'd be reasonably certain it isn't in active use by people. >> What do you mean with 'it' in the above sentence, the whole claim >> stuff? > Yes. > >> Or just getting rid of allowing the claim to increase as a >> result of pages being removed from a domain? > No. Alejandro and I discussed this earlier in the week. The claim infrastructure stuff is critical for a toolstack capable of doing things in parallel. However, it is also nonsensical for there to be a remaining claim by the time domain construction is done. If creation_finished were a concrete thing, rather than a bodge hacked into domain_unpause_by_systemcontroller(), it ought to be made to fail if there were an outstanding claim. I suggested that we follow through on a previous suggestion of making it a real hypercall (which is needed by the encrypted VM crowd too). ~Andrew
On 26.02.2025 17:34, Andrew Cooper wrote: > On 26/02/2025 4:06 pm, Jan Beulich wrote: >> On 26.02.2025 17:04, Roger Pau Monné wrote: >>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >>>> On 26.02.2025 15:28, Roger Pau Monné wrote: >>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>>>> --- a/xen/common/page_alloc.c >>>>>>>> +++ b/xen/common/page_alloc.c >>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>>>> >>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>> { >>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>>>> - >>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>>>> d->tot_pages += pages; >>>>>>>> >>>>>>>> /* >>>>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>>>> * domain_set_outstanding_pages below >>>>>>>> */ >>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>> goto out; >>>>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>>>> (and avoid taking the heap_lock) >>>>>> Are there callers passing in 0? >>>>> Not sure, but if there are no callers expected we might add an ASSERT >>>>> to that effect then. >>>>> >>>>>>>> spin_lock(&heap_lock); >>>>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>>>> - dom_before = d->outstanding_pages; >>>>>>>> - dom_after = dom_before - pages; >>>>>>>> - BUG_ON(dom_before < 0); >>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>>>> - d->outstanding_pages = dom_claimed; >>>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>>>> - sys_before = outstanding_claims; >>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>>>> - BUG_ON(sys_after < 0); >>>>>>>> - outstanding_claims = sys_after; >>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>>>> + { >>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>>>> + d->outstanding_pages = 0; >>>>>>>> + } else { >>>>>>>> + outstanding_claims -= pages; >>>>>>>> + d->outstanding_pages -= pages; >>>>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>>>> should only be set from domain_set_outstanding_pages(). >>>>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>>>> increase either outstanding_claims or d->outstanding_pages. >>>>>>> >>>>>>> At best the behavior is inconsistent, because once >>>>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>>>> from domain_adjust_tot_pages(). >>>>>> Right, at that point the claim has run out. While freeing pages with an >>>>>> active claim means that the claim gets bigger (which naturally needs >>>>>> reflecting in the global). >>>>> domain_adjust_tot_pages() is not exclusively called when freeing >>>>> pages, see steal_page() for example. >>>> Or also when pages were allocated. steal_page() ... >>>> >>>>> When called from steal_page() it's wrong to increase the claim, as >>>>> it assumes that the page removed from d->tot_pages is freed, but >>>>> that's not the case. The domain might end up in a situation where >>>>> the claim is bigger than the available amount of memory. >>>> ... is a case that likely wasn't considered when the feature was added. >>>> >>>> I never really liked this; I'd be quite happy to see it ripped out, as >>>> long as we'd be reasonably certain it isn't in active use by people. >>> What do you mean with 'it' in the above sentence, the whole claim >>> stuff? >> Yes. >> >>> Or just getting rid of allowing the claim to increase as a >>> result of pages being removed from a domain? >> No. > > Alejandro and I discussed this earlier in the week. > > The claim infrastructure stuff is critical for a toolstack capable of > doing things in parallel. > > However, it is also nonsensical for there to be a remaining claim by the > time domain construction is done. I'm not entirely sure about this. Iirc it was the tmem work where this was added, and then pretty certainly it was needed also for already running domains. > If creation_finished were a concrete thing, rather than a bodge hacked > into domain_unpause_by_systemcontroller(), it ought to be made to fail > if there were an outstanding claim. I suggested that we follow through > on a previous suggestion of making it a real hypercall (which is needed > by the encrypted VM crowd too). Rather than failing we could simply zap the leftover? Jan
On 26/02/2025 4:42 pm, Jan Beulich wrote: > On 26.02.2025 17:34, Andrew Cooper wrote: >> On 26/02/2025 4:06 pm, Jan Beulich wrote: >>> On 26.02.2025 17:04, Roger Pau Monné wrote: >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote: >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>>>>> --- a/xen/common/page_alloc.c >>>>>>>>> +++ b/xen/common/page_alloc.c >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>>>>> >>>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>>> { >>>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>>>>> - >>>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>>>>> d->tot_pages += pages; >>>>>>>>> >>>>>>>>> /* >>>>>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>>>>> * domain_set_outstanding_pages below >>>>>>>>> */ >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>>> goto out; >>>>>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>>>>> (and avoid taking the heap_lock) >>>>>>> Are there callers passing in 0? >>>>>> Not sure, but if there are no callers expected we might add an ASSERT >>>>>> to that effect then. >>>>>> >>>>>>>>> spin_lock(&heap_lock); >>>>>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>>>>> - dom_before = d->outstanding_pages; >>>>>>>>> - dom_after = dom_before - pages; >>>>>>>>> - BUG_ON(dom_before < 0); >>>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>>>>> - d->outstanding_pages = dom_claimed; >>>>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>>>>> - sys_before = outstanding_claims; >>>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>>>>> - BUG_ON(sys_after < 0); >>>>>>>>> - outstanding_claims = sys_after; >>>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>>>>> + { >>>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>>>>> + d->outstanding_pages = 0; >>>>>>>>> + } else { >>>>>>>>> + outstanding_claims -= pages; >>>>>>>>> + d->outstanding_pages -= pages; >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>>>>> should only be set from domain_set_outstanding_pages(). >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>>>>> increase either outstanding_claims or d->outstanding_pages. >>>>>>>> >>>>>>>> At best the behavior is inconsistent, because once >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>>>>> from domain_adjust_tot_pages(). >>>>>>> Right, at that point the claim has run out. While freeing pages with an >>>>>>> active claim means that the claim gets bigger (which naturally needs >>>>>>> reflecting in the global). >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing >>>>>> pages, see steal_page() for example. >>>>> Or also when pages were allocated. steal_page() ... >>>>> >>>>>> When called from steal_page() it's wrong to increase the claim, as >>>>>> it assumes that the page removed from d->tot_pages is freed, but >>>>>> that's not the case. The domain might end up in a situation where >>>>>> the claim is bigger than the available amount of memory. >>>>> ... is a case that likely wasn't considered when the feature was added. >>>>> >>>>> I never really liked this; I'd be quite happy to see it ripped out, as >>>>> long as we'd be reasonably certain it isn't in active use by people. >>>> What do you mean with 'it' in the above sentence, the whole claim >>>> stuff? >>> Yes. >>> >>>> Or just getting rid of allowing the claim to increase as a >>>> result of pages being removed from a domain? >>> No. >> Alejandro and I discussed this earlier in the week. >> >> The claim infrastructure stuff is critical for a toolstack capable of >> doing things in parallel. >> >> However, it is also nonsensical for there to be a remaining claim by the >> time domain construction is done. > I'm not entirely sure about this. Iirc it was the tmem work where this was > added, and then pretty certainly it was needed also for already running > domains. It wasn't TMEM. It was generally large-memory VMs. The problem is if you've got 2x 2T VMs booting on a 3T system. Previously, you'd start building both of them, and minutes later they both fail because Xen was fully out of memory. Claim was introduced to atomically reserve the memory you were intending to build a domain with. For XenServer, we're working on NUMA fixes, and something that's important there is to be able to reserve memory on a specific NUMA node (hence why this is all getting looked at). >> If creation_finished were a concrete thing, rather than a bodge hacked >> into domain_unpause_by_systemcontroller(), it ought to be made to fail >> if there were an outstanding claim. I suggested that we follow through >> on a previous suggestion of making it a real hypercall (which is needed >> by the encrypted VM crowd too). > Rather than failing we could simply zap the leftover? Hmm. Perhaps. I'd be slightly wary about zapping a claim, but there should only be an outstanding claim if the toolstack did something wrong. OTOH, we absolutely definitely do need a real hypercall here at some point soon. ~Andrew
On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote: > On 24.02.2025 15:49, Alejandro Vallejo wrote: > > Open question to whoever reviews this... > > > > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > >> spin_lock(&heap_lock); > >> - /* adjust domain outstanding pages; may not go negative */ > >> - dom_before = d->outstanding_pages; > >> - dom_after = dom_before - pages; > >> - BUG_ON(dom_before < 0); > >> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >> - d->outstanding_pages = dom_claimed; > >> - /* flag accounting bug if system outstanding_claims would go negative */ > >> - sys_before = outstanding_claims; > >> - sys_after = sys_before - (dom_before - dom_claimed); > >> - BUG_ON(sys_after < 0); > >> - outstanding_claims = sys_after; > >> + BUG_ON(outstanding_claims < d->outstanding_pages); > >> + if ( pages > 0 && d->outstanding_pages < pages ) > >> + { > >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >> + outstanding_claims -= d->outstanding_pages; > >> + d->outstanding_pages = 0; > > > > While this matches the previous behaviour, do we _really_ want it? It's weird, > > quirky, and it hard to extend to NUMA-aware claims (which is something in > > midway through). > > > > Wouldn't it make sense to fail the allocation (earlier) if the claim has run > > out? Do we even expect this to ever happen this late in the allocation call > > chain? > > This goes back to what a "claim" means. Even without any claim, a domain may > allocate memory. So a claim having run out doesn't imply allocation has to > fail. Hmmm... but that violates the purpose of the claim infra as far as I understand it. If a domain may overallocate by (e.g) ballooning in memory it can distort the ability of another domain to start up, even if it succeeded in its own claim. We might also break the invariant that total claims are strictly >= total_avail_pages. I'm somewhat puzzled at the "why" of having separate concepts for max_mem and claims. I guess it simply grew the way it did. Reinstating sanity would probably involve making max_mem effectively the claim, but that's a ton of work I really would rather not do for now. > > NUMA-aware claims require more than an adjustment just here, I expect. Tracking > of claims (certainly the global, maybe also the per-domain value) would likely > need to become per-node, for example. A fair amount more, yes. I'm preparing a series on the side to address per-node claims and it's far more invasive on page_alloc.c. This function was just sufficiently impossible to read that I felt the urge to send it ahead of time for my own mental health. > > Jan Cheers, Alejandro
On Wed Feb 26, 2025 at 4:51 PM GMT, Andrew Cooper wrote: > On 26/02/2025 4:42 pm, Jan Beulich wrote: > > On 26.02.2025 17:34, Andrew Cooper wrote: > >> On 26/02/2025 4:06 pm, Jan Beulich wrote: > >>> On 26.02.2025 17:04, Roger Pau Monné wrote: > >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: > >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote: > >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: > >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >>>>>>>>> --- a/xen/common/page_alloc.c > >>>>>>>>> +++ b/xen/common/page_alloc.c > >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > >>>>>>>>> > >>>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>>>>>>> { > >>>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >>>>>>>>> - > >>>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >>>>>>>>> d->tot_pages += pages; > >>>>>>>>> > >>>>>>>>> /* > >>>>>>>>> - * can test d->claimed_pages race-free because it can only change > >>>>>>>>> + * can test d->outstanding_pages race-free because it can only change > >>>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also > >>>>>>>>> * domain_set_outstanding_pages below > >>>>>>>>> */ > >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>>>>>>> goto out; > >>>>>>>> I think you can probably short-circuit the logic below if pages == 0? > >>>>>>>> (and avoid taking the heap_lock) > >>>>>>> Are there callers passing in 0? > >>>>>> Not sure, but if there are no callers expected we might add an ASSERT > >>>>>> to that effect then. > >>>>>> > >>>>>>>>> spin_lock(&heap_lock); > >>>>>>>>> - /* adjust domain outstanding pages; may not go negative */ > >>>>>>>>> - dom_before = d->outstanding_pages; > >>>>>>>>> - dom_after = dom_before - pages; > >>>>>>>>> - BUG_ON(dom_before < 0); > >>>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>>>>>>> - d->outstanding_pages = dom_claimed; > >>>>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ > >>>>>>>>> - sys_before = outstanding_claims; > >>>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>>>>>>> - BUG_ON(sys_after < 0); > >>>>>>>>> - outstanding_claims = sys_after; > >>>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>>>>>>> + { > >>>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >>>>>>>>> + outstanding_claims -= d->outstanding_pages; > >>>>>>>>> + d->outstanding_pages = 0; > >>>>>>>>> + } else { > >>>>>>>>> + outstanding_claims -= pages; > >>>>>>>>> + d->outstanding_pages -= pages; > >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify > >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values > >>>>>>>> should only be set from domain_set_outstanding_pages(). > >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never > >>>>>>>> increase either outstanding_claims or d->outstanding_pages. > >>>>>>>> > >>>>>>>> At best the behavior is inconsistent, because once > >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification > >>>>>>>> from domain_adjust_tot_pages(). > >>>>>>> Right, at that point the claim has run out. While freeing pages with an > >>>>>>> active claim means that the claim gets bigger (which naturally needs > >>>>>>> reflecting in the global). > >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing > >>>>>> pages, see steal_page() for example. > >>>>> Or also when pages were allocated. steal_page() ... > >>>>> > >>>>>> When called from steal_page() it's wrong to increase the claim, as > >>>>>> it assumes that the page removed from d->tot_pages is freed, but > >>>>>> that's not the case. The domain might end up in a situation where > >>>>>> the claim is bigger than the available amount of memory. > >>>>> ... is a case that likely wasn't considered when the feature was added. > >>>>> > >>>>> I never really liked this; I'd be quite happy to see it ripped out, as > >>>>> long as we'd be reasonably certain it isn't in active use by people. > >>>> What do you mean with 'it' in the above sentence, the whole claim > >>>> stuff? > >>> Yes. > >>> > >>>> Or just getting rid of allowing the claim to increase as a > >>>> result of pages being removed from a domain? > >>> No. > >> Alejandro and I discussed this earlier in the week. > >> > >> The claim infrastructure stuff is critical for a toolstack capable of > >> doing things in parallel. > >> > >> However, it is also nonsensical for there to be a remaining claim by the > >> time domain construction is done. > > I'm not entirely sure about this. Iirc it was the tmem work where this was > > added, and then pretty certainly it was needed also for already running > > domains. > > It wasn't TMEM. It was generally large-memory VMs. > > The problem is if you've got 2x 2T VMs booting on a 3T system. > > Previously, you'd start building both of them, and minutes later they > both fail because Xen was fully out of memory. > > Claim was introduced to atomically reserve the memory you were intending > to build a domain with. > > For XenServer, we're working on NUMA fixes, and something that's > important there is to be able to reserve memory on a specific NUMA node > (hence why this is all getting looked at). > >> If creation_finished were a concrete thing, rather than a bodge hacked > >> into domain_unpause_by_systemcontroller(), it ought to be made to fail > >> if there were an outstanding claim. I suggested that we follow through > >> on a previous suggestion of making it a real hypercall (which is needed > >> by the encrypted VM crowd too). > > Rather than failing we could simply zap the leftover? > > Hmm. Perhaps. > > I'd be slightly wary about zapping a claim, but there should only be an > outstanding claim if the toolstack did something wrong. > > OTOH, we absolutely definitely do need a real hypercall here at some > point soon. > > ~Andrew It should probably be removed at the end. Otherwise we're effectively leaking all claimed but non-used memory for the lifetime of the domain. Cheers, Alejandro
On Wed Feb 26, 2025 at 2:28 PM GMT, Roger Pau Monné wrote: > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > > On 26.02.2025 14:56, Roger Pau Monné wrote: > > > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > > >> --- a/xen/common/page_alloc.c > > >> +++ b/xen/common/page_alloc.c > > >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > > >> > > >> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > > >> { > > >> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > > >> - > > >> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > > >> d->tot_pages += pages; > > >> > > >> /* > > >> - * can test d->claimed_pages race-free because it can only change > > >> + * can test d->outstanding_pages race-free because it can only change > > >> * if d->page_alloc_lock and heap_lock are both held, see also > > >> * domain_set_outstanding_pages below > > >> */ > > >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > > >> goto out; > > > > > > I think you can probably short-circuit the logic below if pages == 0? > > > (and avoid taking the heap_lock) > > > > Are there callers passing in 0? > > Not sure, but if there are no callers expected we might add an ASSERT > to that effect then. > > > >> spin_lock(&heap_lock); > > >> - /* adjust domain outstanding pages; may not go negative */ > > >> - dom_before = d->outstanding_pages; > > >> - dom_after = dom_before - pages; > > >> - BUG_ON(dom_before < 0); > > >> - dom_claimed = dom_after < 0 ? 0 : dom_after; > > >> - d->outstanding_pages = dom_claimed; > > >> - /* flag accounting bug if system outstanding_claims would go negative */ > > >> - sys_before = outstanding_claims; > > >> - sys_after = sys_before - (dom_before - dom_claimed); > > >> - BUG_ON(sys_after < 0); > > >> - outstanding_claims = sys_after; > > >> + BUG_ON(outstanding_claims < d->outstanding_pages); > > >> + if ( pages > 0 && d->outstanding_pages < pages ) > > >> + { > > >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > > >> + outstanding_claims -= d->outstanding_pages; > > >> + d->outstanding_pages = 0; > > >> + } else { > > >> + outstanding_claims -= pages; > > >> + d->outstanding_pages -= pages; > > > > > > I wonder if it's intentional for a pages < 0 value to modify > > > outstanding_claims and d->outstanding_pages, I think those values > > > should only be set from domain_set_outstanding_pages(). > > > domain_adjust_tot_pages() should only decrease the value, but never > > > increase either outstanding_claims or d->outstanding_pages. > > > > > > At best the behavior is inconsistent, because once > > > d->outstanding_pages reaches 0 there will be no further modification > > > from domain_adjust_tot_pages(). > > > > Right, at that point the claim has run out. While freeing pages with an > > active claim means that the claim gets bigger (which naturally needs > > reflecting in the global). > > domain_adjust_tot_pages() is not exclusively called when freeing > pages, see steal_page() for example. > > When called from steal_page() it's wrong to increase the claim, as > it assumes that the page removed from d->tot_pages is freed, but > that's not the case. The domain might end up in a situation where > the claim is bigger than the available amount of memory. > > Thanks, Roger. This is what I meant by my initial reply questioning the logic itself. It's all very dubious with memory_exchange and makes very little sense on the tentative code I have for per-node claims. I'd be quite happy to put an early exit before the spin_lock on pages <= 0. That also covers your initial comment and prevents claims from growing after a domain started running if it didn't happen to consume all of them. Is anyone opposed? Cheers, Alejandro
On Wed Feb 26, 2025 at 2:02 PM GMT, Jan Beulich wrote: > On 24.02.2025 14:27, Alejandro Vallejo wrote: > > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > > goto out; > > > > spin_lock(&heap_lock); > > - /* adjust domain outstanding pages; may not go negative */ > > - dom_before = d->outstanding_pages; > > - dom_after = dom_before - pages; > > - BUG_ON(dom_before < 0); > > - dom_claimed = dom_after < 0 ? 0 : dom_after; > > - d->outstanding_pages = dom_claimed; > > - /* flag accounting bug if system outstanding_claims would go negative */ > > - sys_before = outstanding_claims; > > - sys_after = sys_before - (dom_before - dom_claimed); > > - BUG_ON(sys_after < 0); > > - outstanding_claims = sys_after; > > + BUG_ON(outstanding_claims < d->outstanding_pages); > > + if ( pages > 0 && d->outstanding_pages < pages ) > > The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity, > after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite > right. d->outstanding pages is unsigned, but pages isn't. It was originally like that, but I then got concerned about 32bit machines, where you'd be comparing a signed and an unsigned integer of the same not-very-large width. That seems like dangerous terrains if the unsigned number grows large enough. TL;DR: It's there for clarity and paranoia. Even if the overflowing into bit 31 would be rare in such a system. > > > + { > > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > > + outstanding_claims -= d->outstanding_pages; > > + d->outstanding_pages = 0; > > + } else { > > Nit: Braces on their own lines please. Ack. > > In any event - yes, this reads quite a bit easier after the adjustment. > > With the adjustments (happy to make while committing, so long as you agree) > Reviewed-by: Jan Beulich <jbeulich@suse.com> Thanks. I'd probably like to hold off and send a v2 if you're fine with the adjustment I answered Roger with (returning ealy on pages <= 0, so claims are never increased on free). > > Jan > Cheers, Alejandro
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 1bf070c8c5df..4306753f0bd2 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) { - long dom_before, dom_after, dom_claimed, sys_before, sys_after; - ASSERT(rspin_is_locked(&d->page_alloc_lock)); d->tot_pages += pages; /* - * can test d->claimed_pages race-free because it can only change + * can test d->outstanding_pages race-free because it can only change * if d->page_alloc_lock and heap_lock are both held, see also * domain_set_outstanding_pages below */ @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) goto out; spin_lock(&heap_lock); - /* adjust domain outstanding pages; may not go negative */ - dom_before = d->outstanding_pages; - dom_after = dom_before - pages; - BUG_ON(dom_before < 0); - dom_claimed = dom_after < 0 ? 0 : dom_after; - d->outstanding_pages = dom_claimed; - /* flag accounting bug if system outstanding_claims would go negative */ - sys_before = outstanding_claims; - sys_after = sys_before - (dom_before - dom_claimed); - BUG_ON(sys_after < 0); - outstanding_claims = sys_after; + BUG_ON(outstanding_claims < d->outstanding_pages); + if ( pages > 0 && d->outstanding_pages < pages ) + { + /* `pages` exceeds the domain's outstanding count. Zero it out. */ + outstanding_claims -= d->outstanding_pages; + d->outstanding_pages = 0; + } else { + outstanding_claims -= pages; + d->outstanding_pages -= pages; + } spin_unlock(&heap_lock); out:
The logic has too many levels of indirection and it's very hard to understand it its current form. Split it between the corner case where the adjustment is bigger than the current claim and the rest to avoid 5 auxiliary variables. While at it, fix incorrect field name in nearby comment. Not a functional change (although by its nature it might not seem immediately obvious at first). Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> --- xen/common/page_alloc.c | 25 +++++++++++-------------- 1 file changed, 11 insertions(+), 14 deletions(-)