
[XEN,v2,1/2] Check zone before merging adjacent blocks in heap

Message ID 20200204151441.10626-1-stewart.hildebrand@dornerworks.com (mailing list archive)
State New, archived
Series [XEN,v2,1/2] Check zone before merging adjacent blocks in heap

Commit Message

Stewart Hildebrand Feb. 4, 2020, 3:14 p.m. UTC
From: Jeff Kubascik <jeff.kubascik@dornerworks.com>

The Xen heap is split up into nodes and zones. Each node + zone is
managed as a separate pool of memory.

When returning pages to the heap, free_heap_pages will check adjacent
blocks to see if they can be combined into a larger block. However, the
zone of the adjacent block is not checked. This results in blocks that
migrate from one zone to another.

When a block migrates to the adjacent zone, the avail counters for the
old and new node + zone are not updated accordingly. The avail counter
is used when allocating pages to determine whether to skip over a zone.
With this behavior, it is possible for free pages to collect in a zone
with the avail counter smaller than the actual page count, resulting
in free pages that are not allocable.

This commit adds a check to compare the adjacent block's zone with the
current zone before merging them.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@dornerworks.com>

---
v2: s/pg-mask/predecessor/
---
 xen/common/page_alloc.c | 2 ++
 1 file changed, 2 insertions(+)
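
(For context, an illustration rather than part of the patch: for RAM pages
Xen's page_to_zone() derives the zone from the highest set bit of the MFN,
so, leaving the dedicated Xen-heap zone aside, two free buddies can only sit
in different zones when the block containing MFN 0 is involved. A simplified
standalone sketch of that arithmetic, not Xen code:)

#include <stdio.h>

/* Simplified stand-in for page_to_zone(): index of the MFN's highest set bit. */
static unsigned int mfn_to_zone(unsigned long mfn)
{
    unsigned int zone = 0;

    while ( mfn >>= 1 )
        zone++;
    return zone;
}

int main(void)
{
    unsigned long mfn = 0x100;                   /* first order-8 block of zone 8 */
    unsigned int order = 8;
    unsigned long buddy = mfn ^ (1UL << order);  /* its predecessor buddy: MFN 0 */

    printf("block %#lx: zone %u; order-%u buddy %#lx: zone %u\n",
           mfn, mfn_to_zone(mfn), order, buddy, mfn_to_zone(buddy));

    /* Without a zone check these would merge into one order-9 block spanning
     * several zones, leaving the per-zone avail counters stale. Away from
     * MFN 0, aligned buddies share their top bit and hence their zone, which
     * is why keeping MFN 0 out of the buddy allocator (discussed below)
     * avoids the problem in practice. */
    return 0;
}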

Comments

Jan Beulich Feb. 4, 2020, 3:22 p.m. UTC | #1
On 04.02.2020 16:14, Stewart Hildebrand wrote:
> From: Jeff Kubascik <jeff.kubascik@dornerworks.com>
> 
> The Xen heap is split up into nodes and zones. Each node + zone is
> managed as a separate pool of memory.
> 
> When returning pages to the heap, free_heap_pages will check adjacent
> blocks to see if they can be combined into a larger block. However, the
> zone of the adjacent block is not checked. This results in blocks that
> migrate from one zone to another.
> 
> When a block migrates to the adjacent zone, the avail counters for the
> old and new node + zone is not updated accordingly. The avail counter
> is used when allocating pages to determine whether to skip over a zone.
> With this behavior, it is possible for free pages to collect in a zone
> with the avail counter smaller than the actual page count, resulting
> in free pages that are not allocable.

"When a block migrates" - fine. But is this situation possible to
occur, without "xen/page_alloc: Keep away MFN 0 from the buddy
allocator" reverted? If not, there's no bug, no need for a change,
and even less so ...

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1462,6 +1462,7 @@ static void free_heap_pages(
>              if ( !mfn_valid(page_to_mfn(predecessor)) ||
>                   !page_state_is(predecessor, free) ||
>                   (PFN_ORDER(predecessor) != order) ||
> +                 (page_to_zone(predecessor) != zone) ||
>                   (phys_to_nid(page_to_maddr(predecessor)) != node) )
>                  break;
>  
> @@ -1485,6 +1486,7 @@ static void free_heap_pages(
>              if ( !mfn_valid(page_to_mfn(successor)) ||
>                   !page_state_is(successor, free) ||
>                   (PFN_ORDER(successor) != order) ||
> +                 (page_to_zone(successor) != zone) ||
>                   (phys_to_nid(page_to_maddr(successor)) != node) )
>                  break;

... for one that slows down many free operations, even if just
slightly. IOW afaict either the change is not needed, or its
description needs updating.

Jan
George Dunlap Feb. 4, 2020, 3:37 p.m. UTC | #2
On 2/4/20 3:22 PM, Jan Beulich wrote:
> On 04.02.2020 16:14, Stewart Hildebrand wrote:
>> From: Jeff Kubascik <jeff.kubascik@dornerworks.com>
>>
>> The Xen heap is split up into nodes and zones. Each node + zone is
>> managed as a separate pool of memory.
>>
>> When returning pages to the heap, free_heap_pages will check adjacent
>> blocks to see if they can be combined into a larger block. However, the
>> zone of the adjacent block is not checked. This results in blocks that
>> migrate from one zone to another.
>>
>> When a block migrates to the adjacent zone, the avail counters for the
>> old and new node + zone is not updated accordingly. The avail counter
>> is used when allocating pages to determine whether to skip over a zone.
>> With this behavior, it is possible for free pages to collect in a zone
>> with the avail counter smaller than the actual page count, resulting
>> in free pages that are not allocable.
> 
> "When a block migrates" - fine. But is this situation possible to
> occur, without "xen/page_alloc: Keep away MFN 0 from the buddy
> allocator" reverted? If not, there's no bug, no need for a change,
> and even less so ...
> 
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -1462,6 +1462,7 @@ static void free_heap_pages(
>>              if ( !mfn_valid(page_to_mfn(predecessor)) ||
>>                   !page_state_is(predecessor, free) ||
>>                   (PFN_ORDER(predecessor) != order) ||
>> +                 (page_to_zone(predecessor) != zone) ||
>>                   (phys_to_nid(page_to_maddr(predecessor)) != node) )
>>                  break;
>>  
>> @@ -1485,6 +1486,7 @@ static void free_heap_pages(
>>              if ( !mfn_valid(page_to_mfn(successor)) ||
>>                   !page_state_is(successor, free) ||
>>                   (PFN_ORDER(successor) != order) ||
>> +                 (page_to_zone(successor) != zone) ||
>>                   (phys_to_nid(page_to_maddr(successor)) != node) )
>>                  break;
> 
> ... for one that slows down many free operations, even if just
> slightly. IOW afaict either the change is not needed, or its
> description needs updating.

At very least it's more robust this way; the algorithm is also less
"magic".  We just had a long discussion this morning trying to re-create
the logic for why "Remove MFN 0" was sufficient to prevent this issue,
and even then David wasn't sure it was correct at first.

 -George
Stewart Hildebrand Feb. 4, 2020, 3:37 p.m. UTC | #3
On Tuesday, February 4, 2020 10:22 AM, Jan Beulich wrote:
>On 04.02.2020 16:14, Stewart Hildebrand wrote:
>> From: Jeff Kubascik <jeff.kubascik@dornerworks.com>
>>
>> The Xen heap is split up into nodes and zones. Each node + zone is
>> managed as a separate pool of memory.
>>
>> When returning pages to the heap, free_heap_pages will check adjacent
>> blocks to see if they can be combined into a larger block. However, the
>> zone of the adjacent block is not checked. This results in blocks that
>> migrate from one zone to another.
>>
>> When a block migrates to the adjacent zone, the avail counters for the
>> old and new node + zone is not updated accordingly. The avail counter
>> is used when allocating pages to determine whether to skip over a zone.
>> With this behavior, it is possible for free pages to collect in a zone
>> with the avail counter smaller than the actual page count, resulting
>> in free pages that are not allocable.
>
>"When a block migrates" - fine. But is this situation possible to
>occur, without "xen/page_alloc: Keep away MFN 0 from the buddy
>allocator" reverted?

No, not as far as I'm aware, though I have not studied this code in detail so I don't feel fully confident in my "no".

> If not, there's no bug, no need for a change,
>and even less so ...
>
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -1462,6 +1462,7 @@ static void free_heap_pages(
>>              if ( !mfn_valid(page_to_mfn(predecessor)) ||
>>                   !page_state_is(predecessor, free) ||
>>                   (PFN_ORDER(predecessor) != order) ||
>> +                 (page_to_zone(predecessor) != zone) ||
>>                   (phys_to_nid(page_to_maddr(predecessor)) != node) )
>>                  break;
>>
>> @@ -1485,6 +1486,7 @@ static void free_heap_pages(
>>              if ( !mfn_valid(page_to_mfn(successor)) ||
>>                   !page_state_is(successor, free) ||
>>                   (PFN_ORDER(successor) != order) ||
>> +                 (page_to_zone(successor) != zone) ||
>>                   (phys_to_nid(page_to_maddr(successor)) != node) )
>>                  break;
>
>... for one that slows down many free operations, even if just
>slightly. IOW afaict either the change is not needed, or its
>description needs updating.

Right. An alternative that wouldn't potentially slow things down in production builds would be to apply the ASSERT from patch 2. I don't have any performance metrics regarding exactly how much of a performance hit this would incur.

Stew

>
>Jan
David Woodhouse Feb. 5, 2020, 9:50 a.m. UTC | #4
On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
> At very least it's more robust this way; the algorithm is also less
> "magic".  We just had a long discussion this morning trying to re-create
> the logic for why "Remove MFN 0" was sufficient to prevent this issue,
> and even then David wasn't sure it was correct at first.

Right. So the real reason I'm staring hard at this is because I can't
convince myself there aren't places where memory allocated by the boot
allocator is later freed with free_xenheap_pages().

We have a few pieces of code which decide whether to use the boot
allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
the rule is "thou shalt not allocate with boot allocator and free
later" then it's *extremely* fragile and probably being violated —
especially because it actually *works* most of the time, except in some
esoteric corner cases like MFN#0, boot allocations which cross
zones/regions, etc.

So because we want to make that *more* likely by allowing vmap() to
happen earlier, I'd like to clean things up by addressing those corner
cases and making it unconditionally OK to free boot-allocated pages
into the heap.

I think it might be as simple as checking for (first_pg)->count_info == 0
in free_xenheap_pages(). That's quick enough, and if the count_info is
zero then I think it does indicate a boot-allocated page, because pages
from alloc_xenheap_pages() would have PGC_xen_heap set?

It would suffice just to pass such pages to init_heap_pages() instead
of directly to free_heap_pages(), I think. Julien?

The straw man version of that looks a bit like this...

--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
 
     pg = virt_to_page(v);
 
+    /* Pages from the boot allocator need to pass through init_heap_pages() */
+    if ( unlikely(!pg->count_info) )
+    {
+        init_heap_pages(pg, 1 << order);
+        return;
+    }
     for ( i = 0; i < (1u << order); i++ )
         pg[i].count_info &= ~PGC_xen_heap;
Jan Beulich Feb. 5, 2020, 10:02 a.m. UTC | #5
On 05.02.2020 10:50, David Woodhouse wrote:
> On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
>> At very least it's more robust this way; the algorithm is also less
>> "magic".  We just had a long discussion this morning trying to re-create
>> the logic for why "Remove MFN 0" was sufficient to prevent this issue,
>> and even then David wasn't sure it was correct at first.
> 
> Right. So the real reason I'm staring hard at this is because I can't
> convince myself there aren't places where memory allocated by the boot
> allocator is later freed with free_xenheap_pages().
> 
> We have a few pieces of code which decide whether to use the boot
> allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
> the rule is "thou shalt not allocate with boot allocator and free
> later" then it's *extremely* fragile and probably being violated —
> especially because it actually *works* most of the time, except in some
> esoteric corner cases like MFN#0, boot allocations which cross
> zones/regions, etc.
> 
> So because we want to make that *more* likely by allowing vmap() to
> happen earlier, I'd like to clean things up by addressing those corner
> cases and making it unconditionally OK to free boot-allocated pages
> into the heap.
> 
> I think might be as simple as checking for (first_pg)->count_info == 0
> in free_xenheap_pages(). That's quick enough, and if the count_info is
> zero then I think it does indicate a boot-allocated page, because pages
> from alloc_xenheap_pages() would have PGC_xen_heap set?

They would, but that leaves {alloc,free}_domheap_pages() out of
the picture. I.e. ...

> It would suffice just to pass such pages to init_heap_pages() instead
> of directly to free_heap_pages(), I think. Julien?
> 
> The straw man version of that looks a bit like this...
> 
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
>  
>      pg = virt_to_page(v);
>  
> +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> +    if ( unlikely(!pg->count_info) )

... while I think this check may be fine here, no similar one
can be used in free_domheap_pages(), yet pages getting handed
there isn't less likely than ones getting handed to
free_xenheap_pages() (if we already fear mismatch).

Jan
Julien Grall Feb. 5, 2020, 10:22 a.m. UTC | #6
Hi,

On 05/02/2020 09:50, David Woodhouse wrote:
> On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
>> At very least it's more robust this way; the algorithm is also less
>> "magic".  We just had a long discussion this morning trying to re-create
>> the logic for why "Remove MFN 0" was sufficient to prevent this issue,
>> and even then David wasn't sure it was correct at first.
> 
> Right. So the real reason I'm staring hard at this is because I can't
> convince myself there aren't places where memory allocated by the boot
> allocator is later freed with free_xenheap_pages().
> 
> We have a few pieces of code which decide whether to use the boot
> allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
> the rule is "thou shalt not allocate with boot allocator and free
> later" then it's *extremely* fragile and probably being violated —
> especially because it actually *works* most of the time, except in some
> esoteric corner cases like MFN#0, boot allocations which cross
> zones/regions, etc.
> 
> So because we want to make that *more* likely by allowing vmap() to
> happen earlier, I'd like to clean things up by addressing those corner
> cases and making it unconditionally OK to free boot-allocated pages
> into the heap.
> 
> I think might be as simple as checking for (first_pg)->count_info == 0
> in free_xenheap_pages(). That's quick enough, and if the count_info is
> zero then I think it does indicate a boot-allocated page, because pages
> from alloc_xenheap_pages() would have PGC_xen_heap set?
> 
> It would suffice just to pass such pages to init_heap_pages() instead
> of directly to free_heap_pages(), I think. Julien?
> 
> The straw man version of that looks a bit like this...
> 
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
>   
>       pg = virt_to_page(v);
>   
> +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> +    if ( unlikely(!pg->count_info) )

Note that there are two versions of free_xenheap_pages(). This one only 
covers the case where the domheap and xenheap are the same.

But you can't use the same trick when xenheap is separated (like on 
Arm32) because PGC_xen_heap is not set. So you would be calling 
init_heap_pages() every time.

However, I don't like the idea of relying on count_info == 0. Indeed, 
there are valid cases where count_info == 0, because it means the page is 
in use (PGC_state_inuse).

It might be best if we introduce a new page state that would be the 
default value set in the frametable.

0 may be an option if we assign a different value to PGC_state_inuse, but 
we would need to check whether there are places relying on PGC_state_inuse == 
0. I know that assign_pages() does rely on it; I sent a patch for it 
yesterday.

Another option is to use an all 1s value and initialize the frametable 
to all 1s. But I am not entirely sure whether the allocator would be 
able to cope with it.

Cheers,
David Woodhouse Feb. 5, 2020, 10:24 a.m. UTC | #7
On Wed, 2020-02-05 at 11:02 +0100, Jan Beulich wrote:
> > +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> > +    if ( unlikely(!pg->count_info) )
> 
> ... while I think this check may be fine here, no similar one
> can be used in free_domheap_pages(), yet pages getting handed
> there isn't less likely than ones getting handed to
> free_xenheap_pages() (if we already fear mismatch).

Do we care about that?

ICBW but I don't think I've seen a case where boot-allocated pages get
handed to free_domheap_pages() later. I've only seen them handed to
free_xenheap_pages(). These are pages which are mapped to Xen, not
domheap pages.

You are already expected *not* to conflate free_xenheap_pages() and
free_domheap_pages().

I think it should be OK to declare that freeing boot-allocated pages
with free_xenheap_pages() is permitted, but freeing them with
free_domheap_pages() isn't.

(Note my straw man patch didn't fix the CONFIG_SEPARATE_XENHEAP case
and doesn't trivially port to that version of free_xenheap_pages()
because the PGC_xen_heap flag isn't used there. But it's fixable
relatively easily.)
David Woodhouse Feb. 5, 2020, 10:32 a.m. UTC | #8
On Wed, 2020-02-05 at 10:22 +0000, Julien Grall wrote:
> Hi,
> 
> On 05/02/2020 09:50, David Woodhouse wrote:
> > On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
> > > At very least it's more robust this way; the algorithm is also less
> > > "magic".  We just had a long discussion this morning trying to re-create
> > > the logic for why "Remove MFN 0" was sufficient to prevent this issue,
> > > and even then David wasn't sure it was correct at first.
> > 
> > Right. So the real reason I'm staring hard at this is because I can't
> > convince myself there aren't places where memory allocated by the boot
> > allocator is later freed with free_xenheap_pages().
> > 
> > We have a few pieces of code which decide whether to use the boot
> > allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
> > the rule is "thou shalt not allocate with boot allocator and free
> > later" then it's *extremely* fragile and probably being violated —
> > especially because it actually *works* most of the time, except in some
> > esoteric corner cases like MFN#0, boot allocations which cross
> > zones/regions, etc.
> > 
> > So because we want to make that *more* likely by allowing vmap() to
> > happen earlier, I'd like to clean things up by addressing those corner
> > cases and making it unconditionally OK to free boot-allocated pages
> > into the heap.
> > 
> > I think might be as simple as checking for (first_pg)->count_info == 0
> > in free_xenheap_pages(). That's quick enough, and if the count_info is
> > zero then I think it does indicate a boot-allocated page, because pages
> > from alloc_xenheap_pages() would have PGC_xen_heap set?
> > 
> > It would suffice just to pass such pages to init_heap_pages() instead
> > of directly to free_heap_pages(), I think. Julien?
> > 
> > The straw man version of that looks a bit like this...
> > 
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
> >   
> >       pg = virt_to_page(v);
> >   
> > +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> > +    if ( unlikely(!pg->count_info) )
> 
> Note that there is two versions of free_xenheap_pages(). This one only 
> cover the case where the domheap and xenheap are the same.
> 
> But you can't use the same trick when xenheap is separated (like on 
> Arm32) because PGC_xen_heap is not set. So you would be calling 
> init_heap_pages() everytime.

Right. We'd want to set PGC_xen_heap there too on the corresponding
pages.

> However, I don't like the idea of relying on count_info == 0. Indeed, 
> there are valid case where count_info == 0 because it means the page is 
> inuse (PCC_state_inuse).

For xenheap pages, not just domheap pages?

> It might be best if we introduce a new page state that would be the 
> default value set in the frametable.

... and which is easily set with memset() so it's fast, which I think
you considered since you were suggesting all 0xFF, but which we haven't
explicitly stated out loud.

The other thing to note is that it's easy to make a default state for
pages which *weren't* given out by the boot allocator, but we don't
have a simple way to do anything to the pages which *were* given out by
the boot allocator, since we don't track those and — fairly much by
definition — we don't have the frametable yet when we start the boot
allocator.
Jan Beulich Feb. 5, 2020, 10:49 a.m. UTC | #9
On 05.02.2020 11:24, David Woodhouse wrote:
> On Wed, 2020-02-05 at 11:02 +0100, Jan Beulich wrote:
>>> +    /* Pages from the boot allocator need to pass through init_heap_pages() */
>>> +    if ( unlikely(!pg->count_info) )
>>
>> ... while I think this check may be fine here, no similar one
>> can be used in free_domheap_pages(), yet pages getting handed
>> there isn't less likely than ones getting handed to
>> free_xenheap_pages() (if we already fear mismatch).
> 
> Do we care about that?
> 
> ICBW but I don't think I've seen a case where boot-allocated pages get
> handed to free_domheap_pages() later. I've only seen them handed to
> free_xenheap_pages(). These are pages which are mapped to Xen, not
> domheap pages.
> 
> You are already expected *not* to conflate free_xenheap_pages() and
> free_domheap_pages().

That's the case now, but with us wanting to get rid of the direct map,
the sole difference will be whether a mapping gets established.
There'll likely be no need for a PGC_xen_heap flag anymore, as we
don't need to tell apart the underlying pages (at least as far as the
allocator is concerned).

> I think it should be OK to declare that freeing boot-allocated pages
> with free_xenheap_pages() is permitted, but freeing them with
> free_domheap_pages() isn't.

I don't think this is going to be a good idea. free_xenheap_pages()
right now expects an address within the direct map to be passed.
alloc_boot_pages(), however, returns an MFN, which may get mapped
to arbitrary linear addresses (or none at all). There's quite a bit
more similarity to alloc_domheap_pages(), as that returns
struct page_info *, which in turn has a reliable translation
to/from MFN. Yet, as you say elsewhere, whether an MFN has an
entry in frame_table[] is entirely unclear, so permitting boot-
allocator pages to be freed via alloc_domheap_pages() nevertheless
still doesn't look any better an idea to me.

Jan
David Woodhouse Feb. 5, 2020, 11:23 a.m. UTC | #10
On Wed, 2020-02-05 at 11:49 +0100, Jan Beulich wrote:
> Yet, as you say elsewhere, whether an MFN has an
> entry in frame_table[] is entirely unclear, so permitting boot-
> allocator pages to be freed via alloc_domheap_pages() nevertheless
> still doesn't look any better an idea to me.

Hm, I don't think I said that, did I? That would be a new and exciting
complication.

I think every MFN handed out by the boot allocator *should* have a
corresponding entry in the frame table. All the PDX setup is done for
those pages, just as it is for heap pages. In fact, some of that setup
is *only* done in init_boot_pages() right now, and if page ranges don't
go through the boot allocator and end up being transferred to the heap
in end_boot_allocator(), things (like badpages= on the command line)
don't work right.
David Woodhouse Feb. 5, 2020, 11:36 a.m. UTC | #11
On Wed, 2020-02-05 at 10:22 +0000, Julien Grall wrote:
> However, I don't like the idea of relying on count_info == 0. Indeed, 
> there are valid case where count_info == 0 because it means the page is 
> inuse (PCC_state_inuse).
> 
> It might be best if we introduce a new page state that would be the 
> default value set in the frametable.
> 
> 0 may be an option if we assign a different value to PGC_state_inuse but 
> we would need to check if there are places relying on PGC_state_inuse == 
> 0. I know that assign_pages() does rely on it, I have sent a patch for 
> it yesterday.

We'd need an extra bit for that.

Perhaps we could recognise that PGC_broken is not valid in combination
with some of the four PGC_state states, and take that bit back to turn
PGC_state into a 3-bit field, which could hold 8 states of which I
think we only currently need 7?

 { zero, inuse, offlining, offlining_broken, offline, offline_broken, free }
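
(Illustrative only, and not Xen's current PG_mask()/PG_shift() encoding;
spelled out as plain enumerators, that could look roughly like:)

enum page_state {
    PGC_state_uninitialised = 0,   /* the "zero"/untouched state above */
    PGC_state_inuse,
    PGC_state_offlining,
    PGC_state_offlining_broken,
    PGC_state_offline,
    PGC_state_offline_broken,
    PGC_state_free,
    /* one spare encoding left over */
};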

> Another option is to use an all 1s value and initialize the frametable 
> to all 1s. But I am not entirely sure whether the allocator would be 
> able to cope with it.

We'd zero it in init_heap_pages() for every page, and it would only be
still set to all 1s for pages from the boot allocator. As long as all
1s *is* proven to be an invalid state (I think it is), the same
kind of trick, checking unlikely(pg->count_info == -1) in the free
functions as in my earlier straw man, would probably work.

I think I prefer zero being the "untouched" state though, rather than
all 1s.
Jan Beulich Feb. 5, 2020, 1:37 p.m. UTC | #12
On 05.02.2020 12:23, David Woodhouse wrote:
> On Wed, 2020-02-05 at 11:49 +0100, Jan Beulich wrote:
>> Yet, as you say elsewhere, whether an MFN has an
>> entry in frame_table[] is entirely unclear, so permitting boot-
>> allocator pages to be freed via alloc_domheap_pages() nevertheless
>> still doesn't look any better an idea to me.
> 
> Hm, I don't think I said that, did I? That would be a new and exciting
> complication.
> 
> I think every MFN handed out by the boot allocator *should* have a
> corresponding entry in the frame table. All the PDX setup is done for
> those pages, just as it is for heap pages. In fact, some of that setup
> is *only* done in init_boot_pages() right now, and if page ranges don't
> go through the boot allocator and end up being transferred to the heap
> in end_boot_allocator(), things (like badpages= on the command line)
> don't work right.

I guess I should have said "whether an MFN has a properly initialized
entry in frame_table[] is entirely unclear".

Jan
David Woodhouse Feb. 5, 2020, 2:12 p.m. UTC | #13
> On 05.02.2020 12:23, David Woodhouse wrote:
>> On Wed, 2020-02-05 at 11:49 +0100, Jan Beulich wrote:
>>> Yet, as you say elsewhere, whether an MFN has an
>>> entry in frame_table[] is entirely unclear, so permitting boot-
>>> allocator pages to be freed via alloc_domheap_pages() nevertheless
>>> still doesn't look any better an idea to me.
>>
>> Hm, I don't think I said that, did I? That would be a new and exciting
>> complication.
>>
>> I think every MFN handed out by the boot allocator *should* have a
>> corresponding entry in the frame table. All the PDX setup is done for
>> those pages, just as it is for heap pages. In fact, some of that setup
>> is *only* done in init_boot_pages() right now, and if page ranges don't
>> go through the boot allocator and end up being transferred to the heap
>> in end_boot_allocator(), things (like badpages= on the command line)
>> don't work right.
>
> I guess I should have said "whether an MFN has a properly initialized
> entry in frame_table[] is entirely unclear".

Aha, right. Yes, I admit to having said that :)

I think we have a viable path to fixing that, by folding PGC_broken in to
the state bits so that we can disambiguate. Will experiment.
David Woodhouse Feb. 7, 2020, 3:49 p.m. UTC | #14
On Wed, 2020-02-05 at 14:12 +0000, David Woodhouse wrote:
> I think we have a viable path to fixing that, by folding PGC_broken in to
> the state bits so that we can disambiguate. Will experiment.

Here, it looks something like this. First we fold PGC_broken into the
state bits giving us 8 possible states of which only 6 are currently in
use.

Then in the second patch we can move PGC_state_inuse from zero (the
default contents of the frame table at startup), and make a new state
PGC_state_uninitialised with the value zero.

We can make free_xenheap_pages() and free_domheap_pages() call
init_heap_pages() instead of free_heap_pages() if they see a page range
which is still in PGC_state_uninitialised.

We need to fix up a couple of asserts not only to cope with
PGC_state_inuse not being zero (as Julien has already looked at) but
also to take PGC_state_uninitialised pages.

In assign_pages() that's because we map the multiboot module containing
the initramfs to dom0. That might actually cross node boundaries,
contain MFN#0, etc. — so if/when that gets released by dom0 we'd want
those pages to be passed to init_heap_pages() just the same as boot-
allocated memory.

In share_xen_page_with_guest() it happens because we share all non-RAM
page frames with dom0.

(2 patches follow...)
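
(A rough sketch of the dispatch this enables; this is the idea only, not
the two patches themselves. PGC_state_uninitialised and its
page_state_is() spelling are assumptions here, following the existing
pattern used for "free" and "inuse":)

    /* Near the top of free_xenheap_pages() / free_domheap_pages(): pages
     * never seen by init_heap_pages() are still in the uninitialised state,
     * so route them through init_heap_pages() rather than free_heap_pages(). */
    if ( unlikely(page_state_is(pg, uninitialised)) )
    {
        init_heap_pages(pg, 1 << order);
        return;
    }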

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 97902d42c1..735049fe3e 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1462,6 +1462,7 @@  static void free_heap_pages(
             if ( !mfn_valid(page_to_mfn(predecessor)) ||
                  !page_state_is(predecessor, free) ||
                  (PFN_ORDER(predecessor) != order) ||
+                 (page_to_zone(predecessor) != zone) ||
                  (phys_to_nid(page_to_maddr(predecessor)) != node) )
                 break;
 
@@ -1485,6 +1486,7 @@  static void free_heap_pages(
             if ( !mfn_valid(page_to_mfn(successor)) ||
                  !page_state_is(successor, free) ||
                  (PFN_ORDER(successor) != order) ||
+                 (page_to_zone(successor) != zone) ||
                  (phys_to_nid(page_to_maddr(successor)) != node) )
                 break;