Message ID | 20200204151441.10626-1-stewart.hildebrand@dornerworks.com (mailing list archive)
State      | New, archived
Series     | [XEN,v2,1/2] Check zone before merging adjacent blocks in heap

On 04.02.2020 16:14, Stewart Hildebrand wrote:
> From: Jeff Kubascik <jeff.kubascik@dornerworks.com>
>
> The Xen heap is split up into nodes and zones. Each node + zone is
> managed as a separate pool of memory.
>
> When returning pages to the heap, free_heap_pages will check adjacent
> blocks to see if they can be combined into a larger block. However, the
> zone of the adjacent block is not checked. This results in blocks that
> migrate from one zone to another.
>
> When a block migrates to the adjacent zone, the avail counters for the
> old and new node + zone are not updated accordingly. The avail counter
> is used when allocating pages to determine whether to skip over a zone.
> With this behavior, it is possible for free pages to collect in a zone
> with the avail counter smaller than the actual page count, resulting
> in free pages that are not allocable.

"When a block migrates" - fine. But is this situation possible to
occur, without "xen/page_alloc: Keep away MFN 0 from the buddy
allocator" reverted? If not, there's no bug, no need for a change,
and even less so ...

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1462,6 +1462,7 @@ static void free_heap_pages(
>          if ( !mfn_valid(page_to_mfn(predecessor)) ||
>               !page_state_is(predecessor, free) ||
>               (PFN_ORDER(predecessor) != order) ||
> +             (page_to_zone(predecessor) != zone) ||
>               (phys_to_nid(page_to_maddr(predecessor)) != node) )
>              break;
>
> @@ -1485,6 +1486,7 @@ static void free_heap_pages(
>          if ( !mfn_valid(page_to_mfn(successor)) ||
>               !page_state_is(successor, free) ||
>               (PFN_ORDER(successor) != order) ||
> +             (page_to_zone(successor) != zone) ||
>               (phys_to_nid(page_to_maddr(successor)) != node) )
>              break;

... for one that slows down many free operations, even if just
slightly. IOW afaict either the change is not needed, or its
description needs updating.

Jan

On 2/4/20 3:22 PM, Jan Beulich wrote:
> On 04.02.2020 16:14, Stewart Hildebrand wrote:
>> From: Jeff Kubascik <jeff.kubascik@dornerworks.com>
>>
>> The Xen heap is split up into nodes and zones. Each node + zone is
>> managed as a separate pool of memory.
>>
>> When returning pages to the heap, free_heap_pages will check adjacent
>> blocks to see if they can be combined into a larger block. However, the
>> zone of the adjacent block is not checked. This results in blocks that
>> migrate from one zone to another.
>>
>> When a block migrates to the adjacent zone, the avail counters for the
>> old and new node + zone are not updated accordingly. The avail counter
>> is used when allocating pages to determine whether to skip over a zone.
>> With this behavior, it is possible for free pages to collect in a zone
>> with the avail counter smaller than the actual page count, resulting
>> in free pages that are not allocable.
>
> "When a block migrates" - fine. But is this situation possible to
> occur, without "xen/page_alloc: Keep away MFN 0 from the buddy
> allocator" reverted? If not, there's no bug, no need for a change,
> and even less so ...
>
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -1462,6 +1462,7 @@ static void free_heap_pages(
>>          if ( !mfn_valid(page_to_mfn(predecessor)) ||
>>               !page_state_is(predecessor, free) ||
>>               (PFN_ORDER(predecessor) != order) ||
>> +             (page_to_zone(predecessor) != zone) ||
>>               (phys_to_nid(page_to_maddr(predecessor)) != node) )
>>              break;
>>
>> @@ -1485,6 +1486,7 @@ static void free_heap_pages(
>>          if ( !mfn_valid(page_to_mfn(successor)) ||
>>               !page_state_is(successor, free) ||
>>               (PFN_ORDER(successor) != order) ||
>> +             (page_to_zone(successor) != zone) ||
>>               (phys_to_nid(page_to_maddr(successor)) != node) )
>>              break;
>
> ... for one that slows down many free operations, even if just
> slightly. IOW afaict either the change is not needed, or its
> description needs updating.

At very least it's more robust this way; the algorithm is also less
"magic". We just had a long discussion this morning trying to re-create
the logic for why "Remove MFN 0" was sufficient to prevent this issue,
and even then David wasn't sure it was correct at first.

 -George

On Tuesday, February 4, 2020 10:22 AM, Jan Beulich wrote:
>On 04.02.2020 16:14, Stewart Hildebrand wrote:
>> From: Jeff Kubascik <jeff.kubascik@dornerworks.com>
>>
>> The Xen heap is split up into nodes and zones. Each node + zone is
>> managed as a separate pool of memory.
>>
>> When returning pages to the heap, free_heap_pages will check adjacent
>> blocks to see if they can be combined into a larger block. However, the
>> zone of the adjacent block is not checked. This results in blocks that
>> migrate from one zone to another.
>>
>> When a block migrates to the adjacent zone, the avail counters for the
>> old and new node + zone are not updated accordingly. The avail counter
>> is used when allocating pages to determine whether to skip over a zone.
>> With this behavior, it is possible for free pages to collect in a zone
>> with the avail counter smaller than the actual page count, resulting
>> in free pages that are not allocable.
>
>"When a block migrates" - fine. But is this situation possible to
>occur, without "xen/page_alloc: Keep away MFN 0 from the buddy
>allocator" reverted?

No, not as far as I'm aware, though I have not studied this code in
detail so I don't feel fully confident in my "no".

> If not, there's no bug, no need for a change,
>and even less so ...
>
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -1462,6 +1462,7 @@ static void free_heap_pages(
>>          if ( !mfn_valid(page_to_mfn(predecessor)) ||
>>               !page_state_is(predecessor, free) ||
>>               (PFN_ORDER(predecessor) != order) ||
>> +             (page_to_zone(predecessor) != zone) ||
>>               (phys_to_nid(page_to_maddr(predecessor)) != node) )
>>              break;
>>
>> @@ -1485,6 +1486,7 @@ static void free_heap_pages(
>>          if ( !mfn_valid(page_to_mfn(successor)) ||
>>               !page_state_is(successor, free) ||
>>               (PFN_ORDER(successor) != order) ||
>> +             (page_to_zone(successor) != zone) ||
>>               (phys_to_nid(page_to_maddr(successor)) != node) )
>>              break;
>
>... for one that slows down many free operations, even if just
>slightly. IOW afaict either the change is not needed, or its
>description needs updating.

Right. An alternative that wouldn't potentially slow things down in
production builds would be to apply the ASSERT from patch 2. I don't
have any performance metrics regarding exactly how much of a
performance hit this would incur.

Stew

>
>Jan

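Patch 2 of the series is not reproduced on this page. For illustration only, a
debug-build-only variant of the zone check in the spirit Stewart describes might
look like the sketch below; the actual assertion in patch 2 may differ. It reuses
the variables visible in the free_heap_pages() hunks quoted above, and ASSERT()
compiles away in release (NDEBUG) builds, so the hot path is left unchanged:

    /* Sketch: leave the release-build merge condition as it is today, and
     * only verify in debug builds that a merge candidate sits in the same
     * zone as the block being freed. */
    if ( !mfn_valid(page_to_mfn(predecessor)) ||
         !page_state_is(predecessor, free) ||
         (PFN_ORDER(predecessor) != order) ||
         (phys_to_nid(page_to_maddr(predecessor)) != node) )
        break;

    ASSERT(page_to_zone(predecessor) == zone);
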
On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
> At very least it's more robust this way; the algorithm is also less
> "magic". We just had a long discussion this morning trying to re-create
> the logic for why "Remove MFN 0" was sufficient to prevent this issue,
> and even then David wasn't sure it was correct at first.

Right. So the real reason I'm staring hard at this is because I can't
convince myself there aren't places where memory allocated by the boot
allocator is later freed with free_xenheap_pages().

We have a few pieces of code which decide whether to use the boot
allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
the rule is "thou shalt not allocate with boot allocator and free
later" then it's *extremely* fragile and probably being violated —
especially because it actually *works* most of the time, except in some
esoteric corner cases like MFN#0, boot allocations which cross
zones/regions, etc.

So because we want to make that *more* likely by allowing vmap() to
happen earlier, I'd like to clean things up by addressing those corner
cases and making it unconditionally OK to free boot-allocated pages
into the heap.

I think it might be as simple as checking for (first_pg)->count_info == 0
in free_xenheap_pages(). That's quick enough, and if the count_info is
zero then I think it does indicate a boot-allocated page, because pages
from alloc_xenheap_pages() would have PGC_xen_heap set?

It would suffice just to pass such pages to init_heap_pages() instead
of directly to free_heap_pages(), I think. Julien?

The straw man version of that looks a bit like this...

--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
 
     pg = virt_to_page(v);
 
+    /* Pages from the boot allocator need to pass through init_heap_pages() */
+    if ( unlikely(!pg->count_info) )
+    {
+        init_heap_pages(pg, 1 << order);
+        return;
+    }
     for ( i = 0; i < (1u << order); i++ )
         pg[i].count_info &= ~PGC_xen_heap;

On 05.02.2020 10:50, David Woodhouse wrote:
> On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
>> At very least it's more robust this way; the algorithm is also less
>> "magic". We just had a long discussion this morning trying to re-create
>> the logic for why "Remove MFN 0" was sufficient to prevent this issue,
>> and even then David wasn't sure it was correct at first.
>
> Right. So the real reason I'm staring hard at this is because I can't
> convince myself there aren't places where memory allocated by the boot
> allocator is later freed with free_xenheap_pages().
>
> We have a few pieces of code which decide whether to use the boot
> allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
> the rule is "thou shalt not allocate with boot allocator and free
> later" then it's *extremely* fragile and probably being violated —
> especially because it actually *works* most of the time, except in some
> esoteric corner cases like MFN#0, boot allocations which cross
> zones/regions, etc.
>
> So because we want to make that *more* likely by allowing vmap() to
> happen earlier, I'd like to clean things up by addressing those corner
> cases and making it unconditionally OK to free boot-allocated pages
> into the heap.
>
> I think it might be as simple as checking for (first_pg)->count_info == 0
> in free_xenheap_pages(). That's quick enough, and if the count_info is
> zero then I think it does indicate a boot-allocated page, because pages
> from alloc_xenheap_pages() would have PGC_xen_heap set?

They would, but that leaves {alloc,free}_domheap_pages() out of the
picture. I.e. ...

> It would suffice just to pass such pages to init_heap_pages() instead
> of directly to free_heap_pages(), I think. Julien?
>
> The straw man version of that looks a bit like this...
>
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
>
>     pg = virt_to_page(v);
>
> +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> +    if ( unlikely(!pg->count_info) )

... while I think this check may be fine here, no similar one
can be used in free_domheap_pages(), yet pages getting handed
there isn't less likely than ones getting handed to
free_xenheap_pages() (if we already fear mismatch).

Jan

Hi,

On 05/02/2020 09:50, David Woodhouse wrote:
> On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
>> At very least it's more robust this way; the algorithm is also less
>> "magic". We just had a long discussion this morning trying to re-create
>> the logic for why "Remove MFN 0" was sufficient to prevent this issue,
>> and even then David wasn't sure it was correct at first.
>
> Right. So the real reason I'm staring hard at this is because I can't
> convince myself there aren't places where memory allocated by the boot
> allocator is later freed with free_xenheap_pages().
>
> We have a few pieces of code which decide whether to use the boot
> allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
> the rule is "thou shalt not allocate with boot allocator and free
> later" then it's *extremely* fragile and probably being violated —
> especially because it actually *works* most of the time, except in some
> esoteric corner cases like MFN#0, boot allocations which cross
> zones/regions, etc.
>
> So because we want to make that *more* likely by allowing vmap() to
> happen earlier, I'd like to clean things up by addressing those corner
> cases and making it unconditionally OK to free boot-allocated pages
> into the heap.
>
> I think it might be as simple as checking for (first_pg)->count_info == 0
> in free_xenheap_pages(). That's quick enough, and if the count_info is
> zero then I think it does indicate a boot-allocated page, because pages
> from alloc_xenheap_pages() would have PGC_xen_heap set?
>
> It would suffice just to pass such pages to init_heap_pages() instead
> of directly to free_heap_pages(), I think. Julien?
>
> The straw man version of that looks a bit like this...
>
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
>
>     pg = virt_to_page(v);
>
> +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> +    if ( unlikely(!pg->count_info) )

Note that there are two versions of free_xenheap_pages(). This one only
covers the case where the domheap and xenheap are the same.

But you can't use the same trick when xenheap is separated (like on
Arm32) because PGC_xen_heap is not set. So you would be calling
init_heap_pages() every time.

However, I don't like the idea of relying on count_info == 0. Indeed,
there are valid cases where count_info == 0 because it means the page is
inuse (PGC_state_inuse).

It might be best if we introduce a new page state that would be the
default value set in the frametable.

0 may be an option if we assign a different value to PGC_state_inuse but
we would need to check if there are places relying on PGC_state_inuse ==
0. I know that assign_pages() does rely on it, I have sent a patch for
it yesterday.

Another option is to use an all 1s value and initialize the frametable
to all 1s. But I am not entirely sure whether the allocator would be
able to cope with it.

Cheers,

On Wed, 2020-02-05 at 11:02 +0100, Jan Beulich wrote:
> > +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> > +    if ( unlikely(!pg->count_info) )
>
> ... while I think this check may be fine here, no similar one
> can be used in free_domheap_pages(), yet pages getting handed
> there isn't less likely than ones getting handed to
> free_xenheap_pages() (if we already fear mismatch).

Do we care about that?

ICBW but I don't think I've seen a case where boot-allocated pages get
handed to free_domheap_pages() later. I've only seen them handed to
free_xenheap_pages(). These are pages which are mapped to Xen, not
domheap pages.

You are already expected *not* to conflate free_xenheap_pages() and
free_domheap_pages().

I think it should be OK to declare that freeing boot-allocated pages
with free_xenheap_pages() is permitted, but freeing them with
free_domheap_pages() isn't.

(Note my straw man patch didn't fix the CONFIG_SEPARATE_XENHEAP case
and doesn't trivially port to that version of free_xenheap_pages()
because the PGC_xen_heap flag isn't used there. But it's fixable
relatively easily.)

On Wed, 2020-02-05 at 10:22 +0000, Julien Grall wrote:
> Hi,
>
> On 05/02/2020 09:50, David Woodhouse wrote:
> > On Tue, 2020-02-04 at 15:37 +0000, George Dunlap wrote:
> > > At very least it's more robust this way; the algorithm is also less
> > > "magic". We just had a long discussion this morning trying to re-create
> > > the logic for why "Remove MFN 0" was sufficient to prevent this issue,
> > > and even then David wasn't sure it was correct at first.
> >
> > Right. So the real reason I'm staring hard at this is because I can't
> > convince myself there aren't places where memory allocated by the boot
> > allocator is later freed with free_xenheap_pages().
> >
> > We have a few pieces of code which decide whether to use the boot
> > allocator vs. heap based on system_state >= SYS_STATE_boot, and *if*
> > the rule is "thou shalt not allocate with boot allocator and free
> > later" then it's *extremely* fragile and probably being violated —
> > especially because it actually *works* most of the time, except in some
> > esoteric corner cases like MFN#0, boot allocations which cross
> > zones/regions, etc.
> >
> > So because we want to make that *more* likely by allowing vmap() to
> > happen earlier, I'd like to clean things up by addressing those corner
> > cases and making it unconditionally OK to free boot-allocated pages
> > into the heap.
> >
> > I think it might be as simple as checking for (first_pg)->count_info == 0
> > in free_xenheap_pages(). That's quick enough, and if the count_info is
> > zero then I think it does indicate a boot-allocated page, because pages
> > from alloc_xenheap_pages() would have PGC_xen_heap set?
> >
> > It would suffice just to pass such pages to init_heap_pages() instead
> > of directly to free_heap_pages(), I think. Julien?
> >
> > The straw man version of that looks a bit like this...
> >
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -2304,6 +2304,12 @@ void free_xenheap_pages(void *v, unsigned int order)
> >
> >     pg = virt_to_page(v);
> >
> > +    /* Pages from the boot allocator need to pass through init_heap_pages() */
> > +    if ( unlikely(!pg->count_info) )
>
> Note that there are two versions of free_xenheap_pages(). This one only
> covers the case where the domheap and xenheap are the same.
>
> But you can't use the same trick when xenheap is separated (like on
> Arm32) because PGC_xen_heap is not set. So you would be calling
> init_heap_pages() every time.

Right. We'd want to set PGC_xen_heap there too on the corresponding
pages.

> However, I don't like the idea of relying on count_info == 0. Indeed,
> there are valid cases where count_info == 0 because it means the page is
> inuse (PGC_state_inuse).

For xenheap pages, not just domheap pages?

> It might be best if we introduce a new page state that would be the
> default value set in the frametable.

... and which is easily set with memset() so it's fast, which I think
you considered since you were suggesting all 0xFF, but which we haven't
explicitly stated out loud.

The other thing to note is that it's easy to make a default state for
pages which *weren't* given out by the boot allocator, but we don't
have a simple way to do anything to the pages which *were* given out by
the boot allocator, since we don't track those and — fairly much by
definition — we don't have the frametable yet when we start the boot
allocator.

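As a rough illustration of the two "untouched frametable entry" conventions
discussed here (zero-filled versus all-ones), the bulk initialisation might look
like the sketch below. This is not existing Xen code: the size expression is a
stand-in, and the two memset() calls are mutually exclusive alternatives.

    /* Option A: keep 0 as the untouched value and move PGC_state_inuse away
     * from 0, so a zero-filled entry decodes to "never seen by the allocator". */
    memset(frame_table, 0, frametable_entries * sizeof(*frame_table));

    /* Option B: fill with all 1s and have the free paths treat
     * count_info == ~0UL as "never seen by the allocator". */
    memset(frame_table, 0xFF, frametable_entries * sizeof(*frame_table));
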
On 05.02.2020 11:24, David Woodhouse wrote:
> On Wed, 2020-02-05 at 11:02 +0100, Jan Beulich wrote:
>>> +    /* Pages from the boot allocator need to pass through init_heap_pages() */
>>> +    if ( unlikely(!pg->count_info) )
>>
>> ... while I think this check may be fine here, no similar one
>> can be used in free_domheap_pages(), yet pages getting handed
>> there isn't less likely than ones getting handed to
>> free_xenheap_pages() (if we already fear mismatch).
>
> Do we care about that?
>
> ICBW but I don't think I've seen a case where boot-allocated pages get
> handed to free_domheap_pages() later. I've only seen them handed to
> free_xenheap_pages(). These are pages which are mapped to Xen, not
> domheap pages.
>
> You are already expected *not* to conflate free_xenheap_pages() and
> free_domheap_pages().

That's the case now, but with us wanting to get rid of the direct map,
the sole difference will be whether a mapping gets established. There'll
likely be no need for a PGC_xen_heap flag anymore, as we don't need to
tell apart the underlying pages (at least as far as the allocator is
concerned).

> I think it should be OK to declare that freeing boot-allocated pages
> with free_xenheap_pages() is permitted, but freeing them with
> free_domheap_pages() isn't.

I don't think this is going to be a good idea. free_xenheap_pages()
right now expects an address within the direct map to be passed.
alloc_boot_pages(), however, returns an MFN, which may get mapped to
arbitrary linear addresses (or none at all). There's quite a bit more
similarity to alloc_domheap_pages(), as that returns struct page_info *,
which in turn has a reliable translation to/from MFN. Yet, as you say
elsewhere, whether an MFN has an entry in frame_table[] is entirely
unclear, so permitting boot-allocator pages to be freed via
alloc_domheap_pages() nevertheless still doesn't look any better an
idea to me.

Jan

On Wed, 2020-02-05 at 11:49 +0100, Jan Beulich wrote:
> Yet, as you say elsewhere, whether an MFN has an
> entry in frame_table[] is entirely unclear, so permitting boot-allocator
> pages to be freed via alloc_domheap_pages() nevertheless still doesn't
> look any better an idea to me.

Hm, I don't think I said that, did I? That would be a new and exciting
complication.

I think every MFN handed out by the boot allocator *should* have a
corresponding entry in the frame table. All the PDX setup is done for
those pages, just as it is for heap pages. In fact, some of that setup
is *only* done in init_boot_pages() right now, and if page ranges don't
go through the boot allocator and end up being transferred to the heap
in end_boot_allocator(), things (like badpages= on the command line)
don't work right.

On Wed, 2020-02-05 at 10:22 +0000, Julien Grall wrote:
> However, I don't like the idea of relying on count_info == 0. Indeed,
> there are valid cases where count_info == 0 because it means the page is
> inuse (PGC_state_inuse).
>
> It might be best if we introduce a new page state that would be the
> default value set in the frametable.
>
> 0 may be an option if we assign a different value to PGC_state_inuse but
> we would need to check if there are places relying on PGC_state_inuse ==
> 0. I know that assign_pages() does rely on it, I have sent a patch for
> it yesterday.

We'd need an extra bit for that. Perhaps we could recognise that
PGC_broken is not valid in combination with some of the four PGC_state
states, and take that bit back to turn PGC_state into a 3-bit field,
which could hold 8 states of which I think we only currently need 7?

{ zero, inuse, offlining, offlining_broken, offline, offline_broken, free }

> Another option is to use an all 1s value and initialize the frametable
> to all 1s. But I am not entirely sure whether the allocator would be
> able to cope with it.

We'd zero it in init_heap_pages() for every page, and it would only
still be set to all 1s for pages from the boot allocator. As long as
all 1s *is* proven to be an invalid state (I think it is), the same
kind of trick, checking unlikely(pg->count_info == -1) in the free
functions as in my earlier straw man, would probably work.

I think I prefer zero being the "untouched" state though, rather than
all 1s.

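The seven states listed above could be encoded along the following lines. The
names and values are illustrative only (PGC_state_uninitialised is the name the
later messages settle on for the "zero"/untouched state), not the in-tree Xen
definitions:

    /* Sketch: fold PGC_broken into a widened, 3-bit state field. Eight
     * encodings are available; seven are needed, leaving one spare. */
    enum pgc_state {
        PGC_state_uninitialised = 0, /* frametable default, never seen by the allocator */
        PGC_state_inuse,
        PGC_state_offlining,
        PGC_state_offlining_broken,
        PGC_state_offline,
        PGC_state_offline_broken,
        PGC_state_free,
        /* one encoding spare */
    };
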
On 05.02.2020 12:23, David Woodhouse wrote:
> On Wed, 2020-02-05 at 11:49 +0100, Jan Beulich wrote:
>> Yet, as you say elsewhere, whether an MFN has an
>> entry in frame_table[] is entirely unclear, so permitting boot-allocator
>> pages to be freed via alloc_domheap_pages() nevertheless still doesn't
>> look any better an idea to me.
>
> Hm, I don't think I said that, did I? That would be a new and exciting
> complication.
>
> I think every MFN handed out by the boot allocator *should* have a
> corresponding entry in the frame table. All the PDX setup is done for
> those pages, just as it is for heap pages. In fact, some of that setup
> is *only* done in init_boot_pages() right now, and if page ranges don't
> go through the boot allocator and end up being transferred to the heap
> in end_boot_allocator(), things (like badpages= on the command line)
> don't work right.

I guess I should have said "whether an MFN has a properly initialized
entry in frame_table[] is entirely unclear".

Jan

> On 05.02.2020 12:23, David Woodhouse wrote:
>> On Wed, 2020-02-05 at 11:49 +0100, Jan Beulich wrote:
>>> Yet, as you say elsewhere, whether an MFN has an
>>> entry in frame_table[] is entirely unclear, so permitting boot-allocator
>>> pages to be freed via alloc_domheap_pages() nevertheless still doesn't
>>> look any better an idea to me.
>>
>> Hm, I don't think I said that, did I? That would be a new and exciting
>> complication.
>>
>> I think every MFN handed out by the boot allocator *should* have a
>> corresponding entry in the frame table. All the PDX setup is done for
>> those pages, just as it is for heap pages. In fact, some of that setup
>> is *only* done in init_boot_pages() right now, and if page ranges don't
>> go through the boot allocator and end up being transferred to the heap
>> in end_boot_allocator(), things (like badpages= on the command line)
>> don't work right.
>
> I guess I should have said "whether an MFN has a properly initialized
> entry in frame_table[] is entirely unclear".

Aha, right. Yes, I admit to having said that :)

I think we have a viable path to fixing that, by folding PGC_broken into
the state bits so that we can disambiguate. Will experiment.

On Wed, 2020-02-05 at 14:12 +0000, David Woodhouse wrote:
> I think we have a viable path to fixing that, by folding PGC_broken into
> the state bits so that we can disambiguate. Will experiment.

Here, it looks something like this.

First we fold PGC_broken into the state bits, giving us 8 possible
states of which only 6 are currently in use.

Then in the second patch we can move PGC_state_inuse from zero (the
default contents of the frame table at startup), and make a new state
PGC_state_uninitialised with the value zero.

We can make free_xenheap_pages() and free_domheap_pages() call
init_heap_pages() instead of free_heap_pages() if they see a page range
which is still in PGC_state_uninitialised.

We need to fix up a couple of asserts not only to cope with
PGC_state_inuse not being zero (as Julien has already looked at) but
also to take PGC_state_uninitialised pages.

In assign_pages() that's because we map the multiboot module containing
the initramfs to dom0. That might actually cross node boundaries,
contain MFN#0, etc. — so if/when that gets released by dom0 we'd want
those pages to be passed to init_heap_pages() just the same as
boot-allocated memory.

In share_xen_page_with_guest() it happens because we share all non-RAM
page frames with dom0.

(2 patches follow...)

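A rough sketch of the free-path change described in this message, assuming the
new PGC_state_uninitialised state is given the value zero so that untouched
frametable entries decode to it; the real patches may differ in detail:

    void free_xenheap_pages(void *v, unsigned int order)
    {
        struct page_info *pg = virt_to_page(v);

        /* A range still in PGC_state_uninitialised (e.g. handed out by the
         * boot allocator) has never been through init_heap_pages(), so its
         * zone/node bookkeeping and avail counters are not set up yet. */
        if ( unlikely(page_state_is(pg, uninitialised)) )
        {
            init_heap_pages(pg, 1UL << order);
            return;
        }

        /* ... existing free path, ultimately calling free_heap_pages() ... */
    }
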
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 97902d42c1..735049fe3e 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1462,6 +1462,7 @@ static void free_heap_pages(
         if ( !mfn_valid(page_to_mfn(predecessor)) ||
              !page_state_is(predecessor, free) ||
              (PFN_ORDER(predecessor) != order) ||
+             (page_to_zone(predecessor) != zone) ||
              (phys_to_nid(page_to_maddr(predecessor)) != node) )
             break;
 
@@ -1485,6 +1486,7 @@ static void free_heap_pages(
         if ( !mfn_valid(page_to_mfn(successor)) ||
              !page_state_is(successor, free) ||
              (PFN_ORDER(successor) != order) ||
+             (page_to_zone(successor) != zone) ||
              (phys_to_nid(page_to_maddr(successor)) != node) )
             break;