Message ID: 20201121194506.13464-2-aarcange@redhat.com (mailing list archive)
State: New, archived
Series: VM_BUG_ON_PAGE(!zone_spans_pfn) in set_pfnblock_flags_mask
On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
> + if (likely(!PageReserved(page)))
NOTE: this line will have to become "likely(page &&
!PageReserved(page))" to handle the case of non contiguous zones,
since pageblock_pfn_to_page might return NULL in that case. I noticed
right after sending, but I'll wait some feedback before resending the
correction in case it gets fixed in another way.
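
(For reference, this is roughly how the hunk would read with that correction folded in; the surrounding lines are taken from the patch quoted in this thread, so treat it as a sketch rather than the final submission:)

			page = pageblock_pfn_to_page(min_pfn,
				pageblock_end_pfn(min_pfn),
				cc->zone);
			/*
			 * pageblock_pfn_to_page() can return NULL when the
			 * zone is not contiguous, so check the pointer
			 * before dereferencing it in PageReserved().
			 */
			if (likely(page && !PageReserved(page)))
				cc->free_pfn = min_pfn;
			else
				page = NULL;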
On 21.11.20 20:53, Andrea Arcangeli wrote:
> On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
>> +			if (likely(!PageReserved(page)))
>
> NOTE: this line will have to become "likely(page &&
> !PageReserved(page))" to handle the case of non contiguous zones,
> since pageblock_pfn_to_page might return NULL in that case. I noticed
> right after sending, but I'll wait some feedback before resending the
> correction in case it gets fixed in another way.

With that fixed, this looks sane to me.
On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> A corollary issue was fixed in
> e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
> v5.7:
>
> https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
>
> ==
> page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
> flags: 0x1fffe000001000(reserved)
> ==
>
> 73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly the
> second issue, but I still reproduced it twice with v5.9 on two
> different systems:
>
> ==
> page:0000000062b3e92f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x39800
> flags: 0x1000(reserved)
> ==
> page:000000002a7114f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7a200
> flags: 0x1fff000000001000(reserved)
> ==
>
> I actually never reproduced it until v5.9, but it's still the same bug
> as it was reported first for v5.7.
>
> See the page is "reserved" in all 3 cases. In the last two crashes
> with the pfn:
>
> pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:
>
> 39639000-39814fff : Unknown E820 type
>
> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
>
> 7a17b000-7a216fff : Unknown E820 type

It would be nice to also provide a /proc/zoneinfo and how exactly the
"zone_spans_pfn" was violated. I assume we end up below zone's
start_pfn, but is it true?

> This actually seems a false positive bugcheck, the page structures are
> valid and the zones are correct, just it's non-RAM but setting
> pageblockskip should do no harm. However it's possible to solve the
> crash without lifting the bugcheck, by enforcing the invariant that
> the free_pfn cursor doesn't point to reserved pages (which would be
> otherwise implicitly achieved through the PageBuddy check, except in
> the new fast_isolate_around() path).
>
> Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/compaction.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13cb7a961b31..d17e69549d34 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
>  			page = pageblock_pfn_to_page(min_pfn,
>  				pageblock_end_pfn(min_pfn),
>  				cc->zone);
> -			cc->free_pfn = min_pfn;
> +			if (likely(!PageReserved(page)))

PageReserved check seems rather awkward solution to me. Wouldn't it be more
obvious if we made sure we don't end up below zone's start_pfn (if my
assumption is correct) in the first place?

When I check the code:

	unsigned long distance;
	distance = (cc->free_pfn - cc->migrate_pfn);
	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

I think what can happen is that cc->free_pfn <= cc->migrate_pfn after the
very last isolate_migratepages(). Then compact_finished() detects that in
compact_zone(), but only after migrate_pages() and thus
fast_isolate_freepages() is called.

That would mean distance can be negative, or rather a large unsigned number
and low_pfn and min_pfn end up away from the zone?

Or maybe the above doesn't happen, but cc->free_pfn gets so close to start
of the zone, that the calculations above make min_pfn go below start_pfn?

In any case I would rather make sure we stay within the expected zone
boundaries, than play tricks with PageReserved. Mel?

> +				cc->free_pfn = min_pfn;
> +			else
> +				page = NULL;
>  		}
>  	}
>  }
>
On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote: > > I actually never reproduced it until v5.9, but it's still the same bug > > as it was reported first for v5.7. > > > > See the page is "reserved" in all 3 cases. In the last two crashes > > with the pfn: > > > > pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM: > > > > 39639000-39814fff : Unknown E820 type > > > > pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM: > > > > 7a17b000-7a216fff : Unknown E820 type > > It would be nice to also provide a /proc/zoneinfo and how exactly the > "zone_spans_pfn" was violated. I assume we end up below zone's start_pfn, > but is it true? > It must be if zone_spans_pfn is getting triggered. > > This actually seems a false positive bugcheck, the page structures are > > valid and the zones are correct, just it's non-RAM but setting > > pageblockskip should do no harm. However it's possible to solve the > > crash without lifting the bugcheck, by enforcing the invariant that > > the free_pfn cursor doesn't point to reserved pages (which would be > > otherwise implicitly achieved through the PageBuddy check, except in > > the new fast_isolate_around() path). > > > > Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target") > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > > --- > > mm/compaction.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/mm/compaction.c b/mm/compaction.c > > index 13cb7a961b31..d17e69549d34 100644 > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc) > > page = pageblock_pfn_to_page(min_pfn, > > pageblock_end_pfn(min_pfn), > > cc->zone); > > - cc->free_pfn = min_pfn; > > + if (likely(!PageReserved(page))) > > PageReserved check seems rather awkward solution to me. Wouldn't it be more > obvious if we made sure we don't end up below zone's start_pfn (if my > assumption is correct) in the first place? > > When I check the code: > > unsigned long distance; > distance = (cc->free_pfn - cc->migrate_pfn); > low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2)); > min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1)); > > I think what can happen is that cc->free_pfn <= cc->migrate_pfn after the > very last isolate_migratepages(). I would hope that is not the case because they are not meant to overlap. However, if the beginning of the pageblock was not the start of a zone then the pages would be valid but the pfn would still be outside the zone boundary. If it was reserved, the struct page is valid but not suitable for set_pfnblock_flags_mask. However, it is a concern in general because the potential is there that pages are isolated from the wrong zone. > Then compact_finished() detects that in > compact_zone(), but only after migrate_pages() and thus > fast_isolate_freepages() is called. > > That would mean distance can be negative, or rather a large unsigned number > and low_pfn and min_pfn end up away from the zone? > > Or maybe the above doesn't happen, but cc->free_pfn gets so close to start > of the zone, that the calculations above make min_pfn go below start_pfn? > I think the last part is more likely, going below start_pfn > In any case I would rather make sure we stay within the expected zone > boundaries, than play tricks with PageReserved. Mel? 
It would be preferable because this time it's PageReserved that happened
to trip up an assumption in set_pfnblock_flags_mask but if it was a real
zone and real page then compaction is migrating cross-zone which would
be surprising.

Maybe this untested patch?

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..ef1b5dacc289 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1330,6 +1330,10 @@ fast_isolate_freepages(struct compact_control *cc)
 		low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
 		min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
 
+		/* Ensure the PFN is within the zone */
+		low_pfn = max(cc->zone->zone_start_pfn, low_pfn);
+		min_pfn = max(cc->zone->zone_start_pfn, min_pfn);
+
 		if (WARN_ON_ONCE(min_pfn > low_pfn))
 			low_pfn = min_pfn;
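
(For illustration of the "min_pfn goes below start_pfn" scenario, with made-up numbers: take the ZONE_DMA layout reported later in this thread, zone_start_pfn == 1, and the usual x86-64 pageblock size of 512 pages. With cc->migrate_pfn == 1 and cc->free_pfn == 255:)

	distance = 255 - 1 = 254
	min_pfn  = pageblock_start_pfn(255 - (254 >> 1))
	         = round_down(128, 512) = 0	/* below zone_start_pfn == 1 */

(The rounding down to a pageblock boundary is what crosses the zone start even though the midpoint itself is inside the zone; the max() clamp above pulls min_pfn back to zone_start_pfn.)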
Hello,

On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote:
> I would hope that is not the case because they are not meant to overlap.
> However, if the beginning of the pageblock was not the start of a zone
> then the pages would be valid but the pfn would still be outside the
> zone boundary. If it was reserved, the struct page is valid but not
> suitable for set_pfnblock_flags_mask. However, it is a concern in
> general because the potential is there that pages are isolated from the
> wrong zone.

I guess we have more than one issue to correct in that function
because the same BUG_ON reproduced again even with the tentative patch
I posted earlier.

So my guess is that the problematic reserved page isn't pointed by the
min_pfn, but it must have been pointed by the "highest" variable
calculated below?

			if (pfn >= highest)
				highest = pageblock_start_pfn(pfn);

When I looked at where "highest" comes from, it lacks
pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).

Is that the real bug, which may be fixed by something like this? (untested)

==
From 262671e88723b3074251189004ceae39dcd1689d Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 21 Nov 2020 12:55:58 -0500
Subject: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

==
page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
flags: 0x1fffe000001000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix
the second issue, but it still reproduces with v5.9 on two different
systems:

==
page:0000000062b3e92f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x39800
flags: 0x1000(reserved)
raw: 0000000000001000 fffff5b880e60008 fffff5b880e60008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==
page:000000004d32675d refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7a200
flags: 0x1fff000000001000(reserved)
raw: 1fff000000001000 ffffe6c5c1e88008 ffffe6c5c1e88008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==

The page is "reserved" in all cases. In the last two crashes with the
pfn:

pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

The pageblock_pfn_to_page check was missing when rewinding pfn to the
start of the pageblock to calculate the "highest" value. So the
"highest" pfn could point to a non valid pfn or not within the zone,
checking the pageblock_pfn_to_page fixes it.

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 653862aba266..76f061af8f22 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1418,8 +1418,14 @@ fast_isolate_freepages(struct compact_control *cc)
 			nr_scanned++;
 			pfn = page_to_pfn(freepage);
 
-			if (pfn >= highest)
-				highest = pageblock_start_pfn(pfn);
+			if (pfn >= highest) {
+				unsigned long start_pfn, end_pfn;
+				start_pfn = pageblock_start_pfn(pfn);
+				end_pfn = pageblock_end_pfn(start_pfn);
+				if (pageblock_pfn_to_page(start_pfn, end_pfn,
+							  cc->zone))
+					highest = pfn;
+			}
 
 			if (pfn >= low_pfn) {
 				cc->fast_search_fail = 0;
==

Can't we also try to scan in between start_pfn and "pfn" to see if
there's one pfn that passes the pageblock_pfn_to_page test or isn't it
worth it for the fast isolate variant?

> > Then compact_finished() detects that in
> > compact_zone(), but only after migrate_pages() and thus
> > fast_isolate_freepages() is called.
> >
> > That would mean distance can be negative, or rather a large unsigned number
> > and low_pfn and min_pfn end up away from the zone?
> >
> > Or maybe the above doesn't happen, but cc->free_pfn gets so close to start
> > of the zone, that the calculations above make min_pfn go below start_pfn?
> >
>
> I think the last part is more likely, going below start_pfn

Would it help if I dump the whole status of the zone and pages around
those addresses in the two systems that are reproducing this in v5.9
as extra check? I was going to do that right now, to validate all
zone->zone_start_pfn and zone_end_pfn() were correct around that
non-RAM reserved page physical address.

> > In any case I would rather make sure we stay within the expected zone
> > boundaries, than play tricks with PageReserved. Mel?
> >
>
> It would be preferable because this time it's PageReserved that happened
> to trip up an assumption in set_pfnblock_flags_mask but if it was a real
> zone and real page then compaction is migrating cross-zone which would
> be surprising.
>
> Maybe this untested patch?

"highest" is not influenced by either low_pfn or min_pfn so it may very
well be needed, but for another case, I don't think this can help this
specific VM_BUG_ON_PAGE if it's caused by "highest" pfn after all?

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13cb7a961b31..ef1b5dacc289 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1330,6 +1330,10 @@ fast_isolate_freepages(struct compact_control *cc)
>  		low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
>  		min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
>
> +		/* Ensure the PFN is within the zone */
> +		low_pfn = max(cc->zone->zone_start_pfn, low_pfn);
> +		min_pfn = max(cc->zone->zone_start_pfn, min_pfn);
> +
>  		if (WARN_ON_ONCE(min_pfn > low_pfn))
>  			low_pfn = min_pfn;
>
> --
> Mel Gorman
> SUSE Labs
>
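
(For reference, the guard being relied on here is pageblock_pfn_to_page(); roughly, paraphrased from mm/internal.h and mm/compaction.c of that era, so details may differ between versions:)

	static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
					unsigned long end_pfn, struct zone *zone)
	{
		struct page *start_page, *end_page;

		/* end_pfn is one past the last pfn of the pageblock */
		end_pfn--;

		if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
			return NULL;

		start_page = pfn_to_page(start_pfn);

		/* The first page's zone linkage must match the zone we scan */
		if (page_zone(start_page) != zone)
			return NULL;

		end_page = pfn_to_page(end_pfn);

		/* Both ends of the pageblock must agree on the zone */
		if (page_zone_id(start_page) != page_zone_id(end_page))
			return NULL;

		return start_page;
	}

(The pageblock_pfn_to_page() wrapper short-circuits to pfn_to_page() when zone->contiguous is set; the zoneinfo dump later in the thread shows DMA32 with contiguous 0, so the full check does run there.)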
Hello,

On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> > A corollary issue was fixed in
> > 39639000-39814fff : Unknown E820 type
> >
> > pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
> >
> > 7a17b000-7a216fff : Unknown E820 type
>
> It would be nice to also provide a /proc/zoneinfo and how exactly the
> "zone_spans_pfn" was violated. I assume we end up below zone's
> start_pfn, but is it true?

Agreed, I was about to grab that info along with all page struct
around the pfn 0x7a200 and phys address 0x7a216fff.

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM

DMA      zone_start_pfn 1       zone_end_pfn() 4096    contiguous 1
DMA32    zone_start_pfn 4096    zone_end_pfn() 1048576 contiguous 0
Normal   zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1
Movable  zone_start_pfn 0       zone_end_pfn() 0       contiguous 0

500222 0x7a1fe000 0x1fff000000001000 reserved True
500223 0x7a1ff000 0x1fff000000001000 reserved True

# I suspect "highest pfn" was somewhere in the RAM range
# 0x7a217000-0x7a400000 and the pageblock_start_pfn(pfn)
# made highest point to pfn 0x7a200 physaddr 0x7a200000
# below, which is reserved indeed since it's non-RAM
# first number is pfn hex(500224) == 0x7a200

pfn    physaddr   page->flags
500224 0x7a200000 0x1fff000000001000 reserved True
500225 0x7a201000 0x1fff000000001000 reserved True
*snip*
500245 0x7a215000 0x1fff000000001000 reserved True
500246 0x7a216000 0x1fff000000001000 reserved True
500247 0x7a217000 0x3fff000000000000 reserved False
500248 0x7a218000 0x3fff000000000000 reserved False

All RAM pages non-reserved are starting at 0x7a217000 as expected.

The non-RAM page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA and
page_zone(page) below was the DMA zone despite the pfn of 0x7a200
being in DMA32.

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

So the patch I sent earlier should prevent the above BUG_ON by not
setting highest to 0x7a200 when pfn is in the phys RAM range
0x7a217000-0x7a400000, because pageblock_pfn_to_page will notice that
the zone is the wrong one.

	if (page_zone(start_page) != zone)
		return NULL;

However the real bug seems that reserved pages have a zero zone_id in
the page->flags when it should have the real zone id/nid. The patch I
sent earlier to validate highest would only be needed to deal with
pfn_valid.

Something must have changed more recently than v5.1 that caused the
zoneid of reserved pages to be wrong, a possible candidate for the
real culprit would be this change below:

+		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);

Even if it may not be it, in light of how the reserved page zoneid/nid
initialization went wrong, the above line looks too flakey to stay.

It'd be preferable if the pfn_valid fails and the
pfn_to_section_nr(pfn) returns an invalid section for the intermediate
step. Even better memset 0xff over the whole page struct until the
second stage comes around.

Whenever pfn_valid is true, it's better that the zoneid/nid is correct
all times, otherwise if the second stage fails we end up in a bug with
weird side effects.

Maybe it's not the above that left a zero zoneid though, I haven't
tried to bisect it yet to look at how the page->flags looked on an
older kernel that didn't seem to reproduce this crash, I'm just
guessing.

Thanks,
Andrea
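
(For readers decoding the page->flags values above: the node and zone of a page live in the topmost bits of page->flags and are read back roughly like this, simplified from include/linux/mm.h:)

	static inline enum zone_type page_zonenum(const struct page *page)
	{
		return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
	}

	static inline int page_to_nid(const struct page *page)
	{
		return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
	}

	static inline struct zone *page_zone(const struct page *page)
	{
		return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
	}

(So a reserved hole page whose node/zone bits were left at zero decodes to node 0 / ZONE_DMA regardless of where the pfn actually sits, which is exactly what the dump shows for pfn 0x7a200.)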
> Am 25.11.2020 um 06:34 schrieb Andrea Arcangeli <aarcange@redhat.com>: > > Hello, > >> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote: >>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote: >>> A corollary issue was fixed in >>> 39639000-39814fff : Unknown E820 type >>> >>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM: >>> >>> 7a17b000-7a216fff : Unknown E820 type >> >> It would be nice to also provide a /proc/zoneinfo and how exactly the >> "zone_spans_pfn" was violated. I assume we end up below zone's >> start_pfn, but is it true? > > Agreed, I was about to grab that info along with all page struct > around the pfn 0x7a200 and phys address 0x7a216fff. > > # grep -A1 E820 /proc/iomem > 7a17b000-7a216fff : Unknown E820 type > 7a217000-7bffffff : System RAM > > DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 > DMA32 zone_start_pfn 4096 zone_end_pfn() 1048576 contiguous 0 > Normal zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1 > Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 > > 500222 0x7a1fe000 0x1fff000000001000 reserved True > 500223 0x7a1ff000 0x1fff000000001000 reserved True > > # I suspect "highest pfn" was somewhere in the RAM range > # 0x7a217000-0x7a400000 and the pageblock_start_pfn(pfn) > # made highest point to pfn 0x7a200 physaddr 0x7a200000 > # below, which is reserved indeed since it's non-RAM > # first number is pfn hex(500224) == 0x7a200 > > pfn physaddr page->flags > 500224 0x7a200000 0x1fff000000001000 reserved True > 500225 0x7a201000 0x1fff000000001000 reserved True > *snip* > 500245 0x7a215000 0x1fff000000001000 reserved True > 500246 0x7a216000 0x1fff000000001000 reserved True > 500247 0x7a217000 0x3fff000000000000 reserved False > 500248 0x7a218000 0x3fff000000000000 reserved False > > All RAM pages non-reserved are starting at 0x7a217000 as expected. > > The non-RAM page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA and > page_zone(page) below was the DMA zone despite the pfn of 0x7a200 is > in DMA32. > > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); > > So the patch I sent earlier should prevent the above BUG_ON by not > setting highest to 0x7a200 when pfn is in the phys RAM range > 0x7a217000-0x7a400000, because pageblock_pfn_to_page will notice that > the zone is the wrong one. > > if (page_zone(start_page) != zone) > return NULL; > > However the real bug seems that reserved pages have a zero zone_id in > the page->flags when it should have the real zone id/nid. The patch I > sent earlier to validate highest would only be needed to deal with > pfn_valid. > > Something must have changed more recently than v5.1 that caused the > zoneid of reserved pages to be wrong, a possible candidate for the > real would be this change below: > > + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues. Most pfn walkers shouldn‘t mess with reserved pages and simply skip them. That would be the right fix here. > Even if it may not be it, at the light of how the reserved page > zoneid/nid initialized went wrong, the above line like it's too flakey > to stay. > > It'd be preferable if the pfn_valid fails and the > pfn_to_section_nr(pfn) returns an invalid section for the intermediate > step. Even better memset 0xff over the whole page struct until the > second stage comes around. I recently discussed with Baoquan to 1. 
Using a new pagetype to mark memory holes 2. Using special nid/zid to mark memory holes Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON. > > Whenever pfn_valid is true, it's better that the zoneid/nid is correct > all times, otherwise if the second stage fails we end up in a bug with > weird side effects. Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no? > > Maybe it's not the above that left a zero zoneid though, I haven't > tried to bisect it yet to look how the page->flags looked like on a > older kernel that didn't seem to reproduce this crash, I'm just > guessing. > > Thanks, > Andrea
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > > > Am 25.11.2020 um 06:34 schrieb Andrea Arcangeli <aarcange@redhat.com>: > > > > Hello, > > > >> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote: > >>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote: > >>> A corollary issue was fixed in > >>> 39639000-39814fff : Unknown E820 type > >>> > >>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM: > >>> > >>> 7a17b000-7a216fff : Unknown E820 type > >> > >> It would be nice to also provide a /proc/zoneinfo and how exactly the > >> "zone_spans_pfn" was violated. I assume we end up below zone's > >> start_pfn, but is it true? > > > > Agreed, I was about to grab that info along with all page struct > > around the pfn 0x7a200 and phys address 0x7a216fff. > > > > # grep -A1 E820 /proc/iomem > > 7a17b000-7a216fff : Unknown E820 type > > 7a217000-7bffffff : System RAM > > > > DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 > > DMA32 zone_start_pfn 4096 zone_end_pfn() 1048576 contiguous 0 > > Normal zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1 > > Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 > > > > 500222 0x7a1fe000 0x1fff000000001000 reserved True > > 500223 0x7a1ff000 0x1fff000000001000 reserved True > > > > # I suspect "highest pfn" was somewhere in the RAM range > > # 0x7a217000-0x7a400000 and the pageblock_start_pfn(pfn) > > # made highest point to pfn 0x7a200 physaddr 0x7a200000 > > # below, which is reserved indeed since it's non-RAM > > # first number is pfn hex(500224) == 0x7a200 > > > > pfn physaddr page->flags > > 500224 0x7a200000 0x1fff000000001000 reserved True > > 500225 0x7a201000 0x1fff000000001000 reserved True > > *snip* > > 500245 0x7a215000 0x1fff000000001000 reserved True > > 500246 0x7a216000 0x1fff000000001000 reserved True > > 500247 0x7a217000 0x3fff000000000000 reserved False > > 500248 0x7a218000 0x3fff000000000000 reserved False > > > > All RAM pages non-reserved are starting at 0x7a217000 as expected. > > > > The non-RAM page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA and > > page_zone(page) below was the DMA zone despite the pfn of 0x7a200 is > > in DMA32. > > > > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); > > > > So the patch I sent earlier should prevent the above BUG_ON by not > > setting highest to 0x7a200 when pfn is in the phys RAM range > > 0x7a217000-0x7a400000, because pageblock_pfn_to_page will notice that > > the zone is the wrong one. > > > > if (page_zone(start_page) != zone) > > return NULL; > > > > However the real bug seems that reserved pages have a zero zone_id in > > the page->flags when it should have the real zone id/nid. The patch I > > sent earlier to validate highest would only be needed to deal with > > pfn_valid. > > > > Something must have changed more recently than v5.1 that caused the > > zoneid of reserved pages to be wrong, a possible candidate for the > > real would be this change below: > > > > + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > > > > Before that change, the memmap of memory holes were only zeroed out. > So the zones/nid was 0, however, pages were not reserved and had a > refcount of zero - resulting in other issues. > > Most pfn walkers shouldn‘t mess with reserved pages and simply skip > them. That would be the right fix here. 
My guess would be that it is me and Baoquan:

73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
that check each PFN")

Until then reserved pages were traversed in memmap_init_zone() and
after the change they are not because on x86 reserved memory is not
considered memory for some reason.

Can you please check if this untested patch helps:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..be3c85a6714e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6191,7 +6191,9 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 {
 	unsigned long start_pfn, end_pfn;
 	unsigned long range_end_pfn = range_start_pfn + size;
+	phys_addr_t start, end;
 	int i;
+	u64 j;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
@@ -6203,6 +6205,19 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
 	}
+
+	__for_each_mem_range(j, &memblock.reserved, NULL, nid, MEMBLOCK_NONE,
+			     &start, &end, NULL) {
+		start_pfn = clamp(PHYS_PFN(start), range_start_pfn,
+				  range_end_pfn);
+		end_pfn = clamp(PHYS_PFN(end), range_start_pfn, range_end_pfn);
+
+		if (end_pfn > start_pfn) {
+			size = end_pfn - start_pfn;
+			memmap_init_zone(size, nid, zone, start_pfn,
+					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+		}
+	}
 }
 
 static int zone_batchsize(struct zone *zone)

> > Even if it may not be it, at the light of how the reserved page
> > zoneid/nid initialized went wrong, the above line like it's too flakey
> > to stay.
> >
> > It'd be preferable if the pfn_valid fails and the
> > pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> > step. Even better memset 0xff over the whole page struct until the
> > second stage comes around.
>
> I recently discussed with Baoquan to
> 1. Using a new pagetype to mark memory holes
> 2. Using special nid/zid to mark memory holes
>
> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.
>
> >
> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
>
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
>
> >
> > Maybe it's not the above that left a zero zoneid though, I haven't
> > tried to bisect it yet to look how the page->flags looked like on a
> > older kernel that didn't seem to reproduce this crash, I'm just
> > guessing.
> >
> > Thanks,
> > Andrea
>
On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote: > Hello, > > On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote: > > I would hope that is not the case because they are not meant to overlap. > > However, if the beginning of the pageblock was not the start of a zone > > then the pages would be valid but the pfn would still be outside the > > zone boundary. If it was reserved, the struct page is valid but not > > suitable for set_pfnblock_flags_mask. However, it is a concern in > > general because the potential is there that pages are isolated from the > > wrong zone. > > I guess we have more than one issue to correct in that function > because the same BUG_ON reproduced again even with the tentative patch > I posted earlier. > > So my guess is that the problematic reserved page isn't pointed by the > min_pfn, but it must have been pointed by the "highest" variable > calculated below? > > if (pfn >= highest) > highest = pageblock_start_pfn(pfn); > > When I looked at where "highest" comes from, it lacks > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn). > > Is that the real bug, which may be fixed by something like this? (untested) > It's plausible as it is a potential source of leaking but as you note in another mail, it's surprising to me that valid struct pages, even if within memory holes and reserved would have broken node/zone information in the page flags.
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> > Something must have changed more recently than v5.1 that caused the
> > zoneid of reserved pages to be wrong, a possible candidate for the
> > real would be this change below:
> >
> > +		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> >
>
> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
>
> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
>

Ordinarily yes, pfn walkers should not care about reserved pages but it's
still surprising that the node/zone linkages would be wrong for memory
holes. If they are in the middle of a zone, it means that a hole with
valid struct pages could be mistaken for overlapping nodes (if the hole
was in node 1 for example) or overlapping zones which is just broken.

> >
> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
>
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
>

It would partially paper over the issue that setting the pageblock type
based on a reserved page. I agree that compaction should not be returning
pfns that are outside of the zone range because that is buggy in itself
but valid struct pages should have valid information. I don't think we
want to paper over that with unnecessary PageReserved checks.
On 25.11.20 11:39, Mel Gorman wrote: > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: >>> Something must have changed more recently than v5.1 that caused the >>> zoneid of reserved pages to be wrong, a possible candidate for the >>> real would be this change below: >>> >>> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); >>> >> >> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues. >> >> Most pfn walkers shouldn???t mess with reserved pages and simply skip them. That would be the right fix here. >> > > Ordinarily yes, pfn walkers should not care about reserved pages but it's > still surprising that the node/zone linkages would be wrong for memory > holes. If they are in the middle of a zone, it means that a hole with > valid struct pages could be mistaken for overlapping nodes (if the hole > was in node 1 for example) or overlapping zones which is just broken. I agree within zones - but AFAIU, the issue is reserved memory between zones, right? Assume your end of memory falls within a section - what would be the right node/zone for such a memory hole at the end of the section? With memory hotplug after such a hole, we can easily have multiple nodes/zones spanning such a hole, unknown before hotplug. IMHO, marking memory holes properly (as discussed) would be the cleanest approach. For now, we use node/zone 0 + PageReserved - because memory hotunplug (zone shrinking etc.) doesn't really care about ZONE_DMA. > >>> >>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct >>> all times, otherwise if the second stage fails we end up in a bug with >>> weird side effects. >> >> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no? >> > > It would partially paper over the issue that setting the pageblock type > based on a reserved page. I agree that compaction should not be returning > pfns that are outside of the zone range because that is buggy in itself > but valid struct pages should have valid information. I don't think we > want to paper over that with unnecessary PageReserved checks. Agreed as long as we can handle that issue using range checks.
On 25.11.20 12:04, David Hildenbrand wrote: > On 25.11.20 11:39, Mel Gorman wrote: >> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: >>>> Something must have changed more recently than v5.1 that caused the >>>> zoneid of reserved pages to be wrong, a possible candidate for the >>>> real would be this change below: >>>> >>>> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); >>>> >>> >>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues. >>> >>> Most pfn walkers shouldn???t mess with reserved pages and simply skip them. That would be the right fix here. >>> >> >> Ordinarily yes, pfn walkers should not care about reserved pages but it's >> still surprising that the node/zone linkages would be wrong for memory >> holes. If they are in the middle of a zone, it means that a hole with >> valid struct pages could be mistaken for overlapping nodes (if the hole >> was in node 1 for example) or overlapping zones which is just broken. > > I agree within zones - but AFAIU, the issue is reserved memory between > zones, right? Double checking, I was confused. This applies also to memory holes within zones in x86.
On 11/25/20 6:34 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
>> > A corollary issue was fixed in
>> > 39639000-39814fff : Unknown E820 type
>> >
>> > pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
>> >
>> > 7a17b000-7a216fff : Unknown E820 type
>>
>> It would be nice to also provide a /proc/zoneinfo and how exactly the
>> "zone_spans_pfn" was violated. I assume we end up below zone's
>> start_pfn, but is it true?
>
> Agreed, I was about to grab that info along with all page struct
> around the pfn 0x7a200 and phys address 0x7a216fff.
>
> # grep -A1 E820 /proc/iomem
> 7a17b000-7a216fff : Unknown E820 type
> 7a217000-7bffffff : System RAM
>
> DMA      zone_start_pfn 1       zone_end_pfn() 4096    contiguous 1
> DMA32    zone_start_pfn 4096    zone_end_pfn() 1048576 contiguous 0
> Normal   zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1
> Movable  zone_start_pfn 0       zone_end_pfn() 0       contiguous 0

So the above means that around the "Unknown E820 type" we have:

pfn 499712 - start of pageblock in ZONE_DMA32
pfn 500091 - start of the "Unknown E820 type" range
pfn 500224 - start of another pageblock
pfn 500246 - end of "Unknown E820 type"

So this is indeed not a zone boundary issue, but basically a hole not
aligned to pageblock boundary and really unexpected.
We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures
that do this, and even that config only affects pfn_valid_within(). But
here pfn_valid() is true, but the zone/node linkage is unexpected.

> However the real bug seems that reserved pages have a zero zone_id in
> the page->flags when it should have the real zone id/nid. The patch I
> sent earlier to validate highest would only be needed to deal with
> pfn_valid.
>
> Something must have changed more recently than v5.1 that caused the
> zoneid of reserved pages to be wrong, a possible candidate for the
> real would be this change below:
>
> +		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
>
> Even if it may not be it, at the light of how the reserved page
> zoneid/nid initialized went wrong, the above line like it's too flakey
> to stay.
>
> It'd be preferable if the pfn_valid fails and the
> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> step. Even better memset 0xff over the whole page struct until the
> second stage comes around.
>
> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> all times, otherwise if the second stage fails we end up in a bug with
> weird side effects.

Yeah I guess it would be simpler if zoneid/nid was correct for
pfn_valid() pfns within a zone's range, even if they are reserved due
to not being really usable memory.

I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the
chosen solution is to make this a real hole, the hole should be
extended to MAX_ORDER_NR_PAGES aligned boundaries. In any case,
compaction code can't fix this with better range checks.

> Maybe it's not the above that left a zero zoneid though, I haven't
> tried to bisect it yet to look how the page->flags looked like on a
> older kernel that didn't seem to reproduce this crash, I'm just
> guessing.
>
> Thanks,
> Andrea
>
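
(The pageblock numbers above follow from the usual x86-64 geometry, assuming 4 KiB pages and pageblock_nr_pages == 512:)

	round_down(500091, 512) = 499712	/* pageblock holding the start of the hole */
	round_down(500246, 512) = 500224	/* pageblock holding the end of the hole   */

(So the hole straddles the pageblock boundary at pfn 500224 == 0x7a200, and pageblock_start_pfn() of a valid RAM pfn just above the hole, e.g. 0x7a217, lands on a reserved page whose node/zone bits were never initialized.)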
On 25.11.20 13:08, Vlastimil Babka wrote: > On 11/25/20 6:34 AM, Andrea Arcangeli wrote: >> Hello, >> >> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote: >>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote: >>>> A corollary issue was fixed in >>>> 39639000-39814fff : Unknown E820 type >>>> >>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM: >>>> >>>> 7a17b000-7a216fff : Unknown E820 type >>> >>> It would be nice to also provide a /proc/zoneinfo and how exactly the >>> "zone_spans_pfn" was violated. I assume we end up below zone's >>> start_pfn, but is it true? >> >> Agreed, I was about to grab that info along with all page struct >> around the pfn 0x7a200 and phys address 0x7a216fff. >> >> # grep -A1 E820 /proc/iomem >> 7a17b000-7a216fff : Unknown E820 type >> 7a217000-7bffffff : System RAM >> >> DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 >> DMA32 zone_start_pfn 4096 zone_end_pfn() 1048576 contiguous 0 >> Normal zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1 >> Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 > > So the above means that around the "Unknown E820 type" we have: > > pfn 499712 - start of pageblock in ZONE_DMA32 > pfn 500091 - start of the "Unknown E820 type" range > pfn 500224 - start of another pageblock > pfn 500246 - end of "Unknown E820 type" > > So this is indeed not a zone boundary issue, but basically a hole not > aligned to pageblock boundary and really unexpected. > We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures > that do this, and even that config only affects pfn_valid_within(). But > here pfn_valid() is true, but the zone/node linkage is unexpected. > >> However the real bug seems that reserved pages have a zero zone_id in >> the page->flags when it should have the real zone id/nid. The patch I >> sent earlier to validate highest would only be needed to deal with >> pfn_valid. >> >> Something must have changed more recently than v5.1 that caused the >> zoneid of reserved pages to be wrong, a possible candidate for the >> real would be this change below: >> >> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); >> >> Even if it may not be it, at the light of how the reserved page >> zoneid/nid initialized went wrong, the above line like it's too flakey >> to stay. >> >> It'd be preferable if the pfn_valid fails and the >> pfn_to_section_nr(pfn) returns an invalid section for the intermediate >> step. Even better memset 0xff over the whole page struct until the >> second stage comes around. >> >> Whenever pfn_valid is true, it's better that the zoneid/nid is correct >> all times, otherwise if the second stage fails we end up in a bug with >> weird side effects. > > Yeah I guess it would be simpler if zoneid/nid was correct for > pfn_valid() pfns within a zone's range, even if they are reserved due > not not being really usable memory. > > I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the > chosen solution is to make this to a real hole, the hole should be > extended to MAX_ORDER_NR_PAGES aligned boundaries. As we don't punch out pages of the memmap on x86-64, pfn_valid() keeps working as expected. There is a memmap that can be accessed and that was initialized. It's really just a matter of how to handle memory holes in this scenario. a) Try initializing them to the covering node/zone (I gave one example that might be tricky with hotplug) b) Mark such pages (either special node/zone or pagetype) and make pfn walkers ignore these holes. 
For now, this can only be done via the reserved flag.
On Wed, Nov 25, 2020 at 12:04:15PM +0100, David Hildenbrand wrote: > On 25.11.20 11:39, Mel Gorman wrote: > > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > >>> Something must have changed more recently than v5.1 that caused the > >>> zoneid of reserved pages to be wrong, a possible candidate for the > >>> real would be this change below: > >>> > >>> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > >>> > >> > >> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues. > >> > >> Most pfn walkers shouldn???t mess with reserved pages and simply skip them. That would be the right fix here. > >> > > > > Ordinarily yes, pfn walkers should not care about reserved pages but it's > > still surprising that the node/zone linkages would be wrong for memory > > holes. If they are in the middle of a zone, it means that a hole with > > valid struct pages could be mistaken for overlapping nodes (if the hole > > was in node 1 for example) or overlapping zones which is just broken. > > I agree within zones - but AFAIU, the issue is reserved memory between > zones, right? > It can also occur in the middle of the zone. > Assume your end of memory falls within a section - what would be the > right node/zone for such a memory hole at the end of the section? Assuming a hole is not MAX_ORDER-aligned but there is real memory within the page block, then the node/zone for the struct pages backing the hole should match the real memorys node and zone. As it stands, with the uninitialised node/zone, certain checks like page_is_buddy(): page_zone_id(page) != page_zone_id(buddy) may only work by co-incidence. page_is_buddy() happens to work anyway because PageBuddy(buddy) would never be true for a PageReserved page. > With > memory hotplug after such a hole, we can easily have multiple > nodes/zones spanning such a hole, unknown before hotplug. > When hotplugged, the same logic would apply. Where the hole is not aligned, the struct page linkages should match the "real" memory". > > It would partially paper over the issue that setting the pageblock type > > based on a reserved page. I agree that compaction should not be returning > > pfns that are outside of the zone range because that is buggy in itself > > but valid struct pages should have valid information. I don't think we > > want to paper over that with unnecessary PageReserved checks. > > Agreed as long as we can handle that issue using range checks. > I think it'll be ok as long as the struct pages within a 1<<(MAX_ORDER-1) range have proper linkages.
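
(The check Mel refers to is roughly this, paraphrased from mm/page_alloc.c around v5.9; page_order() was later renamed to buddy_order():)

	static inline bool page_is_buddy(struct page *page, struct page *buddy,
					 unsigned int order)
	{
		if (!page_is_guard(buddy) && !PageBuddy(buddy))
			return false;

		if (page_order(buddy) != order)
			return false;

		/*
		 * zone check is done late to avoid uselessly calculating
		 * zone/node ids for pages that could never merge.
		 */
		if (page_zone_id(page) != page_zone_id(buddy))
			return false;

		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);

		return true;
	}

(A reserved hole page never has PageBuddy set, so the page_zone_id() comparison against its stale flags is simply never reached - which is the "works by co-incidence" above.)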
On 25.11.20 14:33, Mel Gorman wrote: > On Wed, Nov 25, 2020 at 12:04:15PM +0100, David Hildenbrand wrote: >> On 25.11.20 11:39, Mel Gorman wrote: >>> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: >>>>> Something must have changed more recently than v5.1 that caused the >>>>> zoneid of reserved pages to be wrong, a possible candidate for the >>>>> real would be this change below: >>>>> >>>>> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); >>>>> >>>> >>>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues. >>>> >>>> Most pfn walkers shouldn???t mess with reserved pages and simply skip them. That would be the right fix here. >>>> >>> >>> Ordinarily yes, pfn walkers should not care about reserved pages but it's >>> still surprising that the node/zone linkages would be wrong for memory >>> holes. If they are in the middle of a zone, it means that a hole with >>> valid struct pages could be mistaken for overlapping nodes (if the hole >>> was in node 1 for example) or overlapping zones which is just broken. >> >> I agree within zones - but AFAIU, the issue is reserved memory between >> zones, right? >> > > It can also occur in the middle of the zone. > >> Assume your end of memory falls within a section - what would be the >> right node/zone for such a memory hole at the end of the section? > > Assuming a hole is not MAX_ORDER-aligned but there is real memory within > the page block, then the node/zone for the struct pages backing the hole > should match the real memorys node and zone. > > As it stands, with the uninitialised node/zone, certain checks like > page_is_buddy(): page_zone_id(page) != page_zone_id(buddy) may only > work by co-incidence. page_is_buddy() happens to work anyway because > PageBuddy(buddy) would never be true for a PageReserved page. > >> With >> memory hotplug after such a hole, we can easily have multiple >> nodes/zones spanning such a hole, unknown before hotplug. >> > > When hotplugged, the same logic would apply. Where the hole is not aligned, > the struct page linkages should match the "real" memory". > >>> It would partially paper over the issue that setting the pageblock type >>> based on a reserved page. I agree that compaction should not be returning >>> pfns that are outside of the zone range because that is buggy in itself >>> but valid struct pages should have valid information. I don't think we >>> want to paper over that with unnecessary PageReserved checks. >> >> Agreed as long as we can handle that issue using range checks. >> > > I think it'll be ok as long as the struct pages within a 1<<(MAX_ORDER-1) > range have proper linkages. Agreed.
On Wed, Nov 25, 2020 at 02:32:02PM +0100, David Hildenbrand wrote: > On 25.11.20 13:08, Vlastimil Babka wrote: > > On 11/25/20 6:34 AM, Andrea Arcangeli wrote: > >> Hello, > >> > >> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote: > >>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote: > >>>> A corollary issue was fixed in > >>>> 39639000-39814fff : Unknown E820 type > >>>> > >>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM: > >>>> > >>>> 7a17b000-7a216fff : Unknown E820 type > >>> > >>> It would be nice to also provide a /proc/zoneinfo and how exactly the > >>> "zone_spans_pfn" was violated. I assume we end up below zone's > >>> start_pfn, but is it true? > >> > >> Agreed, I was about to grab that info along with all page struct > >> around the pfn 0x7a200 and phys address 0x7a216fff. > >> > >> # grep -A1 E820 /proc/iomem > >> 7a17b000-7a216fff : Unknown E820 type > >> 7a217000-7bffffff : System RAM > >> > >> DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 > >> DMA32 zone_start_pfn 4096 zone_end_pfn() 1048576 contiguous 0 > >> Normal zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1 > >> Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 > > > > So the above means that around the "Unknown E820 type" we have: > > > > pfn 499712 - start of pageblock in ZONE_DMA32 > > pfn 500091 - start of the "Unknown E820 type" range > > pfn 500224 - start of another pageblock > > pfn 500246 - end of "Unknown E820 type" > > > > So this is indeed not a zone boundary issue, but basically a hole not > > aligned to pageblock boundary and really unexpected. > > We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures > > that do this, and even that config only affects pfn_valid_within(). But > > here pfn_valid() is true, but the zone/node linkage is unexpected. > > > >> However the real bug seems that reserved pages have a zero zone_id in > >> the page->flags when it should have the real zone id/nid. The patch I > >> sent earlier to validate highest would only be needed to deal with > >> pfn_valid. > >> > >> Something must have changed more recently than v5.1 that caused the > >> zoneid of reserved pages to be wrong, a possible candidate for the > >> real would be this change below: > >> > >> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > >> > >> Even if it may not be it, at the light of how the reserved page > >> zoneid/nid initialized went wrong, the above line like it's too flakey > >> to stay. > >> > >> It'd be preferable if the pfn_valid fails and the > >> pfn_to_section_nr(pfn) returns an invalid section for the intermediate > >> step. Even better memset 0xff over the whole page struct until the > >> second stage comes around. > >> > >> Whenever pfn_valid is true, it's better that the zoneid/nid is correct > >> all times, otherwise if the second stage fails we end up in a bug with > >> weird side effects. > > > > Yeah I guess it would be simpler if zoneid/nid was correct for > > pfn_valid() pfns within a zone's range, even if they are reserved due > > not not being really usable memory. > > > > I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the > > chosen solution is to make this to a real hole, the hole should be > > extended to MAX_ORDER_NR_PAGES aligned boundaries. > > As we don't punch out pages of the memmap on x86-64, pfn_valid() keeps > working as expected. There is a memmap that can be accessed and that was > initialized. 
I suspect that memmap for the reserved pages is not properly initialized after recent changes in free_area_init(). They are cleared at init_unavailable_mem() to have zone=0 and node=0, but they seem to be never re-initialized with proper zone and node links which was not the case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN"). Back then, memmap_init_zone() looped from zone_start_pfn till zone_end_pfn and struct page for reserved pages with pfns inside the zone would be initialized. Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with memblock.memory and for x86 reserved ranges are not in memblock.memory, so the memmap for them remains semi-initialized. > It's really just a matter of how to handle memory holes in > this scenario. > > a) Try initializing them to the covering node/zone (I gave one example > that might be tricky with hotplug) > b) Mark such pages (either special node/zone or pagetype) and make pfn > walkers ignore these holes. For now, this can only be done via the > reserved flag. > > -- > Thanks, > > David / dhildenb >
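
(A sketch of the behavioural difference Mike describes, not actual kernel code; memmap_init_zone() arguments are abbreviated:)

	/*
	 * Before 73a6e474cb37: every pfn spanned by the zone was initialized,
	 * holes included, so they got the zone's zone_idx/nid.
	 */
	for (pfn = zone_start_pfn; pfn < zone_end_pfn; pfn++)
		__init_single_page(pfn_to_page(pfn), pfn, zone_idx, nid);

	/*
	 * After: only the intersections with memblock.memory are walked, so
	 * E820-reserved holes keep the zone=0/node=0 left by
	 * init_unavailable_mem().
	 */
	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL)
		memmap_init_zone(end_pfn - start_pfn, nid, zone_idx, start_pfn, ...);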
On 25.11.20 15:13, Mike Rapoport wrote: > On Wed, Nov 25, 2020 at 02:32:02PM +0100, David Hildenbrand wrote: >> On 25.11.20 13:08, Vlastimil Babka wrote: >>> On 11/25/20 6:34 AM, Andrea Arcangeli wrote: >>>> Hello, >>>> >>>> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote: >>>>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote: >>>>>> A corollary issue was fixed in >>>>>> 39639000-39814fff : Unknown E820 type >>>>>> >>>>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM: >>>>>> >>>>>> 7a17b000-7a216fff : Unknown E820 type >>>>> >>>>> It would be nice to also provide a /proc/zoneinfo and how exactly the >>>>> "zone_spans_pfn" was violated. I assume we end up below zone's >>>>> start_pfn, but is it true? >>>> >>>> Agreed, I was about to grab that info along with all page struct >>>> around the pfn 0x7a200 and phys address 0x7a216fff. >>>> >>>> # grep -A1 E820 /proc/iomem >>>> 7a17b000-7a216fff : Unknown E820 type >>>> 7a217000-7bffffff : System RAM >>>> >>>> DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 >>>> DMA32 zone_start_pfn 4096 zone_end_pfn() 1048576 contiguous 0 >>>> Normal zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1 >>>> Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 >>> >>> So the above means that around the "Unknown E820 type" we have: >>> >>> pfn 499712 - start of pageblock in ZONE_DMA32 >>> pfn 500091 - start of the "Unknown E820 type" range >>> pfn 500224 - start of another pageblock >>> pfn 500246 - end of "Unknown E820 type" >>> >>> So this is indeed not a zone boundary issue, but basically a hole not >>> aligned to pageblock boundary and really unexpected. >>> We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures >>> that do this, and even that config only affects pfn_valid_within(). But >>> here pfn_valid() is true, but the zone/node linkage is unexpected. >>> >>>> However the real bug seems that reserved pages have a zero zone_id in >>>> the page->flags when it should have the real zone id/nid. The patch I >>>> sent earlier to validate highest would only be needed to deal with >>>> pfn_valid. >>>> >>>> Something must have changed more recently than v5.1 that caused the >>>> zoneid of reserved pages to be wrong, a possible candidate for the >>>> real would be this change below: >>>> >>>> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); >>>> >>>> Even if it may not be it, at the light of how the reserved page >>>> zoneid/nid initialized went wrong, the above line like it's too flakey >>>> to stay. >>>> >>>> It'd be preferable if the pfn_valid fails and the >>>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate >>>> step. Even better memset 0xff over the whole page struct until the >>>> second stage comes around. >>>> >>>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct >>>> all times, otherwise if the second stage fails we end up in a bug with >>>> weird side effects. >>> >>> Yeah I guess it would be simpler if zoneid/nid was correct for >>> pfn_valid() pfns within a zone's range, even if they are reserved due >>> not not being really usable memory. >>> >>> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the >>> chosen solution is to make this to a real hole, the hole should be >>> extended to MAX_ORDER_NR_PAGES aligned boundaries. >> >> As we don't punch out pages of the memmap on x86-64, pfn_valid() keeps >> working as expected. There is a memmap that can be accessed and that was >> initialized. 
> > I suspect that memmap for the reserved pages is not properly initialized > after recent changes in free_area_init(). They are cleared at > init_unavailable_mem() to have zone=0 and node=0, but they seem to be > never re-initialized with proper zone and node links which was not the > case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock > regions rather that check each PFN"). > > Back then, memmap_init_zone() looped from zone_start_pfn till > zone_end_pfn and struct page for reserved pages with pfns inside the > zone would be initialized. > > Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with > memblock.memory and for x86 reserved ranges are not in memblock.memory, > so the memmap for them remains semi-initialized. As far as I understood Mel, rounding these ranges up/down to cover full MAX_ORDER blocks/pageblocks might work.
On Wed, Nov 25, 2020 at 10:30:53AM +0000, Mel Gorman wrote: > On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote: > > Hello, > > > > On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote: > > > I would hope that is not the case because they are not meant to overlap. > > > However, if the beginning of the pageblock was not the start of a zone > > > then the pages would be valid but the pfn would still be outside the > > > zone boundary. If it was reserved, the struct page is valid but not > > > suitable for set_pfnblock_flags_mask. However, it is a concern in > > > general because the potential is there that pages are isolated from the > > > wrong zone. > > > > I guess we have more than one issue to correct in that function > > because the same BUG_ON reproduced again even with the tentative patch > > I posted earlier. > > > > So my guess is that the problematic reserved page isn't pointed by the > > min_pfn, but it must have been pointed by the "highest" variable > > calculated below? > > > > if (pfn >= highest) > > highest = pageblock_start_pfn(pfn); > > > > When I looked at where "highest" comes from, it lacks > > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn). > > > > Is that the real bug, which may be fixed by something like this? (untested) > > > > It's plausible as it is a potential source of leaking but as you note > in another mail, it's surprising to me that valid struct pages, even if > within memory holes and reserved would have broken node/zone information > in the page flags. I think the patch to add pageblock_pfn_to_page is still needed to cope with !pfn_valid or a pageblock in between zones, but pfn_valid or pageblock in between zones is not what happens here. So the patch adding pageblock_pfn_to_page would have had the undesired side effect of hiding the bug so it's best to deal with the other bug first.
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > Before that change, the memmap of memory holes were only zeroed > out. So the zones/nid was 0, however, pages were not reserved and > had a refcount of zero - resulting in other issues. So maybe that "0,0" zoneid/nid was not actually the thing that introduced the regression? Note: I didn't bisect anything yet, it was just a guess. In fact, I need first thing to boot with an old v5.3 kernel and to dump the page->flags around the "Unknown E820 type" region to be 100% certain that the older kernels had a correct page->flags for reserved pages. I will clear this up before the end of the day. > Most pfn walkers shouldn‘t mess with reserved pages and simply skip > them. That would be the right fix here. What would then be the advantage of keeping wrong page->flags in reserved pages compared to actually set it correct? How do you plan to enforce that every VM bit will call PageReserved (as it should be introduced in pageblock_pfn_to_page in that case) if it's not possible to disambiguate a real DMA zone and nid 0 from the uninitialized value? How can we patch page_nodenum to enforce it won't read an uninitialized value because a PageReserved check was missing in the caller? I don't see this easily enforceable by code review, it'd be pretty flakey to leave a 0,0 wrong value in there with no way to verify if a PageResered check in the caller was missing. > > It'd be preferable if the pfn_valid fails and the > > pfn_to_section_nr(pfn) returns an invalid section for the intermediate > > step. Even better memset 0xff over the whole page struct until the > > second stage comes around. > > I recently discussed with Baoquan to > 1. Using a new pagetype to mark memory holes > 2. Using special nid/zid to mark memory holes > > Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON. What would need to call pfn_zone in between first and second stage? If something calls pfn_zone in between first and second stage isn't it a feature if it crashes the kernel at boot? Note: I suggested 0xff kernel crashing "until the second stage comes around" during meminit at boot, not permanently. /* * Use a fake node/zone (0) for now. Some of these pages * (in memblock.reserved but not in memblock.memory) will * get re-initialized via reserve_bootmem_region() later. */ Specifically I relied on the comment "get re-initialized via reserve_bootmem_region() later". I assumed the second stage overwrites the 0,0 to the real zoneid/nid value, which is clearly not happening, hence it'd be preferable to get a crash at boot reliably. Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage calling init_reserved_page(start_pfn) won't do much with CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the page->flags were still wrong for reserved pages in the "Unknown E820 type" region. > > Whenever pfn_valid is true, it's better that the zoneid/nid is correct > > all times, otherwise if the second stage fails we end up in a bug with > > weird side effects. > > Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no? zoneid is always a pure abstract concept, not just the RAM-holes but the RAM itself doesn't have a zoneid at all. Nid is hardware defined for RAM, but for reserved RAM holes it becomes a pure software construct as the zoneid. 
So while it's right that they have no zone/nid in the hardware, they still have a very well defined zoneid/nid in the software implementation. The page struct belongs to one and only one mem_map array, and that array has a specific nodeid and zoneid. So it can always be initialized right even for non-RAM. Only if the pfn is !pfn_valid does it have no page struct, and then there's no zone/nid association, but in that case there's no reason to even worry about it since nobody can possibly get a wrong value out of the page struct because there's no page struct in that case. Last but not least, RAM pages can be marked reserved and assigned to hardware, so it'd be really messy if real reserved RAM pages given to hw behaved differently from non-RAM that accidentally got a struct page because of section alignment issues. Thanks, Andrea
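A minimal sketch of the "poison until the second stage" idea being proposed here (hypothetical helper; the current first-stage code instead calls __init_single_page(..., 0, 0)):

	/*
	 * Hypothetical first-stage init for pfns not covered by memblock.memory:
	 * poison the whole struct page so any use of the zoneid/nid before
	 * reserve_bootmem_region() re-initializes it crashes loudly instead of
	 * silently reading node 0 / zone 0.
	 */
	static void __init init_unavailable_range_poison(unsigned long spfn,
							 unsigned long epfn)
	{
		unsigned long pfn;

		for (pfn = spfn; pfn < epfn; pfn++) {
			if (!pfn_valid(pfn))
				continue;
			memset(pfn_to_page(pfn), 0xff, sizeof(struct page));
		}
	}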
On Wed, Nov 25, 2020 at 12:41:55PM +0100, David Hildenbrand wrote: > On 25.11.20 12:04, David Hildenbrand wrote: > > On 25.11.20 11:39, Mel Gorman wrote: > >> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > >>>> Something must have changed more recently than v5.1 that caused the > >>>> zoneid of reserved pages to be wrong, a possible candidate for the > >>>> real would be this change below: > >>>> > >>>> + __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > >>>> > >>> > >>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues. > >>> > >>> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here. > >>> > >> > >> Ordinarily yes, pfn walkers should not care about reserved pages but it's > >> still surprising that the node/zone linkages would be wrong for memory > >> holes. If they are in the middle of a zone, it means that a hole with > >> valid struct pages could be mistaken for overlapping nodes (if the hole > >> was in node 1 for example) or overlapping zones which is just broken. > > > > I agree within zones - but AFAIU, the issue is reserved memory between > > zones, right? > > Double checking, I was confused. This applies also to memory holes > within zones in x86. Yes, this is a memory hole within the DMA32 zone. Still, why should there be any difference? As long as a page struct exists it's in a well defined mem_map array which corresponds to one and only one zoneid/nid combination. So what would be the benefit of treating memory holes within zones or in between zones differently and leaving one or the other with a zoneid/nid uninitialized?
On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote: > Yeah I guess it would be simpler if zoneid/nid was correct for > pfn_valid() pfns within a zone's range, even if they are reserved due > not not being really usable memory. > > I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the > chosen solution is to make this to a real hole, the hole should be > extended to MAX_ORDER_NR_PAGES aligned boundaries. The way pfn_valid works, it's not possible to render all non-RAM pfns as !pfn_valid; CONFIG_HOLES_IN_ZONE would not achieve it 100% either. So I don't think we can rely on that to eliminate all non-RAM reserved pages from the mem_map and avoid having to initialize them in the first place. Some could remain, as in this case, since in the same pageblock there's non-RAM followed by RAM and all pfns are valid. > In any case, compaction code can't fix this with better range checks. David's correct that it can, by adding enough PageReserved checks (I'm running all systems reproducing this with plenty of PageReserved checks in all places to work around it until we do a proper fix). My problem with that is that 1) it's simply not enforceable at runtime that there is no missing PageReserved check and 2) what benefit would it provide to leave a wrong zoneid in reserved pages and having to add extra PageReserved checks? A struct page has a deterministic zoneid/nid: if it's pointed to by a valid pfn (as in pfn_valid()), the simplest is that the zoneid/nid in the page remains correct no matter whether it was reserved at boot or marked reserved by a driver that swaps the page somewhere else with the GART or EFI or something else. All reserved pages should work the same, RAM and non-RAM, since the non-RAM status can basically change at runtime if a driver assigns the page to hw somehow. NOTE: on the compaction side, we still need to add the pageblock_pfn_to_page check to validate the "highest" pfn, because the pfn_valid() check is missing on the first pfn of the pageblock, as is the check for a pageblock that spans two different zones. Thanks, Andrea
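For completeness, the compaction-side validation of "highest" being referred to would be roughly of this shape, in the context of fast_isolate_freepages() (a sketch only, not the patch that was actually posted):

	/*
	 * Sketch: validate the "highest" candidate the same way min_pfn is
	 * validated, so a pageblock whose first pfn is !pfn_valid or which
	 * spans two zones is not used as a free scanner target.
	 */
	if (highest) {
		page = pageblock_pfn_to_page(highest,
					     pageblock_end_pfn(highest),
					     cc->zone);
		if (!page)
			highest = 0;	/* don't trust this pageblock */
	}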
On Wed, Nov 25, 2020 at 04:13:25PM +0200, Mike Rapoport wrote: > I suspect that memmap for the reserved pages is not properly initialized > after recent changes in free_area_init(). They are cleared at > init_unavailable_mem() to have zone=0 and node=0, but they seem to be I'd really like it if we would not leave those at 0,0 and instead set the whole struct page to 0xff if we miss the second stage that corrects the uninitialized value. The hope is that it'll crash faster and more reproducibly that way. > never re-initialized with proper zone and node links which was not the > case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock > regions rather that check each PFN"). What's strange is that 73a6e474cb37 was suggested as a fix for this bug... https://lkml.kernel.org/r/20200505124314.GA5029@MiWiFi-R3L-srv The "pageblock_pfn_to_page" check to validate min_pfn was added in commit 73a6e474cb37, so I assumed that the first report below didn't have commit 73a6e474cb37 already applied. https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw However, if you're correct, perhaps the patch was already applied in 5.7.0-rc2-next-20200423+; it landed upstream in v5.8 after all. > Back then, memmap_init_zone() looped from zone_start_pfn till > zone_end_pfn and struct page for reserved pages with pfns inside the > zone would be initialized. > > Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with > memblock.memory and for x86 reserved ranges are not in memblock.memory, > so the memmap for them remains semi-initialized. That would match the symptoms. I'll test it as the first thing after confirming older kernels had the right zoneid/nid on the reserved pages. Thanks, Andrea
On 25.11.20 19:28, Andrea Arcangeli wrote: > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: >> Before that change, the memmap of memory holes were only zeroed >> out. So the zones/nid was 0, however, pages were not reserved and >> had a refcount of zero - resulting in other issues. > > So maybe that "0,0" zoneid/nid was not actually the thing that > introduced the regression? Note: I didn't bisect anything yet, it was > just a guess. I guess 0/0 is the issue, but that existed before when we had a simple memmset(0). The root issue should be what Mike said: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") >> Most pfn walkers shouldn‘t mess with reserved pages and simply skip >> them. That would be the right fix here. > > What would then be the advantage of keeping wrong page->flags in > reserved pages compared to actually set it correct? "correct" is problematic. If you have an actual memory hole, there is not always a right answer - unless I am missing something important. Assume you have a layout like this [ zone X ] [ hole ] [ zone Y ] If either X and or Y starts within a memory section, you have a valid memmap for X - but what would be the right node/zone? Assume you have a layout like this [ zone X ] whereby X ends inside a memory section. The you hotplug memory. Assume it goes to X [ zone X ][ hole in X ][ zone X] or it goes to y [ zone X ][ hole ][ zone Y ] This can easily be reproduced by starting a VM in qemu with a memory size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory. > > How do you plan to enforce that every VM bit will call PageReserved > (as it should be introduced in pageblock_pfn_to_page in that case) if > it's not possible to disambiguate a real DMA zone and nid 0 from the > uninitialized value? How can we patch page_nodenum to enforce it won't > read an uninitialized value because a PageReserved check was missing > in the caller? We can't. The general rule is (as I was once told by Michal IIRC) that there is no trusting on memmap content when the reserved bit is set - unless you know "why it was set" - e.g., to distinguish early boot allocations from ordinary allocations. See below for pointer to documentation. > > I don't see this easily enforceable by code review, it'd be pretty > flakey to leave a 0,0 wrong value in there with no way to verify if a > PageResered check in the caller was missing. I'm not rooting for "keep this at 0/0" - I'm saying that I think there are corner cases where it might not be that easy. > >>> It'd be preferable if the pfn_valid fails and the >>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate >>> step. Even better memset 0xff over the whole page struct until the >>> second stage comes around. >> >> I recently discussed with Baoquan to >> 1. Using a new pagetype to mark memory holes >> 2. Using special nid/zid to mark memory holes >> >> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON. > > What would need to call pfn_zone in between first and second stage? > > If something calls pfn_zone in between first and second stage isn't it > a feature if it crashes the kernel at boot? > > Note: I suggested 0xff kernel crashing "until the second stage comes > around" during meminit at boot, not permanently. Yes, then it makes sense - if we're able to come up with a way to initialize any memmap we might have - including actual memory holes that have a memmap. > > /* > * Use a fake node/zone (0) for now. 
Some of these pages > * (in memblock.reserved but not in memblock.memory) will > * get re-initialized via reserve_bootmem_region() later. > */ > > Specifically I relied on the comment "get re-initialized via > reserve_bootmem_region() later". Yes, but there is a "Some of these" :) Boot a VM with "-M 4000" and observe the memmap in the last section - they won't get initialized a second time. > > I assumed the second stage overwrites the 0,0 to the real zoneid/nid > value, which is clearly not happening, hence it'd be preferable to get > a crash at boot reliably. > > Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage > calling init_reserved_page(start_pfn) won't do much with > CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable > CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the > page->flags were still wrong for reserved pages in the "Unknown E820 > type" region. > >>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct >>> all times, otherwise if the second stage fails we end up in a bug with >>> weird side effects. >> >> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no? > > zoneid is always a pure abstract concept, not just the RAM-holes but > the RAM itself doesn't have a zoneid at all. > > Nid is hardware defined for RAM, but for reserved RAM holes it becomes > a pure software construct as the zoneid. > > So while it's right that they have no zone/nid in the hardware, they > still have a very well defined zoneid/nid in the software implementation. > > The page struct belongs to one and only one mem_map array, that has a > specific nodeid and nid. So it can be always initialized right even > for non-RAM. AFAIK, the mem_map array might have multiple NIDs - and it's set when initializing the zones. > > Only if the pfn is !pfn_valid, then it has no page struct and then > there's no zone/nid association, but in that case there's no reason to > even worry about it since nobody can possibly get a wrong value out of > the page struct because there's no page struct in the case. > > Last but not the least, RAM pages can be marked reserved and assigned > to hardware and so it'd be really messy if real reserved RAM pages > given to hw, behaved different than non-RAM that accidentally got a > struct page because of section alignment issues. Well, "reserved" is not a good indication "what" something actually is. I documented that a while ago in include/linux/page-flags.h "PG_reserved is set for special pages. The "struct page" of such a page should in general not be touched (e.g. set dirty) except by its owner. Pages marked as PG_reserved include:." I suggest looking at that. AFAIR, we have been setting *most* memmap in memory holes/non-ram reserved for a long time - long before I added the __init_single_page - see init_reserved_page() for example.
On 25.11.20 20:01, Andrea Arcangeli wrote: > On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote: >> Yeah I guess it would be simpler if zoneid/nid was correct for >> pfn_valid() pfns within a zone's range, even if they are reserved due >> not not being really usable memory. >> >> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the >> chosen solution is to make this to a real hole, the hole should be >> extended to MAX_ORDER_NR_PAGES aligned boundaries. > > The way pfn_valid works it's not possible to render all non-RAM pfn as > !pfn_valid, CONFIG_HOLES_IN_ZONE would not achieve it 100% either. So Well, we could do it the arm64 way and provide a custom pfn_valid() and check memblock for RAM - please don't! :D > I don't think we can rely on that to eliminate all non-RAM reserved > pages from the mem_map and avoid having to initialize them in the > first place. Some could remain as in this case since in the same > pageblock there's non-RAM followed by RAM and all pfn are valid. > >> In any case, compaction code can't fix this with better range checks. > > David's correct that it can, by adding enough PageReserved (I'm > running all systems reproducing this with plenty of PageReserved > checks in all places to work around it until we do a proper fix). > > My problem with that is that 1) it's simply non enforceable at runtime > that there is not missing PageReserved check and 2) what benefit it > would provide to leave a wrong zoneid in reserved pages and having to > add extra PageReserved checks? See my other mail. If we have a clean way to set *any* memmap (non-RAM, memory holes at any place) to a proper nid/zid, then we won't need reserved checks. I raised some cases that need more thought than a simple "hole in zone".
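For reference, the arm64-style approach mentioned above boils down to something like the following (a sketch; arm64's actual implementation has additional checks):

	/*
	 * Sketch of an arch-override pfn_valid() that consults memblock, so
	 * non-RAM holes report !pfn_valid even when a memmap page exists for
	 * them. This is the approach being rejected for x86 above.
	 */
	int pfn_valid(unsigned long pfn)
	{
		phys_addr_t addr = PFN_PHYS(pfn);

		if (PHYS_PFN(addr) != pfn)
			return 0;	/* pfn doesn't fit in phys_addr_t */
		return memblock_is_map_memory(addr);
	}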
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote: > On 25.11.20 19:28, Andrea Arcangeli wrote: > > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > >> Before that change, the memmap of memory holes were only zeroed > >> out. So the zones/nid was 0, however, pages were not reserved and > >> had a refcount of zero - resulting in other issues. > > > > So maybe that "0,0" zoneid/nid was not actually the thing that > > introduced the regression? Note: I didn't bisect anything yet, it was > > just a guess. > > I guess 0/0 is the issue, but that existed before when we had a simple > memmset(0). The root issue should be what Mike said: Yes, the second stage must have stopped running somehow. Is there anything we can do to induce a deterministically reproducible kernel crashing behavior if the second stage doesn't run? Why did we start doing a more graceful initialization in the first stage, instead of making a less graceful by setting it to 0xff instead of 0x00? > 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather > that check each PFN") So if that's not intentional, are you suggesting nodeid/nid was a bug if it was set to 0,0 for a non-RAM valid pfn? > "correct" is problematic. If you have an actual memory hole, there is > not always a right answer - unless I am missing something important. > > > Assume you have a layout like this > > [ zone X ] [ hole ] [ zone Y ] > > If either X and or Y starts within a memory section, you have a valid > memmap for X - but what would be the right node/zone? > > > Assume you have a layout like this > > [ zone X ] > > whereby X ends inside a memory section. The you hotplug memory. Assume > it goes to X > > [ zone X ][ hole in X ][ zone X] > > or it goes to y > > [ zone X ][ hole ][ zone Y ] > > This can easily be reproduced by starting a VM in qemu with a memory > size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory. I don't get what the problem is sorry. You have a pfn, if pfn_valid() is true, pfn_to_page returns a page deterministically. It's up to the kernel to decide which page structure blongs to any pfn in the pfn_to_page function. Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id in page->flags points to a node->zone whose zone_start_pfn - end_zone_pfn range doesn't contain "pfn" that is a bug in page_alloc.c. I don't see how is it not possible to deterministically enforce the above never happens. Only then it would be true that there's not always a right answer. zone can overlap, but it can't be that you do pfn_to_page of a pfn_valid and you obtain a page whose zone doesn't contain that pfn. Which is what is currently crashing compaction. I don't see how this is an unsolvable problem and why we should accept to live with a bogus page->flags for reserved pages. > We can't. The general rule is (as I was once told by Michal IIRC) that The fact we can't kernel crash reliably when somebody uses the wrong 0,0 uninitialized value by not adding an explicit PageReserved check, is my primary concern in keeping those nodeid/nid uninitialized, but non-kernel-crashing, since it already created this unreproducible bug. > I'm not rooting for "keep this at 0/0" - I'm saying that I think there > are corner cases where it might not be that easy. I'm not saying it's easy. What I don't see is how you don't always have the right answer and why it would be an unsolvable problem. 
It is certainly problematic and difficult to solve in the mem_map initialization logic, but to me having pfn_valid() && page_zone(pfn_to_page(pfn)) randomly returning the DMA zone of the first node also looks problematic and difficult to handle across all VM code, so overall it looks preferable to keep the complexity of the mem_map initialization self-contained and not spilling over the rest of the VM. > Yes, but there is a "Some of these" :) > > Boot a VM with "-M 4000" and observe the memmap in the last section - > they won't get initialized a second time. Is the area beyond the end of the zone yet another case? I guess that's less likely to give us problems because it's beyond the end of the zone. Would pfn_valid return true for those pfns? If pfn_valid is not true it's not really a concern, but then again I'd prefer it if those struct pages beyond the end of the zone were set to the kernel-crashing 0xff. In other words I just don't see why we should ever prefer to leave some pages at a graceful and erroneous zoneid 0 / nid 0 that wouldn't easily induce a crash if used. > AFAIK, the mem_map array might have multiple NIDs - and it's set when initializing the zones. Well, because there's no single mem_map array with SPARSEMEM, but it's not conceptually too different than if there was one. Even with flatmem there could be multiple page structs for each pfn; the disambiguation has to be handled by pfn_to_page regardless of SPARSEMEM or not. The point is that if page_zone(pfn_to_page(pfn)) points to the DMA zone of the first node, and the pfn isn't part of the DMA zone of the first node, that looks like a bug, and it can be enforced that it doesn't happen. > Well, "reserved" is not a good indication "what" something actually is. > > I documented that a while ago in include/linux/page-flags.h > > "PG_reserved is set for special pages. The "struct page" of such a page > should in general not be touched (e.g. set dirty) except by its owner. > Pages marked as PG_reserved include:." > > I suggest looking at that. > > AFAIR, we have been setting *most* memmap in memory holes/non-ram reserved for a long time - long before I added the __init_single_page - see init_reserved_page() for example. Sure, non-RAM with a valid page struct has always been marked PG_reserved. I wasn't suggesting that it shouldn't be PG_reserved. I was pointing out that RAM can also be marked PG_reserved later by the kernel, long after boot, as you mentioned for all other cases of PG_reserved; the most notable are drivers setting PG_reserved after allocating RAM with vmalloc, or the GART swapping RAM around at another alias physical address.
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote: > On 25.11.20 19:28, Andrea Arcangeli wrote: > > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > > > > What would need to call pfn_zone in between first and second stage? > > > > If something calls pfn_zone in between first and second stage isn't it > > a feature if it crashes the kernel at boot? > > > > Note: I suggested 0xff kernel crashing "until the second stage comes > > around" during meminit at boot, not permanently. > > Yes, then it makes sense - if we're able to come up with a way to > initialize any memmap we might have - including actual memory holes that > have a memmap. > > > > > /* > > * Use a fake node/zone (0) for now. Some of these pages > > * (in memblock.reserved but not in memblock.memory) will > > * get re-initialized via reserve_bootmem_region() later. > > */ > > > > Specifically I relied on the comment "get re-initialized via > > reserve_bootmem_region() later". > > Yes, but there is a "Some of these" :) > > Boot a VM with "-M 4000" and observe the memmap in the last section - > they won't get initialized a second time. > > > > > I assumed the second stage overwrites the 0,0 to the real zoneid/nid > > value, which is clearly not happening, hence it'd be preferable to get > > a crash at boot reliably. > > > > Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage > > calling init_reserved_page(start_pfn) won't do much with > > CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable > > CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the > > page->flags were still wrong for reserved pages in the "Unknown E820 > > type" region. I think the very root cause is how e820__memblock_setup() registers memory with memblock: if (entry->type == E820_TYPE_SOFT_RESERVED) memblock_reserve(entry->addr, entry->size); if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) continue; memblock_add(entry->addr, entry->size); From that point the system has inconsistent view of RAM in both memblock.memory and memblock.reserved and, which is then translated to memmap etc. Unfortunately, simply adding all RAM to memblock is not possible as there are systems that for them "the addresses listed in the reserved range must never be accessed, or (as we discovered) even be reachable by an active page table entry" [1]. [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/
> Am 25.11.2020 um 21:41 schrieb Andrea Arcangeli <aarcange@redhat.com>: > > On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote: >>> On 25.11.20 19:28, Andrea Arcangeli wrote: >>> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: >>>> Before that change, the memmap of memory holes were only zeroed >>>> out. So the zones/nid was 0, however, pages were not reserved and >>>> had a refcount of zero - resulting in other issues. >>> >>> So maybe that "0,0" zoneid/nid was not actually the thing that >>> introduced the regression? Note: I didn't bisect anything yet, it was >>> just a guess. >> >> I guess 0/0 is the issue, but that existed before when we had a simple >> memmset(0). The root issue should be what Mike said: > > Yes, the second stage must have stopped running somehow. > > Is there anything we can do to induce a deterministically reproducible > kernel crashing behavior if the second stage doesn't run? > > Why did we start doing a more graceful initialization in the first > stage, instead of making a less graceful by setting it to 0xff instead > of 0x00? I guess because we weren‘t aware of the issues we have :) > >> 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather >> that check each PFN") > > So if that's not intentional, are you suggesting nodeid/nid was a bug > if it was set to 0,0 for a non-RAM valid pfn? > Depends on how we think checks for reserved pages should be performed. I am more of a friend of indicating „this memmap is just garbage, skip it“. If the reserved flag is not good enough, then via a special node/zone - as you also suggest below. >> "correct" is problematic. If you have an actual memory hole, there is >> not always a right answer - unless I am missing something important. >> >> >> Assume you have a layout like this >> >> [ zone X ] [ hole ] [ zone Y ] >> >> If either X and or Y starts within a memory section, you have a valid >> memmap for X - but what would be the right node/zone? >> >> >> Assume you have a layout like this >> >> [ zone X ] >> >> whereby X ends inside a memory section. The you hotplug memory. Assume >> it goes to X >> >> [ zone X ][ hole in X ][ zone X] >> >> or it goes to y >> >> [ zone X ][ hole ][ zone Y ] >> >> This can easily be reproduced by starting a VM in qemu with a memory >> size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory. > > I don't get what the problem is sorry. > > You have a pfn, if pfn_valid() is true, pfn_to_page returns a page > deterministically. > > It's up to the kernel to decide which page structure blongs to any pfn > in the pfn_to_page function. > > Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id > in page->flags points to a node->zone whose zone_start_pfn - > end_zone_pfn range doesn't contain "pfn" that is a bug in > page_alloc.c. > > I don't see how is it not possible to deterministically enforce the > above never happens. Only then it would be true that there's not > always a right answer. > > zone can overlap, but it can't be that you do pfn_to_page of a > pfn_valid and you obtain a page whose zone doesn't contain that > pfn. Which is what is currently crashing compaction. > > I don't see how this is an unsolvable problem and why we should accept > to live with a bogus page->flags for reserved pages. > I said it‘s problematic, not unsolvable. Using a special zone/node is certainly easier - but might reveal some issues we have to fix - I guess? Fair enough. >> We can't. 
The general rule is (as I was once told by Michal IIRC) that > > The fact we can't kernel crash reliably when somebody uses the wrong > 0,0 uninitialized value by not adding an explicit PageReserved check, > is my primary concern in keeping those nodeid/nid uninitialized, but > non-kernel-crashing, since it already created this unreproducible bug. Agreed. > >> I'm not rooting for "keep this at 0/0" - I'm saying that I think there >> are corner cases where it might not be that easy. > > I'm not saying it's easy. What I don't see is how you don't always > have the right answer and why it would be an unsolvable problem. „Problematic“ does not imply unsolvable. > > It is certainly problematic and difficult to solve in the mem_map > iniitalization logic, but to me having pfn_valid() && > page_zone(pfn_to_page(pfn)) randomly returning the DMA zone on first > node also looks problematic and difficult to handle across all VM > code, so overall it looks preferable to keep the complexity of the > mem_map initialization self contained and not spilling over the rest > of the VM. > >> Yes, but there is a "Some of these" :) >> >> Boot a VM with "-M 4000" and observe the memmap in the last section - >> they won't get initialized a second time. > > Is the beyond the end of the zone yet another case? I guess that's > less likely to give us problems because it's beyond the end of the > zone. Would pfn_valid return true for those pfn? If pfn_valid is not Yes. Especially, exposed after memory hotplug when zone/nid span changes. > true it's not really a concern but the again I'd rather prefer if > those struct pages beyond the end of the zone were kernel crashing set > to 0xff. > > In other words I just don't see why we should ever prefer to leave > some pages at a graceful and erroneous nid 0 nodeid 0 that wouldn't > easily induce a crash if used. I agree. > >> AFAIK, the mem_map array might have multiple NIDs - and it's set when >> initializing the zones. > > Well because there's no mem_map array with SPARSEMEM, but it's not > conceptually too different than if there was one. Even with flatmem > there could be multiple page struct for each pfn, the disambiguation > has to be handled by pfn_to_page regardless of SPARSEMEM or not. > > The point is that if zone_page(pfn_to_page(pfn)) points to DMA zone of > first node, and the pfn isn't part of the DMA of first node that looks > a bug and it can be enforced it doesn't happen. > >> Well, "reserved" is not a good indication "what" something actually is. >> >> I documented that a while ago in include/linux/page-flags.h >> >> "PG_reserved is set for special pages. The "struct page" of such a page >> should in general not be touched (e.g. set dirty) except by its owner. >> Pages marked as PG_reserved include:." >> >> I suggest looking at that. >> >> AFAIR, we have been setting *most* memmap in memory holes/non-ram >> reserved for a long time - long before I added the __init_single_page - >> see init_reserved_page() for example. > > Sure, non-RAM with valid page struct always has been marked > PG_reserved. I wasn't suggesting that it shouldn't be PG_reserved. > > I was pointing out that RAM can also be marked PG_reserved later by > the kernel, long after boot, as you mentioned for all other cases of > PG_reserved, the most notable are drivers doing PG_reserved after > allocating RAM either vmalloc or GART swapping RAM around at other > alias physical address. 
> > That is all born as RAM at boot, it gets page->flags done right, with > the right zoneid, and it becomes PG_reserved later. > > So I was suggesting physical ranges "pfn" of non-RAM (be those holes > withtin zones, or in between zones doesn't matter) with a pfn_valid > returning true and a pfn_to_page pointing deterministically to one and > only one struct page, should have such struct page initialized exactly > the same as if it was RAM. > > Either that or we can define a new NO_ZONE NO_ID id and crash in > page_zonenum or page_to_nid if it is ever called on such a page > struct. I feel like that is easier and maybe cleaner. Mark memmaps that exist but should be completely ignored. Could even check that in pfn_valid() and return „false“ - might be expensive, though. Anyhow, I do agree that properly catching these problematic pages, bailing out and fixing them (however we decide) is the right approach. > > Thanks, > Andrea
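A rough sketch of the NO_ZONE/NO_NODE idea under discussion (names and values are made up for illustration; this is not an existing kernel interface):

	/*
	 * Reserve an otherwise impossible node id as poison and trip a
	 * VM_BUG_ON if anybody asks for the node of a memmap entry that only
	 * exists because of section alignment.
	 */
	#define NID_POISON	((1 << NODES_SHIFT) - 1)

	static inline int page_to_nid_checked(struct page *page)
	{
		int nid = page_to_nid(page);

		VM_BUG_ON_PAGE(nid == NID_POISON, page);
		return nid;
	}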
On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote: > I think the very root cause is how e820__memblock_setup() registers > memory with memblock: > > if (entry->type == E820_TYPE_SOFT_RESERVED) > memblock_reserve(entry->addr, entry->size); > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > continue; > > memblock_add(entry->addr, entry->size); > > From that point the system has inconsistent view of RAM in both > memblock.memory and memblock.reserved and, which is then translated to > memmap etc. > > Unfortunately, simply adding all RAM to memblock is not possible as > there are systems that for them "the addresses listed in the reserved > range must never be accessed, or (as we discovered) even be reachable by > an active page table entry" [1]. > > [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/ It looks like what's missing is a blockmem_reserve which I don't think would interfere at all with the issue above since it won't create direct mapping and it'll simply invoke the second stage that wasn't invoked here. I guess this would have a better chance to have the second initialization stage run in reserve_bootmem_region and it would likely solve the problem without breaking E820_TYPE_RESERVED which is known by the kernel: > if (entry->type == E820_TYPE_SOFT_RESERVED) > memblock_reserve(entry->addr, entry->size); > + if (entry->type == 20) + memblock_reserve(entry->addr, entry->size); > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > continue; > This is however just to show the problem, I didn't check what type 20 is. To me it doesn't look the root cause though, the root cause is that if you don't call memblock_reserve the page->flags remains uninitialized. I think the page_alloc.c need to be more robust and detect at least if if holes within zones (but ideally all pfn_valid of all struct pages in system even if beyond the end of the zone) aren't being initialized in the second stage without relying on the arch code to remember to call memblock_reserve. In fact it's not clear why memblock_reserve even exists, that information can be calculated reliably by page_alloc in function of memblock.memory alone by walking all nodes and all zones. It doesn't even seem to help in destroying the direct mapping, reserve_bootmem_region just initializes the struct pages so it doesn't need a special memeblock_reserved to find those ranges. In fact it's scary that codes then does stuff like this trusting the memblock_reserve is nearly complete information (which obviously isn't given type 20 doesn't get queued and I got that type 20 in all my systems): for_each_reserved_mem_region(i, &start, &end) { if (addr >= start && addr_end <= end) return true; } That code in irq-gic-v3-its.c should stop using for_each_reserved_mem_region and start doing pfn_valid(addr>>PAGE_SHIFT) if PageReserved(pfn_to_page(addr>>PAGE_SHIFT)) instead. At best memory.reserved should be calculated automatically by the page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed by the e820 caller, instead of adding the memory_reserve call for type 20 we should delete the memory_reserve function. Thanks, Andrea
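The suggested replacement for the for_each_reserved_mem_region() walk would look roughly like this (a sketch; the helper name is made up, and as noted in the reply below this only works once the memmap has been initialized):

	/*
	 * Sketch of the PageReserved-based check suggested above for
	 * irq-gic-v3-its.c (helper name is illustrative):
	 */
	static bool range_is_reserved(phys_addr_t addr, unsigned long size)
	{
		unsigned long pfn, end_pfn = PHYS_PFN(addr + size);

		for (pfn = PHYS_PFN(addr); pfn < end_pfn; pfn++) {
			if (!pfn_valid(pfn) || !PageReserved(pfn_to_page(pfn)))
				return false;
		}
		return true;
	}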
On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote: > pfn physaddr page->flags > 500224 0x7a200000 0x1fff000000001000 reserved True > 500225 0x7a201000 0x1fff000000001000 reserved True > *snip* > 500245 0x7a215000 0x1fff000000001000 reserved True > 500246 0x7a216000 0x1fff000000001000 reserved True > 500247 0x7a217000 0x3fff000000000000 reserved False > 500248 0x7a218000 0x3fff000000000000 reserved False The range is still this: 7a17b000-7a216fff : Unknown E820 type 7a217000-7ffdbfff : System RAM This is with v5.3: Interestingly the pages below 0x7a217000 weren't even marked reserved. DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 DMA32 zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 1 Normal zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 500246 0x7a216000 0x1fff000000000000 reserved False 500247 0x7a217000 0x1fff000000022036 reserved False 500248 0x7a218000 0x1fff000000002032 reserved False 4b094b7851bf4bf551ad456195d3f26e1c03bd74 no change (because the unknown e820 type 20 was still initialized by memmap_init_zone despite it was not part of memblock.memory): DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 DMA32 zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 1 Normal zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 500246 0x7a216000 0x1fff000000000000 reserved False 500247 0x7a217000 0x1fff000000022036 reserved False 500248 0x7a218000 0x1fff000000002032 reserved False 73a6e474cb376921a311786652782155eac2fdf0 this changed it: DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 DMA32 zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 0 Normal zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 500246 0x7a216000 0xfff000000001000 reserved True 500247 0x7a217000 0x1fff000000000000 reserved False 500248 0x7a218000 0x1fff000000010200 reserved False da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9 (73a6e474cb376921a311786652782155eac2fdf0^) no change: DMA zone_start_pfn 1 zone_end_pfn() 4096 contiguous 1 DMA32 zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 1 Normal zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 Movable zone_start_pfn 0 zone_end_pfn() 0 contiguous 0 500246 0x7a216000 0x1fff000000000000 reserved False 500247 0x7a217000 0x1fff000000002032 reserved False 500248 0x7a218000 0x1fff000000002032 reserved False So like Mike suggested this started in 73a6e474cb376921a311786652782155eac2fdf0, although the previous code looked buggy too by not setting PageReserved when it should have. It appears the pfn in the e820 type 20 were initialized by memmap_init_zone before commit 73a6e474cb376921a311786652782155eac2fdf0 and they're not initialized anymore after 73a6e474cb376921a311786652782155eac2fdf0 because the memblock.memory didn't include the e820 type 20 so it couldn't call memmap_init_zone anymore. So after 4b094b7851bf4bf551ad456195d3f26e1c03bd74 uninitialized pfn gets the 0,0 assignment (but even before 4b094b7851bf4bf551ad456195d3f26e1c03bd74 they would have gotten a completely zero page->flags, 4b094b7851bf4bf551ad456195d3f26e1c03bd74 is completely unrelated to this issue). Mike, your patch alone doesn't help because nothing calls memblock_reserve on the range starting at 0x7a17b000. I appended at the end the output of memblock=debug. 
Only after I apply the below hack on top of your patch finally all sign of bugs are gone, PG_reserved is finally set (old code was wrong not setting it) and the zoneid/nid is right (new code leaves it uninitialized and so it's wrong): diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 983cd53ed4c9..bbb5b4d7a117 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1303,6 +1303,8 @@ void __init e820__memblock_setup(void) if (entry->type == E820_TYPE_SOFT_RESERVED) memblock_reserve(entry->addr, entry->size); + if (entry->type == 20) + memblock_reserve(entry->addr, entry->size); if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) continue; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3fb35fe6a9e4..2e2047282ff3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6145,7 +6145,9 @@ void __meminit __weak memmap_init(unsigned long size, int nid, { unsigned long start_pfn, end_pfn; unsigned long range_end_pfn = range_start_pfn + size; + phys_addr_t start, end; int i; + u64 j; for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn); @@ -6157,6 +6159,19 @@ void __meminit __weak memmap_init(unsigned long size, int nid, MEMINIT_EARLY, NULL); } } + + for_each_mem_range(j, &memblock.reserved, NULL, nid, MEMBLOCK_NONE, + &start, &end, NULL) { + start_pfn = clamp(PHYS_PFN(start), range_start_pfn, + range_end_pfn); + end_pfn = clamp(PHYS_PFN(end), range_start_pfn, range_end_pfn); + + if (end_pfn > start_pfn) { + size = end_pfn - start_pfn; + memmap_init_zone(size, nid, zone, start_pfn, + MEMINIT_EARLY, NULL); + } + } } static int zone_batchsize(struct zone *zone) Result with above patch all good: 500246 0x7a216000 0x1fff000000001000 reserved True pfn_valid True 500247 0x7a217000 0x1fff000000002032 reserved False pfn_valid True 500248 0x7a218000 0x1fff000000020036 reserved False pfn_valid True Summary: both old code (missing PG_reserved) and the current code (leaving the page struct uninitialized and with wrong nodeid/nid) look wrong. Overall this brings more questions: - why memblock.reserved exists and isn't automatically calculated as inversion of memblock.memory? (or if you prefer: why is the above call to memblock_reserve needed to initialize the memory holes?) - why there's no initialization of the memblock.reserved regions given they exists, was it just an oversight? (this one is fixed by Mike's patch, although I wish it was possible to drop the function memblock_reserve instead) - what can we do instead of setting the uninitialized nodeid/nid to 0,0 and to reliably detect at boot if some page structure within zones (but ideally also outside the zone boundary for any pfn where pfn_valid returns true) is left uninitialized, as it is happening currently on the e820 type 20 range? 
Thanks, Andrea [ 0.006587] memblock_reserve: [0x00000000000f6bf0-0x00000000000f6bff] smp_scan_config+0xaa/0xee [ 0.006590] memblock_reserve: [0x00000000000f6c00-0x00000000000f6d7b] smp_scan_config+0xc1/0xee [ 0.006595] memblock_reserve: [0x0000000003400000-0x0000000003406fff] setup_arch+0x51b/0xb4e [ 0.006599] memblock_add: [0x0000000000001000-0x000000000009fbff] e820__memblock_setup+0x63/0x89 [ 0.006614] memblock_add: [0x0000000000100000-0x000000007a17afff] e820__memblock_setup+0x63/0x89 [ 0.006615] memblock_add: [0x000000007a217000-0x000000007ffdbfff] e820__memblock_setup+0x63/0x89 [ 0.006627] memblock_reserve: [0x000000000009f000-0x00000000000fffff] setup_arch+0x53f/0xb4e [ 0.006634] memblock_reserve: [0x0000000000099000-0x000000000009efff] reserve_real_mode+0x6f/0x8d [ 0.006665] memblock_reserve: [0x0000000000000000-0x000000000000ffff] setup_arch+0x56b/0xb4e [ 0.006907] memblock_reserve: [0x000000007fdff000-0x000000007fdfffff] alloc_low_pages+0x16b/0x180 [ 0.007219] memblock_reserve: [0x000000007ffd9000-0x000000007ffdbfff] memblock_alloc_range_nid+0xb1/0x11f [ 0.007245] memblock_alloc_try_nid: 16384 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 sparse_init+0x8c/0x2c9 [ 0.007246] memblock_reserve: [0x000000007ffd5000-0x000000007ffd8fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.007250] memblock_alloc_try_nid: 4096 bytes align=0x40 nid=0 from=0x0000000000000000 max_addr=0x0000000000000000 sparse_index_alloc+0x44/0x67 [ 0.007252] memblock_reserve: [0x000000007ffd4000-0x000000007ffd4fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.007260] memblock_alloc_try_nid: 640 bytes align=0x40 nid=0 from=0x0000000000000000 max_addr=0x0000000000000000 sparse_init_nid+0x3d/0x224 [ 0.007262] memblock_reserve: [0x000000007ffd3d80-0x000000007ffd3fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.007265] memblock_alloc_exact_nid_raw: 33554432 bytes align=0x200000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 sparse_init_nid+0x8a/0x224 [ 0.007266] memblock_reserve: [0x000000007dc00000-0x000000007fbfffff] memblock_alloc_range_nid+0xb1/0x11f [ 0.012443] memblock_alloc_try_nid_raw: 4096 bytes align=0x1000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 vmemmap_alloc_block_zero.constprop.0+0xc/0x24 [ 0.012450] memblock_reserve: [0x000000007ffd2000-0x000000007ffd2fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.012453] memblock_alloc_try_nid_raw: 4096 bytes align=0x1000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 vmemmap_alloc_block_zero.constprop.0+0xc/0x24 [ 0.012454] memblock_reserve: [0x000000007ffd1000-0x000000007ffd1fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.016975] memblock_alloc_try_nid: 73 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 acpi_parse_hpet+0xca/0x132 [ 0.016983] memblock_reserve: [0x000000007ffd3d00-0x000000007ffd3d48] memblock_alloc_range_nid+0xb1/0x11f [ 0.016992] memblock_alloc_try_nid: 75 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 io_apic_init_mappings+0x38/0x1c0 [ 0.016993] memblock_reserve: [0x000000007ffd3c80-0x000000007ffd3cca] memblock_alloc_range_nid+0xb1/0x11f [ 0.017012] memblock_alloc_try_nid: 768 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 e820__reserve_resources+0x2c/0x1be [ 0.017013] memblock_reserve: [0x000000007ffd3980-0x000000007ffd3c7f] memblock_alloc_range_nid+0xb1/0x11f [ 0.017017] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 
0.017019] memblock_reserve: [0x000000007ffd3900-0x000000007ffd3967] memblock_alloc_range_nid+0xb1/0x11f [ 0.017023] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017025] memblock_reserve: [0x000000007ffd3880-0x000000007ffd38e7] memblock_alloc_range_nid+0xb1/0x11f [ 0.017026] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017028] memblock_reserve: [0x000000007ffd3800-0x000000007ffd3867] memblock_alloc_range_nid+0xb1/0x11f [ 0.017029] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017031] memblock_reserve: [0x000000007ffd3780-0x000000007ffd37e7] memblock_alloc_range_nid+0xb1/0x11f [ 0.017032] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017033] memblock_reserve: [0x000000007ffd3700-0x000000007ffd3767] memblock_alloc_range_nid+0xb1/0x11f [ 0.017035] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017036] memblock_reserve: [0x000000007ffd3680-0x000000007ffd36e7] memblock_alloc_range_nid+0xb1/0x11f [ 0.017038] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017039] memblock_reserve: [0x000000007ffd3600-0x000000007ffd3667] memblock_alloc_range_nid+0xb1/0x11f [ 0.017041] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017042] memblock_reserve: [0x000000007ffd3580-0x000000007ffd35e7] memblock_alloc_range_nid+0xb1/0x11f [ 0.017044] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017045] memblock_reserve: [0x000000007ffd3500-0x000000007ffd3567] memblock_alloc_range_nid+0xb1/0x11f [ 0.017047] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017048] memblock_reserve: [0x000000007ffd3480-0x000000007ffd34e7] memblock_alloc_range_nid+0xb1/0x11f [ 0.017050] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52 [ 0.017051] memblock_reserve: [0x000000007ffd3400-0x000000007ffd3467] memblock_alloc_range_nid+0xb1/0x11f [ 0.017069] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x7a/0xf7 [ 0.017071] memblock_reserve: [0x000000007ffd33c0-0x000000007ffd33df] memblock_alloc_range_nid+0xb1/0x11f [ 0.017073] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x7a/0xf7 [ 0.017075] memblock_reserve: [0x000000007ffd3380-0x000000007ffd339f] memblock_alloc_range_nid+0xb1/0x11f [ 0.017099] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x7a/0xf7 [ 0.017101] memblock_reserve: [0x000000007ffd3340-0x000000007ffd335f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020380] memblock_alloc_try_nid: 70 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 
start_kernel+0x110/0x57c [ 0.020382] memblock_reserve: [0x000000007ffd32c0-0x000000007ffd3305] memblock_alloc_range_nid+0xb1/0x11f [ 0.020383] memblock_alloc_try_nid: 70 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 start_kernel+0x124/0x57c [ 0.020385] memblock_reserve: [0x000000007ffd3240-0x000000007ffd3285] memblock_alloc_range_nid+0xb1/0x11f [ 0.020391] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_alloc_info+0x5e/0x8f [ 0.020392] memblock_reserve: [0x000000007ffd0000-0x000000007ffd0fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020394] memblock_alloc_try_nid: 4096 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_embed_first_chunk+0x6a/0x299 [ 0.020396] memblock_reserve: [0x000000007ffcf000-0x000000007ffcffff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020397] memblock_alloc_try_nid: 2097152 bytes align=0x200000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 pcpu_embed_first_chunk+0xd0/0x299 [ 0.020399] memblock_reserve: [0x000000007da00000-0x000000007dbfffff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020584] memblock_free: [0x000000007da37000-0x000000007da3ffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020590] memblock_free: [0x000000007da77000-0x000000007da7ffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020595] memblock_free: [0x000000007dab7000-0x000000007dabffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020600] memblock_free: [0x000000007daf7000-0x000000007dafffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020606] memblock_free: [0x000000007db37000-0x000000007db3ffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020611] memblock_free: [0x000000007db77000-0x000000007db7ffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020616] memblock_free: [0x000000007dbb7000-0x000000007dbbffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020621] memblock_free: [0x000000007dbf7000-0x000000007dbfffff] pcpu_embed_first_chunk+0x1b8/0x299 [ 0.020624] memblock_alloc_try_nid: 8 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x2f2/0x7ae [ 0.020625] memblock_reserve: [0x000000007ffd3200-0x000000007ffd3207] memblock_alloc_range_nid+0xb1/0x11f [ 0.020626] memblock_alloc_try_nid: 8 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x30f/0x7ae [ 0.020628] memblock_reserve: [0x000000007ffd31c0-0x000000007ffd31c7] memblock_alloc_range_nid+0xb1/0x11f [ 0.020629] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x345/0x7ae [ 0.020630] memblock_reserve: [0x000000007ffd3180-0x000000007ffd319f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020632] memblock_alloc_try_nid: 64 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x37f/0x7ae [ 0.020633] memblock_reserve: [0x000000007ffd3140-0x000000007ffd317f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020639] memblock_alloc_try_nid: 576 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x638/0x7ae [ 0.020640] memblock_reserve: [0x000000007ffcedc0-0x000000007ffcefff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020645] memblock_alloc_try_nid: 144 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x7a/0x259 [ 0.020646] memblock_reserve: [0x000000007ffd3080-0x000000007ffd310f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020648] memblock_alloc_try_nid: 384 bytes 
align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0xd0/0x259 [ 0.020649] memblock_reserve: [0x000000007ffcec40-0x000000007ffcedbf] memblock_alloc_range_nid+0xb1/0x11f [ 0.020650] memblock_alloc_try_nid: 392 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x10f/0x259 [ 0.020652] memblock_reserve: [0x000000007ffcea80-0x000000007ffcec07] memblock_alloc_range_nid+0xb1/0x11f [ 0.020653] memblock_alloc_try_nid: 96 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x12f/0x259 [ 0.020654] memblock_reserve: [0x000000007ffd3000-0x000000007ffd305f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020657] memblock_alloc_try_nid: 144 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x7a/0x259 [ 0.020658] memblock_reserve: [0x000000007ffce9c0-0x000000007ffcea4f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020660] memblock_alloc_try_nid: 1024 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0xd0/0x259 [ 0.020661] memblock_reserve: [0x000000007ffce5c0-0x000000007ffce9bf] memblock_alloc_range_nid+0xb1/0x11f [ 0.020662] memblock_alloc_try_nid: 1032 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x10f/0x259 [ 0.020664] memblock_reserve: [0x000000007ffce180-0x000000007ffce587] memblock_alloc_range_nid+0xb1/0x11f [ 0.020665] memblock_alloc_try_nid: 256 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x12f/0x259 [ 0.020667] memblock_reserve: [0x000000007ffce080-0x000000007ffce17f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020668] memblock_free: [0x000000007ffd0000-0x000000007ffd0fff] pcpu_embed_first_chunk+0x25a/0x299 [ 0.020670] memblock_free: [0x000000007ffcf000-0x000000007ffcffff] pcpu_embed_first_chunk+0x287/0x299 [ 0.020681] memblock_alloc_try_nid: 4096 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_large_system_hash+0x16b/0x25a [ 0.020682] memblock_reserve: [0x000000007ffd0000-0x000000007ffd0fff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020741] memblock_alloc_try_nid: 2097152 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_large_system_hash+0x16b/0x25a [ 0.020742] memblock_reserve: [0x000000007d800000-0x000000007d9fffff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020854] memblock_alloc_try_nid: 1048576 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_large_system_hash+0x16b/0x25a [ 0.020856] memblock_reserve: [0x000000007fece080-0x000000007ffce07f] memblock_alloc_range_nid+0xb1/0x11f [ 0.020891] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 spp_getpage+0x3d/0x70 [ 0.020893] memblock_reserve: [0x000000007ffcf000-0x000000007ffcffff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020895] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 spp_getpage+0x3d/0x70 [ 0.020896] memblock_reserve: [0x000000007fecd000-0x000000007fecdfff] memblock_alloc_range_nid+0xb1/0x11f [ 0.020898] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 spp_getpage+0x3d/0x70 [ 0.020900] memblock_reserve: [0x000000007fecc000-0x000000007feccfff] memblock_alloc_range_nid+0xb1/0x11f
On Wed, Nov 25, 2020 at 04:38:16PM -0500, Andrea Arcangeli wrote: > On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote: > > I think the very root cause is how e820__memblock_setup() registers > > memory with memblock: > > > > if (entry->type == E820_TYPE_SOFT_RESERVED) > > memblock_reserve(entry->addr, entry->size); > > > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > > continue; > > > > memblock_add(entry->addr, entry->size); > > > > From that point the system has inconsistent view of RAM in both > > memblock.memory and memblock.reserved and, which is then translated to > > memmap etc. > > > > Unfortunately, simply adding all RAM to memblock is not possible as > > there are systems that for them "the addresses listed in the reserved > > range must never be accessed, or (as we discovered) even be reachable by > > an active page table entry" [1]. > > > > [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/ > > It looks like what's missing is a blockmem_reserve which I don't think > would interfere at all with the issue above since it won't create > direct mapping and it'll simply invoke the second stage that wasn't > invoked here. > > I guess this would have a better chance to have the second > initialization stage run in reserve_bootmem_region and it would likely > solve the problem without breaking E820_TYPE_RESERVED which is known > by the kernel: > > > if (entry->type == E820_TYPE_SOFT_RESERVED) > > memblock_reserve(entry->addr, entry->size); > > > > + if (entry->type == 20) > + memblock_reserve(entry->addr, entry->size); > > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > > continue; > > > > This is however just to show the problem, I didn't check what type 20 > is. I think it's inveneted by your BIOS vendor :) > To me it doesn't look the root cause though, the root cause is that if > you don't call memblock_reserve the page->flags remains uninitialized. I didn't mean that root cause is that we don't call memblock_reserve(). I meant that the root cause is inconsitency in memory representation. On most architectures, memblock.memory represents the entire RAM in the system and memblock.reserved represents memory regions that were reserved either by the firmware or by the kernel during early boot. On x86 the memory that firmware reserved for its use is never considered memory and some of the reserved memory types are never registered with memblock at all. As memblock data is used to initialize the memory map, we end up with some page structs not being properly initialized. > I think the page_alloc.c need to be more robust and detect at least if > if holes within zones (but ideally all pfn_valid of all struct pages > in system even if beyond the end of the zone) aren't being initialized > in the second stage without relying on the arch code to remember to > call memblock_reserve. I agree that page_alloc.c needs to be more robust, but it anyway needs to rely on some data supplied by arch to know where valid memory is. With SPARSMEM, pfn_valid() only says where memmap exists, it's not necessary there is an actual page frame behind a valid pfn. > In fact it's not clear why memblock_reserve even exists, that > information can be calculated reliably by page_alloc in function of > memblock.memory alone by walking all nodes and all zones. 
It doesn't > even seem to help in destroying the direct mapping, > reserve_bootmem_region just initializes the struct pages so it doesn't > need a special memeblock_reserved to find those ranges. memblock_reserve() is there to allow architectures to mark memory regions as busy so this memory won't be used by buddy as free pages. It could be memory that firmware reported as reserved, memory occupied by the kernel image and initrd, or the early memory allocations kernel does before page allocator is up. > In fact it's scary that codes then does stuff like this trusting the > memblock_reserve is nearly complete information (which obviously isn't > given type 20 doesn't get queued and I got that type 20 in all my systems): > > for_each_reserved_mem_region(i, &start, &end) { > if (addr >= start && addr_end <= end) > return true; > } > > That code in irq-gic-v3-its.c should stop using > for_each_reserved_mem_region and start doing > pfn_valid(addr>>PAGE_SHIFT) if > PageReserved(pfn_to_page(addr>>PAGE_SHIFT)) instead. I think that for coldpluged CPUs this code runs before memmap us set up, so pfn_valid() or PageReserved() are not yet available then. > At best memory.reserved should be calculated automatically by the > page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed > by the e820 caller, instead of adding the memory_reserve call for type > 20 we should delete the memory_reserve function. memory.reserved cannot be calculated automatically. It represents all the memory allocations made before page allocator is up. And as memblock_reserve() is the most basic to allocate memory early at boot we cannot really delete it ;-) As for e820 and type 20, unless it is in memblock, page_alloc.c has no way to properly initialize memmap for it. It can continue to guess, like it does with init_unavailable_memory(). > Thanks, > Andrea >
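To make that list concrete, here is a purely illustrative sketch of the kind of early reservations described above; all symbol names are placeholders, not the kernel's actual ones:

/*
 * Illustrative only: the sort of memblock_reserve() calls made very early
 * at boot, before the page allocator exists.
 */
memblock_reserve(kernel_image_phys, kernel_image_size);	/* kernel text/data */
memblock_reserve(initrd_phys, initrd_size);			/* initrd */
memblock_reserve(fw_reserved_phys, fw_reserved_size);		/* firmware-claimed RAM */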
On 26.11.20 10:36, Mike Rapoport wrote: > On Wed, Nov 25, 2020 at 04:38:16PM -0500, Andrea Arcangeli wrote: >> On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote: >>> I think the very root cause is how e820__memblock_setup() registers >>> memory with memblock: >>> >>> if (entry->type == E820_TYPE_SOFT_RESERVED) >>> memblock_reserve(entry->addr, entry->size); >>> >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) >>> continue; >>> >>> memblock_add(entry->addr, entry->size); >>> >>> From that point the system has inconsistent view of RAM in both >>> memblock.memory and memblock.reserved and, which is then translated to >>> memmap etc. >>> >>> Unfortunately, simply adding all RAM to memblock is not possible as >>> there are systems that for them "the addresses listed in the reserved >>> range must never be accessed, or (as we discovered) even be reachable by >>> an active page table entry" [1]. >>> >>> [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/ >> >> It looks like what's missing is a blockmem_reserve which I don't think >> would interfere at all with the issue above since it won't create >> direct mapping and it'll simply invoke the second stage that wasn't >> invoked here. >> >> I guess this would have a better chance to have the second >> initialization stage run in reserve_bootmem_region and it would likely >> solve the problem without breaking E820_TYPE_RESERVED which is known >> by the kernel: >> >>> if (entry->type == E820_TYPE_SOFT_RESERVED) >>> memblock_reserve(entry->addr, entry->size); >>> >> >> + if (entry->type == 20) >> + memblock_reserve(entry->addr, entry->size); >> >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) >>> continue; >>> >> >> This is however just to show the problem, I didn't check what type 20 >> is. > > I think it's inveneted by your BIOS vendor :) > >> To me it doesn't look the root cause though, the root cause is that if >> you don't call memblock_reserve the page->flags remains uninitialized. > > I didn't mean that root cause is that we don't call memblock_reserve(). > I meant that the root cause is inconsitency in memory representation. > > On most architectures, memblock.memory represents the entire RAM in the > system and memblock.reserved represents memory regions that were > reserved either by the firmware or by the kernel during early boot. > > On x86 the memory that firmware reserved for its use is never considered > memory and some of the reserved memory types are never registered with > memblock at all. > > As memblock data is used to initialize the memory map, we end up with > some page structs not being properly initialized. > >> I think the page_alloc.c need to be more robust and detect at least if >> if holes within zones (but ideally all pfn_valid of all struct pages >> in system even if beyond the end of the zone) aren't being initialized >> in the second stage without relying on the arch code to remember to >> call memblock_reserve. > > I agree that page_alloc.c needs to be more robust, but it anyway needs > to rely on some data supplied by arch to know where valid memory is. > With SPARSMEM, pfn_valid() only says where memmap exists, it's not > necessary there is an actual page frame behind a valid pfn. > >> In fact it's not clear why memblock_reserve even exists, that >> information can be calculated reliably by page_alloc in function of >> memblock.memory alone by walking all nodes and all zones. 
It doesn't >> even seem to help in destroying the direct mapping, >> reserve_bootmem_region just initializes the struct pages so it doesn't >> need a special memeblock_reserved to find those ranges. > > memblock_reserve() is there to allow architectures to mark memory > regions as busy so this memory won't be used by buddy as free pages. It > could be memory that firmware reported as reserved, memory occupied by > the kernel image and initrd, or the early memory allocations kernel does > before page allocator is up. > >> In fact it's scary that codes then does stuff like this trusting the >> memblock_reserve is nearly complete information (which obviously isn't >> given type 20 doesn't get queued and I got that type 20 in all my systems): >> >> for_each_reserved_mem_region(i, &start, &end) { >> if (addr >= start && addr_end <= end) >> return true; >> } >> >> That code in irq-gic-v3-its.c should stop using >> for_each_reserved_mem_region and start doing >> pfn_valid(addr>>PAGE_SHIFT) if >> PageReserved(pfn_to_page(addr>>PAGE_SHIFT)) instead. > > I think that for coldpluged CPUs this code runs before memmap us set up, > so pfn_valid() or PageReserved() are not yet available then. > >> At best memory.reserved should be calculated automatically by the >> page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed >> by the e820 caller, instead of adding the memory_reserve call for type >> 20 we should delete the memory_reserve function. > > memory.reserved cannot be calculated automatically. It represents all > the memory allocations made before page allocator is up. And as > memblock_reserve() is the most basic to allocate memory early at boot we > cannot really delete it ;-) > > As for e820 and type 20, unless it is in memblock, page_alloc.c has no > way to properly initialize memmap for it. It can continue to guess, like > it does with init_unavailable_memory(). AFAIS, init_unavailable_mem() has to logic to setup *any* such struct page properly up - except the node/zid, because that's unknown at that time. It is documented that (big comment above init_unavailable_mem()) which pages won't get initialized again. They *are* initialized, just the node/zone is "fake" - that's the real problem we seem to have. I agree that this is sub-optimal, as such pages are impossible to detect (PageReserved is just not clear as discussed with Andrea). The basic question is how we want to proceed: a) Make sure any online struct page has a valid nid/zid, and is spanned by the nid/zid. b) Use a fake nid that will bail out when used for page_zone() and page_pgdat(), and make pfn walkers detect that. AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner cases. Thoughts? --- The tricky thing for b) is to fix all call places that do a page_zone()/page_pgdat() without checking if they are allowed to do so. We would have to audit all callers of page_zone() / page_zone_id() / page_pgdat() ... E.g., using a pattern like mm/memory_hotplug.c:find_biggest_section_pfn() is fine if (unlikely(!pfn_to_online_page(pfn))) continue; if (unlikely(pfn_to_nid(pfn) != nid)) continue; e.g., mm/page_alloc.c:__pageblock_pfn_to_page() would have to be adjusted start_page = pfn_to_online_page(start_pfn); if (!start_page) return NULL; // necessary to avoid ... if (page_to_nid(page) == NODE_INVALID) return NULL; // page_zone() from bailing out if (page_zone(start_page) != zone) return NULL;
On Wed, Nov 25, 2020 at 10:40:20PM -0500, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote:
>
> Summary: both old code (missing PG_reserved) and the current code
> (leaving the page struct uninitialized and with wrong nodeid/nid) look
> wrong.
>
> Overall this brings more questions:
>
> - why memblock.reserved exists and isn't automatically calculated as
>   inversion of memblock.memory? (or if you prefer: why is the above
>   call to memblock_reserve needed to initialize the memory holes?)

memblock.reserved represents memory areas that were allocated before the
page allocator is up. This could be memory reported by firmware as
reserved, the kernel image, the initrd, and any memory the kernel
allocates before the page allocator is ready.

> - why there's no initialization of the memblock.reserved regions given
>   they exist, was it just an oversight? (this one is fixed by Mike's
>   patch, although I wish it was possible to drop the function
>   memblock_reserve instead)

It was an oversight when I posted 73a6e474cb376921a311786652782155eac2fdf0.
I overlooked the fact that x86 does not consider memory allocated by
firmware as memory and only parts of it are considered reserved.

> - what can we do instead of setting the uninitialized nodeid/nid to
>   0,0 and to reliably detect at boot if some page structure within
>   zones (but ideally also outside the zone boundary for any pfn where
>   pfn_valid returns true) is left uninitialized, as it is happening
>   currently on the e820 type 20 range?

I think that we need to fix the e820 and memblock interaction in the
first place. The "non-RAM" holes reported as various types other than
E820_TYPE_RAM actually seem to be "RAM that firmware grabbed for
itself", so they should be seen by the system as such, like on all
other architectures.

For the regions that cannot be mapped, like those on HPE UV (was it
SGI?) systems, we have MEMBLOCK_NOMAP exactly for that purpose.

Once this is done, we can straighten the logic around memmap
initialization in page_alloc, and I feel it will end up simpler than
today.

> Thanks,
> Andrea
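As a reference for the MEMBLOCK_NOMAP idea mentioned above, a rough sketch; the addresses are placeholders and this is not a proposed patch:

/*
 * Rough sketch: the firmware-owned range is still registered as memory,
 * so memmap/zone/node accounting can see it, but it is marked NOMAP so
 * the kernel never creates a direct mapping for it or touches it.
 */
memblock_add(fw_base, fw_size);
memblock_mark_nomap(fw_base, fw_size);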
On Wed, Nov 25, 2020 at 12:59:58PM -0500, Andrea Arcangeli wrote: > On Wed, Nov 25, 2020 at 10:30:53AM +0000, Mel Gorman wrote: > > On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote: > > > Hello, > > > > > > On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote: > > > > I would hope that is not the case because they are not meant to overlap. > > > > However, if the beginning of the pageblock was not the start of a zone > > > > then the pages would be valid but the pfn would still be outside the > > > > zone boundary. If it was reserved, the struct page is valid but not > > > > suitable for set_pfnblock_flags_mask. However, it is a concern in > > > > general because the potential is there that pages are isolated from the > > > > wrong zone. > > > > > > I guess we have more than one issue to correct in that function > > > because the same BUG_ON reproduced again even with the tentative patch > > > I posted earlier. > > > > > > So my guess is that the problematic reserved page isn't pointed by the > > > min_pfn, but it must have been pointed by the "highest" variable > > > calculated below? > > > > > > if (pfn >= highest) > > > highest = pageblock_start_pfn(pfn); > > > > > > When I looked at where "highest" comes from, it lacks > > > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn). > > > > > > Is that the real bug, which may be fixed by something like this? (untested) > > > > > > > It's plausible as it is a potential source of leaking but as you note > > in another mail, it's surprising to me that valid struct pages, even if > > within memory holes and reserved would have broken node/zone information > > in the page flags. > > I think the patch to add pageblock_pfn_to_page is still needed to cope > with !pfn_valid or a pageblock in between zones, but pfn_valid or > pageblock in between zones is not what happens here. > > So the patch adding pageblock_pfn_to_page would have had the undesired > side effect of hiding the bug so it's best to deal with the other bug > first. > Agreed. This thread has a lot of different directions in it at this point so what I'd hope for is first, a patch that initialises holes with zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a hole, it would be expected the pages are PageReserved. Second, a fix to fast_isolate that forces PFNs returned to always be within the stated zone boundaries. The first is because there are assumptions that without HOLES_IN_ZONE, a true pfn_valid within 1<<(MAX_ORDER-1) means pfn_valid would be true for any PFN within that range. That assumption is relaxed in many cases -- e.g. the page allocator may not care at the moment because of how it orders checks but compaction assumes that pfn_valid within a pageblock means that all PFNs within that pageblock are valid.
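For the second item, one possible shape of the change is sketched below; it is only an illustration of the idea, not necessarily the fix that will be merged, and it assumes the existing free_pfn/migrate_pfn based scan window of fast_isolate_freepages():

/*
 * Sketch: clamp the fast_isolate_freepages() scan window so the candidate
 * pfns can never fall below the zone start, even when free_pfn has come
 * very close to (or crossed) migrate_pfn.
 */
unsigned long distance = cc->free_pfn - cc->migrate_pfn;
unsigned long low_pfn, min_pfn;

low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

low_pfn = max(low_pfn, cc->zone->zone_start_pfn);
min_pfn = max(min_pfn, cc->zone->zone_start_pfn);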
On Wed, Nov 25, 2020 at 03:42:37PM +0100, David Hildenbrand wrote:
> > Now the loop is for the intersection of [zone_start_pfn, zone_end_pfn] with
> > memblock.memory and for x86 reserved ranges are not in memblock.memory,
> > so the memmap for them remains semi-initialized.
>
> As far as I understood Mel, rounding these ranges up/down to cover full
> MAX_ORDER blocks/pageblocks might work.
>

Yes, round down the lower end of the hole and round up the higher end to
the MAX_ORDER boundary for the purposes of having valid zone/node
linkages even if the underlying page is PageReserved.
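In code, the rounding described above would look roughly like this; hole_start_pfn, hole_end_pfn and the helper are made-up names used only for illustration:

/*
 * Sketch: widen the hole so the struct pages initialized for it cover
 * whole MAX_ORDER blocks, keeping zone/node linkage valid for every pfn
 * that compaction may look at within those blocks.
 */
unsigned long start = round_down(hole_start_pfn, MAX_ORDER_NR_PAGES);
unsigned long end = round_up(hole_end_pfn, MAX_ORDER_NR_PAGES);

init_struct_pages_for_hole(start, end);	/* hypothetical helper */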
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote: > On 26.11.20 10:36, Mike Rapoport wrote: > > On Wed, Nov 25, 2020 at 04:38:16PM -0500, Andrea Arcangeli wrote: > > > >> At best memory.reserved should be calculated automatically by the > >> page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed > >> by the e820 caller, instead of adding the memory_reserve call for type > >> 20 we should delete the memory_reserve function. > > > > memory.reserved cannot be calculated automatically. It represents all > > the memory allocations made before page allocator is up. And as > > memblock_reserve() is the most basic to allocate memory early at boot we > > cannot really delete it ;-) > > > > As for e820 and type 20, unless it is in memblock, page_alloc.c has no > > way to properly initialize memmap for it. It can continue to guess, like > > it does with init_unavailable_memory(). > > AFAIS, init_unavailable_mem() has to logic to setup *any* such struct > page properly up - except the node/zid, because that's unknown at that > time. It is documented that (big comment above init_unavailable_mem()) // The comment is a bit inaccurate, but that's another story :) > which pages won't get initialized again. They *are* initialized, just > the node/zone is "fake" - that's the real problem we seem to have. Let's try to merge init_unavailable_memory() into memmap_init(). Than it'll be able to set zone/nid for those nasty pfns that BIOS decided to keep to itself, like in Andrea's case and will also take care of struct pages that do not really have a frame in DRAM, but are there because of arbitrary section size. Something like this: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eaa227a479e4..072e94042a11 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6185,24 +6185,84 @@ static void __meminit zone_init_free_lists(struct zone *zone) } } -void __meminit __weak memmap_init(unsigned long size, int nid, - unsigned long zone, - unsigned long range_start_pfn) +#if !defined(CONFIG_FLAT_NODE_MEM_MAP) +/* + * Only struct pages that are backed by physical memory available to the + * kernel are zeroed and initialized by memmap_init_zone(). + * But, there are some struct pages that are either reserved by firmware or + * do not correspond to physical page frames becuase actual memory bank is + * not a multiple of SECTION_SIZE. Fields of those struct pages may be + * accessed (for example page_to_pfn() on some configuration accesses + * flags) so we must explicitly initialize those struct pages. 
+ */ +static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn, + int zone, int node) { - unsigned long start_pfn, end_pfn; + unsigned long pfn; + u64 pgcnt = 0; + + for (pfn = spfn; pfn < epfn; pfn++) { + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) { + pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + + pageblock_nr_pages - 1; + continue; + } + __init_single_page(pfn_to_page(pfn), pfn, zone, node); + __SetPageReserved(pfn_to_page(pfn)); + pgcnt++; + } + + return pgcnt; +} +#else +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn, + int zone, int node) +{ + return 0; +} +#endif + +void __init __weak memmap_init(unsigned long size, int nid, + unsigned long zone, + unsigned long range_start_pfn) +{ + unsigned long start_pfn, end_pfn, next_pfn = 0; unsigned long range_end_pfn = range_start_pfn + size; + u64 pgcnt = 0; int i; for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn); end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn); + next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn); if (end_pfn > start_pfn) { size = end_pfn - start_pfn; memmap_init_zone(size, nid, zone, start_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); } + + if (next_pfn < start_pfn) + pgcnt += init_unavailable_range(next_pfn, start_pfn, + zone, nid); + next_pfn = end_pfn; } + + /* + * Early sections always have a fully populated memmap for the whole + * section - see pfn_valid(). If the last section has holes at the + * end and that section is marked "online", the memmap will be + * considered initialized. Make sure that memmap has a well defined + * state. + */ + if (next_pfn < range_end_pfn) + pgcnt += init_unavailable_range(next_pfn, range_end_pfn, + zone, nid); + + if (pgcnt) + pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n", + zone_names[zone], pgcnt); + } static int zone_batchsize(struct zone *zone) @@ -6995,88 +7055,6 @@ void __init free_area_init_memoryless_node(int nid) free_area_init_node(nid); } -#if !defined(CONFIG_FLAT_NODE_MEM_MAP) -/* - * Initialize all valid struct pages in the range [spfn, epfn) and mark them - * PageReserved(). Return the number of struct pages that were initialized. - */ -static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) -{ - unsigned long pfn; - u64 pgcnt = 0; - - for (pfn = spfn; pfn < epfn; pfn++) { - if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) { - pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) - + pageblock_nr_pages - 1; - continue; - } - /* - * Use a fake node/zone (0) for now. Some of these pages - * (in memblock.reserved but not in memblock.memory) will - * get re-initialized via reserve_bootmem_region() later. - */ - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); - __SetPageReserved(pfn_to_page(pfn)); - pgcnt++; - } - - return pgcnt; -} - -/* - * Only struct pages that are backed by physical memory are zeroed and - * initialized by going through __init_single_page(). But, there are some - * struct pages which are reserved in memblock allocator and their fields - * may be accessed (for example page_to_pfn() on some configuration accesses - * flags). We must explicitly initialize those struct pages. - * - * This function also addresses a similar issue where struct pages are left - * uninitialized because the physical address range is not covered by - * memblock.memory or memblock.reserved. 
That could happen when memblock - * layout is manually configured via memmap=, or when the highest physical - * address (max_pfn) does not end on a section boundary. - */ -static void __init init_unavailable_mem(void) -{ - phys_addr_t start, end; - u64 i, pgcnt; - phys_addr_t next = 0; - - /* - * Loop through unavailable ranges not covered by memblock.memory. - */ - pgcnt = 0; - for_each_mem_range(i, &start, &end) { - if (next < start) - pgcnt += init_unavailable_range(PFN_DOWN(next), - PFN_UP(start)); - next = end; - } - - /* - * Early sections always have a fully populated memmap for the whole - * section - see pfn_valid(). If the last section has holes at the - * end and that section is marked "online", the memmap will be - * considered initialized. Make sure that memmap has a well defined - * state. - */ - pgcnt += init_unavailable_range(PFN_DOWN(next), - round_up(max_pfn, PAGES_PER_SECTION)); - - /* - * Struct pages that do not have backing memory. This could be because - * firmware is using some of this memory, or for some other reasons. - */ - if (pgcnt) - pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt); -} -#else -static inline void __init init_unavailable_mem(void) -{ -} -#endif /* !CONFIG_FLAT_NODE_MEM_MAP */ - #if MAX_NUMNODES > 1 /* * Figure out the number of possible node ids. @@ -7500,7 +7478,6 @@ void __init free_area_init(unsigned long *max_zone_pfn) /* Initialise every node */ mminit_verify_pageflags_layout(); setup_nr_node_ids(); - init_unavailable_mem(); for_each_online_node(nid) { pg_data_t *pgdat = NODE_DATA(nid); free_area_init_node(nid); > I agree that this is sub-optimal, as such pages are impossible to detect > (PageReserved is just not clear as discussed with Andrea). The basic > question is how we want to proceed: > > a) Make sure any online struct page has a valid nid/zid, and is spanned > by the nid/zid. > b) Use a fake nid that will bail out when used for page_zone() and > page_pgdat(), and make pfn walkers detect that. > > AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner > cases. Thoughts? I'd also prefer (a). The hardware defines what physical addresses correspond to which node, so for any populated DIMM (or soldered DRAM for that matter) we can detect page <-> node relationship. As for the struct pages that just "hang" in the end of a section (your example with 4000M), the addresses of these pages still obey the same rules, so again, we have page <-> node correspondence. The zones are software construct but they also correspond to some hardware defined restrictions - each zone has a maximal PFN that HW allows. Here again we can always know which zone spans that or another page. We'd have to fix a couple of things to get there, though :) > --- > > The tricky thing for b) is to fix all call places that do a > page_zone()/page_pgdat() without checking if they are allowed to do so. > We would have to audit all callers of page_zone() / page_zone_id() / > page_pgdat() ... > > > E.g., using a pattern like > mm/memory_hotplug.c:find_biggest_section_pfn() is fine > > > if (unlikely(!pfn_to_online_page(pfn))) > continue; > if (unlikely(pfn_to_nid(pfn) != nid)) > continue; > > > e.g., mm/page_alloc.c:__pageblock_pfn_to_page() would have to be adjusted > > start_page = pfn_to_online_page(start_pfn); > if (!start_page) > return NULL; > > // necessary to avoid ... 
> if (page_to_nid(page) == NODE_INVALID) > return NULL; > > // page_zone() from bailing out > if (page_zone(start_page) != zone) > return NULL; > > -- > Thanks, > > David / dhildenb >
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> I agree that this is sub-optimal, as such pages are impossible to detect
> (PageReserved is just not clear as discussed with Andrea). The basic
> question is how we want to proceed:
>
> a) Make sure any online struct page has a valid nid/zid, and is spanned
>    by the nid/zid.
> b) Use a fake nid that will bail out when used for page_zone() and
>    page_pgdat(), and make pfn walkers detect that.
>
> AFAIU, Andrea seems to prefer a). I think b) might be easier in corner
> cases. Thoughts?

Yes, that's a good summary. The reason I slightly prefer a) is that it
will perform faster at runtime.

I also saw the proposed patch that embeds a pfn_to_page check in
pfn_valid to report those pfns as invalid; that is especially slow and I
don't like it for that reason. Additional bugchecks for NO_NODEID would
slow things down too, so they should be justified by some benefit. It
concerns me if pfn_valid becomes slower.

b) will also make compaction bail out on that pageblock when it doesn't
need to, so it'll give compaction a worse runtime as well.

a) is what the code already does, provided the e820 map range is
reserved with memblock_reserve, after applying Mike's patch to
initialize the reserved memblock regions too.

It turns out the VM_BUG_ON_PAGE, as far as compaction is concerned, is
just a false positive: it detected a broken VM invariant, but it was
fine to proceed in the DEBUG_VM=n case, which wouldn't crash.

Before da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9 the struct page
corresponding to the e820 unknown type range wouldn't be marked
PageReserved; that also looked wrong but it was basically harmless.

Neither of the two defects is too bad: they could be ignored if we just
removed the bugcheck, but it's just nice if the bugcheck turns out to be
correct within the pageblock. If we initialize all RAM and non-RAM
ranges in the same way, then there's no discrepancy. Mike's patch seems
to do just that by walking the memblock.reserved regions in the same way
as memblock.memory.

NOTE: I would also much prefer b) if only it guaranteed zero runtime
slowdown. (Which is why I especially dislike pfn_valid internally
calling pfn_to_page.)

Thanks,
Andrea
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> memory.reserved cannot be calculated automatically. It represents all
> the memory allocations made before page allocator is up. And as
> memblock_reserve() is the most basic to allocate memory early at boot we
> cannot really delete it ;-)

Well, this explanation fully covers "memory allocated early at boot"
that overlaps with memblock.memory.

Does the E820_TYPE_SOFT_RESERVED range added to memblock.reserved
qualify as "memory allocated early at boot"? Does it overlap any RAM
range added to memblock.memory?

	if (entry->type == E820_TYPE_SOFT_RESERVED)
		memblock_reserve(entry->addr, entry->size);

	if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
		continue;

	memblock_add(entry->addr, entry->size);

To me the above looks like it's being used for something completely
different from reserving "memory allocated early at boot".

Why is there no warning at boot if there's no overlap between
memblock.reserved and memblock.memory?

My question about memblock.reserved is really about the non-overlapping
ranges: why are ranges that don't overlap any memblock.memory region
added to memblock.reserved, and why aren't those calculated
automatically as the inverse of memblock.memory?

It's easy to see that when memblock.reserved fully overlaps
memblock.memory it makes perfect sense and it has to stay. I was really
only thinking of usage like the above of memblock_reserve, which looks
like it should be turned into a noop and deleted.

Thanks,
Andrea
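For illustration only, "calculating it automatically" could mean walking memblock.memory and treating the gaps as the hole ranges; init_hole() below is a made-up helper, not an existing function:

/*
 * Sketch: derive the ranges not covered by memblock.memory by walking
 * memblock.memory itself, instead of requiring the arch code to call
 * memblock_reserve() by hand on non-overlapping holes.
 */
phys_addr_t start, end, next = 0;
u64 i;

for_each_mem_range(i, &start, &end) {
	if (next < start)
		init_hole(PFN_DOWN(next), PFN_UP(start));	/* made-up helper */
	next = end;
}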
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> I think it's inveneted by your BIOS vendor :)
BTW, all systems I use on a daily basis have that type 20... Only two
of them are reproducing the VM_BUG_ON on a weekly basis on v5.9.
If you search 'E820 "type 20"' you'll get plenty of hits, so it's not
just me at least :). In fact my guess is there are probably more
workstations/laptops with that type 20 than without. Maybe it only
shows up when booting with EFI?
Easy to check with `dmesg | grep "type 20"` after boot.
One guess why this wasn't reproduced more often is that some desktop
distros make the mistake of keeping THP at enabled=madvise by default,
which reduces the overall compaction testing. Or maybe they're not all
setting DEBUG_VM=y (but I'm sure some other distro ships v5.9 with
DEBUG_VM=y). I often hit this bug in kcompactd0, for example, which
wouldn't happen with THP enabled=madvise.
The two bpf tracing tools below show how the current defrag=madvise
default only increases the allocation latency from a few usec to a dozen
usec. Only with defrag=always does the latency go up to single-digit
milliseconds, because of the cost of direct compaction, which is only
worth paying for MADV_HUGEPAGE ranges doing long-lived allocations (we
know by now that defrag=always was a suboptimal default).
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp-comm.bp
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp.bp
Since 3917c80280c93a7123f1a3a6dcdb10a3ea19737d, even apps like Redis
that use fork for snapshotting purposes should prefer THP enabled
(besides, it would be better if they used uffd-wp as an alternative to
fork).
3917c80280c93a7123f1a3a6dcdb10a3ea19737d also resolved another concern,
because the decade-old "fork() vs gup/O_DIRECT vs thread" race was
supposed to be unnoticeable from userland as long as the O_DIRECT
minimum I/O granularity was enforced to be >= PAGE_SIZE. With THP-backed
anon memory, however, that minimum granularity requirement increases to
HPAGE_PMD_SIZE. Recent kernels are going in the direction of solving
that race by doing copy-on-write during fork, as was originally proposed
a long time ago
(https://lkml.kernel.org/r/20090311165833.GI27823@random.random), which
I suppose will solve the race at sub-PAGE_SIZE granularity too, but
3917c80280c93a7123f1a3a6dcdb10a3ea19737d alone is enough to reduce the
minimum I/O granularity requirement from HPAGE_PMD_SIZE to PAGE_SIZE,
as some userland may have expected. The best option, of course, is to
fully prevent the race condition by setting MADV_DONTFORK on the regions
under O_DIRECT (as qemu does, for example).
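For completeness, a minimal userland sketch of the MADV_DONTFORK approach; this is generic madvise usage, not code taken from qemu:

#include <sys/mman.h>

/* Keep an O_DIRECT I/O buffer out of fork() children entirely. */
size_t len = 2UL << 20;			/* e.g. one 2MB THP-backed region */
void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (buf != MAP_FAILED)
	madvise(buf, len, MADV_DONTFORK);	/* children won't map this range */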
Overall the only tangible concern left is potentially higher memory
usage for servers handling tiny-object storage freed at PAGE_SIZE
granularity with MADV_DONTNEED (instead of having a way to copy and
defrag the virtual space of small objects through a callback that
updates the pointer to the object...).
Small-object storage relying on jemalloc/tcmalloc for tiny-object
management simply needs to selectively disable THP to avoid wasting
memory, either with MADV_NOHUGEPAGE or with the prctl
PR_SET_THP_DISABLE. Flipping a switch in the OCI schema also allows
disabling THP for those object-storage apps making heavy use of
MADV_DONTNEED; not a single line of code in the app needs changing if
it is deployed through the OCI container runtime.
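As a concrete illustration of the two knobs mentioned above (arena_base and arena_len are placeholders):

#include <sys/mman.h>
#include <sys/prctl.h>

/* Disable THP for the whole process; inherited across fork and execve. */
prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);

/* Or disable THP only for a specific range, e.g. a small-object arena. */
madvise(arena_base, arena_len, MADV_NOHUGEPAGE);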
Recent jemalloc uses MADV_NOHUGEPAGE. I didn't check exactly how it's
used, but I hope it already does the right thing and separates the
small-object arenas, zapped with MADV_DONTNEED at PAGE_SIZE granularity,
from the large-object arenas where THP shall remain enabled. glibc
should also learn to separate small and big objects into different
arenas. This has to be handled by the app, as it is already handled
optimally in scylladb, which in fact invokes MADV_HUGEPAGE; plenty of
other databases use not just THP but also hugetlbfs, which certainly
won't fly if MADV_DONTNEED is attempted at PAGE_SIZE granularity; and
elasticsearch also gets a significant boost from THP, etc.
Thanks,
Andrea
On Thu, Nov 26, 2020 at 01:29:30PM -0500, Andrea Arcangeli wrote: > On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote: > > memory.reserved cannot be calculated automatically. It represents all > > the memory allocations made before page allocator is up. And as > > memblock_reserve() is the most basic to allocate memory early at boot we > > cannot really delete it ;-) > > Well this explanation totally covers "memory allocated early at > boot" that overlaps with memblock.memory. > > Does the E820_TYPE_SOFT_RESERVED range added to memblock.reserve > define as "memory allocated early at boot"? > > Does it overlap ranges added with any RAM added to memblock.memory? > > if (entry->type == E820_TYPE_SOFT_RESERVED) > memblock_reserve(entry->addr, entry->size); > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > continue; > > memblock_add(entry->addr, entry->size); > > To me the above looks it's being used for something completely > different than from reserving "memory allocated early at boot". > > Why there is no warning at boot if there's no overlap between > memblock.resereve and memblock.memory? > My question about memblock.reserve is really about the non overlapping > ranges: why are ranges non overlapping with memblock.memory regions, > added to memblock.reserve, and why aren't those calculated > automatically as reverse of memblock.memory? Once there was this comment in arch/x86/kernel/e820.c: /* * all !E820_TYPE_RAM ranges (including gap ranges) are put * into memblock.reserved to make sure that struct pages in * such regions are not left uninitialized after bootup. */ I presume there were struct pages that corresponded to some unusable memory and they were not initilized, so the solution was to add them to memblock.reserved. > It's easy to see that when memblock.reserve overlaps fully, it makes > perfect sense and it has to stay for it. I was really only thinking at > the usage like above of memblock_reserve that looks like it should be > turned into a noop and deleted. TBH, the whole interaction between e820 and memblock keeps me puzzled and I can only make educated guesses why some ranges here are memblock_reserve()'d and some memblock_add()ed. I think what should be there is that e820 entries that are essentially RAM, used by BIOS or not, should be listed in memblock.memory. Then using memblock_reserve() for parts that BIOS claimed for itself would have the same semantics as for memory allocated by kernel. I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be in memblock.reserved. Than in page_alloc.c we'll know that we have a physical memory bank from 0 to 512M but there are some ranges that we cannot use. I suggested it back then when the issue with compaction was reported at the first time, but Baoquan mentioned that there are systems that cannot even tolerate having BIOS reserved areas in the page tables and I didn't continue to pursue this. Now I'm thinking to resurrect this patch with some additions so that init_mem_mapping could skip such regions. [1] https://lore.kernel.org/lkml/20200528090731.GI20045@MiWiFi-R3L-srv/#t > Thanks, > Andrea >
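Spelling out the 512M example above as hypothetical memblock calls; the addresses and sizes are only illustrative:

/*
 * Hypothetical illustration of the layout described above: the whole DIMM
 * is registered as memory, the BIOS-claimed pieces are merely reserved.
 */
memblock_add(0, SZ_512M);			/* physical bank 0 .. 512M */
memblock_reserve(0x00000, 0x1000);		/* 0x0000  - 0x0fff  */
memblock_reserve(0x9d000, 0x3000);		/* 0x9d000 - 0x9ffff */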
On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote: > TBH, the whole interaction between e820 and memblock keeps me puzzled > and I can only make educated guesses why some ranges here are > memblock_reserve()'d and some memblock_add()ed. The mixed usage in that interaction between memblock.reserve and memblock.memory where sometime it's used to reserve overlapping memblock.memory ranges (clearly ok), and sometimes is used on ranges with no overlap (not clear even why it's being called), is what makes the current code messy. We're basically passing down the exact same information (inverted), through two different APIs, if there is no overlap. > I think what should be there is that e820 entries that are essentially > RAM, used by BIOS or not, should be listed in memblock.memory. Then > using memblock_reserve() for parts that BIOS claimed for itself would > have the same semantics as for memory allocated by kernel. > > I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a > range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be > in memblock.reserved. > > Than in page_alloc.c we'll know that we have a physical memory bank from > 0 to 512M but there are some ranges that we cannot use. > > I suggested it back then when the issue with compaction was reported at > the first time, but Baoquan mentioned that there are systems that cannot > even tolerate having BIOS reserved areas in the page tables and I didn't > continue to pursue this. That explains why we can't add the e820 non-RAM regions to memblock_add, what's not clear is why it should be required to call memblock_reserve on a region that has no overlap with any memblock_add. Instead of the patch that adds a walk to the memblock.reserve and that requires adding even more memblock_reserve to e820__memblock_setup for type 20, we can add a walk for the memblock.memory holes and then we can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too. Thanks, Andrea
On Thu, Nov 26, 2020 at 03:30:01PM -0500, Andrea Arcangeli wrote: > On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote: > > TBH, the whole interaction between e820 and memblock keeps me puzzled > > and I can only make educated guesses why some ranges here are > > memblock_reserve()'d and some memblock_add()ed. > > The mixed usage in that interaction between memblock.reserve and > memblock.memory where sometime it's used to reserve overlapping > memblock.memory ranges (clearly ok), and sometimes is used on ranges > with no overlap (not clear even why it's being called), is what makes > the current code messy. > > We're basically passing down the exact same information (inverted), > through two different APIs, if there is no overlap. > > > I think what should be there is that e820 entries that are essentially > > RAM, used by BIOS or not, should be listed in memblock.memory. Then > > using memblock_reserve() for parts that BIOS claimed for itself would > > have the same semantics as for memory allocated by kernel. > > > > I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a > > range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be > > in memblock.reserved. > > > > Than in page_alloc.c we'll know that we have a physical memory bank from > > 0 to 512M but there are some ranges that we cannot use. > > > > I suggested it back then when the issue with compaction was reported at > > the first time, but Baoquan mentioned that there are systems that cannot > > even tolerate having BIOS reserved areas in the page tables and I didn't > > continue to pursue this. > > That explains why we can't add the e820 non-RAM regions to > memblock_add, what's not clear is why it should be required to call > memblock_reserve on a region that has no overlap with any memblock_add. > > Instead of the patch that adds a walk to the memblock.reserve and that > requires adding even more memblock_reserve to e820__memblock_setup for > type 20, we can add a walk for the memblock.memory holes and then we > can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too. This is more or less what I have done here: https://lore.kernel.org/lkml/20201126174601.GT123287@linux.ibm.com/ just without the removal of the call to memblock_reserve() for E820_TYPE_SOFT_RESERVED. > Thanks, > Andrea >
Hello Andrea, On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote: > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote: > > Let's try to merge init_unavailable_memory() into memmap_init(). > Than it'll be able to set zone/nid for those nasty pfns that BIOS > decided to keep to itself, like in Andrea's case and will also take care > of struct pages that do not really have a frame in DRAM, but are there > because of arbitrary section size. Did you have a chance to check if this solves your issue? If yes, I'll resend this as a formal patch. > Something like this: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index eaa227a479e4..072e94042a11 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6185,24 +6185,84 @@ static void __meminit zone_init_free_lists(struct zone *zone) > } > } > > -void __meminit __weak memmap_init(unsigned long size, int nid, > - unsigned long zone, > - unsigned long range_start_pfn) > +#if !defined(CONFIG_FLAT_NODE_MEM_MAP) > +/* > + * Only struct pages that are backed by physical memory available to the > + * kernel are zeroed and initialized by memmap_init_zone(). > + * But, there are some struct pages that are either reserved by firmware or > + * do not correspond to physical page frames becuase actual memory bank is > + * not a multiple of SECTION_SIZE. Fields of those struct pages may be > + * accessed (for example page_to_pfn() on some configuration accesses > + * flags) so we must explicitly initialize those struct pages. > + */ > +static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn, > + int zone, int node) > { > - unsigned long start_pfn, end_pfn; > + unsigned long pfn; > + u64 pgcnt = 0; > + > + for (pfn = spfn; pfn < epfn; pfn++) { > + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) { > + pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) > + + pageblock_nr_pages - 1; > + continue; > + } > + __init_single_page(pfn_to_page(pfn), pfn, zone, node); > + __SetPageReserved(pfn_to_page(pfn)); > + pgcnt++; > + } > + > + return pgcnt; > +} > +#else > +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn, > + int zone, int node) > +{ > + return 0; > +} > +#endif > + > +void __init __weak memmap_init(unsigned long size, int nid, > + unsigned long zone, > + unsigned long range_start_pfn) > +{ > + unsigned long start_pfn, end_pfn, next_pfn = 0; > unsigned long range_end_pfn = range_start_pfn + size; > + u64 pgcnt = 0; > int i; > > for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn); > end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn); > + next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn); > > if (end_pfn > start_pfn) { > size = end_pfn - start_pfn; > memmap_init_zone(size, nid, zone, start_pfn, > MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); > } > + > + if (next_pfn < start_pfn) > + pgcnt += init_unavailable_range(next_pfn, start_pfn, > + zone, nid); > + next_pfn = end_pfn; > } > + > + /* > + * Early sections always have a fully populated memmap for the whole > + * section - see pfn_valid(). If the last section has holes at the > + * end and that section is marked "online", the memmap will be > + * considered initialized. Make sure that memmap has a well defined > + * state. 
> + */ > + if (next_pfn < range_end_pfn) > + pgcnt += init_unavailable_range(next_pfn, range_end_pfn, > + zone, nid); > + > + if (pgcnt) > + pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n", > + zone_names[zone], pgcnt); > + > } > > static int zone_batchsize(struct zone *zone) > @@ -6995,88 +7055,6 @@ void __init free_area_init_memoryless_node(int nid) > free_area_init_node(nid); > } > > -#if !defined(CONFIG_FLAT_NODE_MEM_MAP) > -/* > - * Initialize all valid struct pages in the range [spfn, epfn) and mark them > - * PageReserved(). Return the number of struct pages that were initialized. > - */ > -static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) > -{ > - unsigned long pfn; > - u64 pgcnt = 0; > - > - for (pfn = spfn; pfn < epfn; pfn++) { > - if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) { > - pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) > - + pageblock_nr_pages - 1; > - continue; > - } > - /* > - * Use a fake node/zone (0) for now. Some of these pages > - * (in memblock.reserved but not in memblock.memory) will > - * get re-initialized via reserve_bootmem_region() later. > - */ > - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > - __SetPageReserved(pfn_to_page(pfn)); > - pgcnt++; > - } > - > - return pgcnt; > -} > - > -/* > - * Only struct pages that are backed by physical memory are zeroed and > - * initialized by going through __init_single_page(). But, there are some > - * struct pages which are reserved in memblock allocator and their fields > - * may be accessed (for example page_to_pfn() on some configuration accesses > - * flags). We must explicitly initialize those struct pages. > - * > - * This function also addresses a similar issue where struct pages are left > - * uninitialized because the physical address range is not covered by > - * memblock.memory or memblock.reserved. That could happen when memblock > - * layout is manually configured via memmap=, or when the highest physical > - * address (max_pfn) does not end on a section boundary. > - */ > -static void __init init_unavailable_mem(void) > -{ > - phys_addr_t start, end; > - u64 i, pgcnt; > - phys_addr_t next = 0; > - > - /* > - * Loop through unavailable ranges not covered by memblock.memory. > - */ > - pgcnt = 0; > - for_each_mem_range(i, &start, &end) { > - if (next < start) > - pgcnt += init_unavailable_range(PFN_DOWN(next), > - PFN_UP(start)); > - next = end; > - } > - > - /* > - * Early sections always have a fully populated memmap for the whole > - * section - see pfn_valid(). If the last section has holes at the > - * end and that section is marked "online", the memmap will be > - * considered initialized. Make sure that memmap has a well defined > - * state. > - */ > - pgcnt += init_unavailable_range(PFN_DOWN(next), > - round_up(max_pfn, PAGES_PER_SECTION)); > - > - /* > - * Struct pages that do not have backing memory. This could be because > - * firmware is using some of this memory, or for some other reasons. > - */ > - if (pgcnt) > - pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt); > -} > -#else > -static inline void __init init_unavailable_mem(void) > -{ > -} > -#endif /* !CONFIG_FLAT_NODE_MEM_MAP */ > - > #if MAX_NUMNODES > 1 > /* > * Figure out the number of possible node ids. 
> @@ -7500,7 +7478,6 @@ void __init free_area_init(unsigned long *max_zone_pfn) > /* Initialise every node */ > mminit_verify_pageflags_layout(); > setup_nr_node_ids(); > - init_unavailable_mem(); > for_each_online_node(nid) { > pg_data_t *pgdat = NODE_DATA(nid); > free_area_init_node(nid); > > > > I agree that this is sub-optimal, as such pages are impossible to detect > > (PageReserved is just not clear as discussed with Andrea). The basic > > question is how we want to proceed: > > > > a) Make sure any online struct page has a valid nid/zid, and is spanned > > by the nid/zid. > > b) Use a fake nid that will bail out when used for page_zone() and > > page_pgdat(), and make pfn walkers detect that. > > > > AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner > > cases. Thoughts? > > I'd also prefer (a). > > The hardware defines what physical addresses correspond to which node, > so for any populated DIMM (or soldered DRAM for that matter) we can > detect page <-> node relationship. > > As for the struct pages that just "hang" in the end of a section (your > example with 4000M), the addresses of these pages still obey the same > rules, so again, we have page <-> node correspondence. > > The zones are software construct but they also correspond to some > hardware defined restrictions - each zone has a maximal PFN that HW > allows. Here again we can always know which zone spans that or another > page. > > We'd have to fix a couple of things to get there, though :) > > > --- > > > > The tricky thing for b) is to fix all call places that do a > > page_zone()/page_pgdat() without checking if they are allowed to do so. > > We would have to audit all callers of page_zone() / page_zone_id() / > > page_pgdat() ... > > > > > > E.g., using a pattern like > > mm/memory_hotplug.c:find_biggest_section_pfn() is fine > > > > > > if (unlikely(!pfn_to_online_page(pfn))) > > continue; > > if (unlikely(pfn_to_nid(pfn) != nid)) > > continue; > > > > > > e.g., mm/page_alloc.c:__pageblock_pfn_to_page() would have to be adjusted > > > > start_page = pfn_to_online_page(start_pfn); > > if (!start_page) > > return NULL; > > > > // necessary to avoid ... > > if (page_to_nid(page) == NODE_INVALID) > > return NULL; > > > > // page_zone() from bailing out > > if (page_zone(start_page) != zone) > > return NULL; > > > > -- > > Thanks, > > > > David / dhildenb > > > > -- > Sincerely yours, > Mike.
Hello Mike,

On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> Hello Andrea,
>
> On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> >
> > Let's try to merge init_unavailable_memory() into memmap_init().
> > Than it'll be able to set zone/nid for those nasty pfns that BIOS
> > decided to keep to itself, like in Andrea's case and will also take care
> > of struct pages that do not really have a frame in DRAM, but are there
> > because of arbitrary section size.
>
> Did you have a chance to check if this solves your issue?
> If yes, I'll resend this as a formal patch.

I tested the patch you sent, but it doesn't seem to boot. Reverting it
booted.

I also noticed that leaving pages uninitialized already happened before
and was fixed in 124049decbb121ec32742c94fb5d9d6bed8f24d8, where
literally all holes were registered in memblock_reserve() by hand to
work around the very same defect in the memblock callee that we're
fixing again. That was then lifted in
9fd61bc95130d4971568b89c9548b5e0a4e18e0e since memblock was supposed to
be able to initialize all pages.

Since this seems to be the second time this happens, I'd suggest
changing the graceful pfn, 0, 0 initialization (i.e. the
__init_single_page(page, pfn, 0, 0) fallback) into a memset(page, 0xff,
sizeof(struct page)) too, like we mentioned earlier, and then having at
least a DEBUG_SOMETHING=y option that searches for struct pages with all
1s in page->flags, to ensure the boot stage worked right and perhaps
give a graceful notification at boot before a VM_BUG_ON triggers later.
The page struct validation could be done based on DEBUG_VM=y too, since
it won't cause any runtime cost post boot.

Thanks,
Andrea
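A sketch of the kind of DEBUG_VM-gated validation suggested above; the hook point and function name are made up, while PagePoisoned()/page_init_poison() already exist to build on:

/*
 * Sketch: after memmap initialization, warn about any struct page whose
 * flags still carry the all-ones poison pattern, i.e. that was never
 * initialized by __init_single_page().
 */
static void __init check_memmap_initialized(void)
{
	unsigned long pfn;

	if (!IS_ENABLED(CONFIG_DEBUG_VM))
		return;

	for (pfn = 0; pfn < max_pfn; pfn++) {
		if (!pfn_valid(pfn))
			continue;
		if (PagePoisoned(pfn_to_page(pfn)))
			pr_warn("struct page for pfn %#lx was never initialized\n",
				pfn);
	}
}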
Hello Andrea,

On Tue, Dec 01, 2020 at 07:44:51PM -0500, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> > Hello Andrea,
> > 
> > On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> > > 
> > > Let's try to merge init_unavailable_memory() into memmap_init().
> > > Then it'll be able to set zone/nid for those nasty pfns that BIOS
> > > decided to keep to itself, like in Andrea's case, and will also take
> > > care of struct pages that do not really have a frame in DRAM, but are
> > > there because of arbitrary section size.
> > 
> > Did you have a chance to check if this solves your issue?
> > If yes, I'll resend this as a formal patch.
> 
> I tested the patch you sent, but it doesn't seem to boot. Reverting it
> booted.

Hmm, do you have any logs? It worked on my box and with various memory
configurations in qemu.

> Also I noticed that leaving pages uninitialized already happened before
> and was fixed in 124049decbb121ec32742c94fb5d9d6bed8f24d8, where
> literally all holes were registered in memblock_reserve() by hand to
> work around this very same defect in the memblock callee we're fixing
> again.
> 
> That was then lifted in 9fd61bc95130d4971568b89c9548b5e0a4e18e0e, since
> memblock was supposed to be able to initialize all pages on its own.

I still think that both 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved") and 9fd61bc95130 ("Revert "x86/e820:
put !E820_TYPE_RAM regions into memblock.reserved") are band-aids: they
do not address the root cause but rather try to work around it.

The problem is that we have no proper way to initialize struct pages
corresponding to holes. The holes can appear when the memory size is
not a multiple of SECTION_SIZE and, at least on x86, when BIOS reserves
pages that are not E820_TYPE_RAM.

The first case is not much of a problem, as I doubt there are still
systems whose memory banks are not MAX_ORDER aligned, so the "trailing"
pages in a semi-empty section will never be used by the page allocator.

The BIOS case, however, is more problematic, as it creates holes inside
MAX_ORDER aligned pageblocks. Pushing the pages in such holes in and
out of memblock.reserved won't really help, because they are not in
memblock.memory, while most calculations during initialization of the
page allocator and the memory map use memblock.memory and iterate over
it. As these holes do not seem to appear on zone/node boundaries, we do
not see reports about wrong zone/node sizing caused by them.

I believe that the proper solution would be to have all the memory
reserved by the firmware added to memblock.memory, like all other
architectures do, but I'm not sure if we can easily and reliably
determine which of the E820 reserved types are actually backed by RAM.
So this is not feasible as a short term solution.

My patch [1], though, is an immediate improvement and I think it's
worth trying to fix the off-by-ones that prevent your system from
booting.

[1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org

> Since this seems the second time this happens, I'd suggest changing the
> graceful "pfn, 0, 0" initialization to memset(page, 0xff, sizeof(struct
> page)) too, like we mentioned earlier, and then having at least a
> DEBUG_SOMETHING=y option that searches for struct pages with all 1s in
> page->flags, to ensure the boot stage worked right, so that there's a
> graceful notification at boot before a VM_BUG_ON triggers later. The
> page struct validation could be done with DEBUG_VM=y too, since it
> won't cause any runtime cost post boot.
> 
> Thanks,
> Andrea
On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote:
> Hmm, do you have any logs? It worked on my box and with various memory
> configurations in qemu.

It crashes in reserve_bootmem_region, so no logs at all since the
serial console isn't active yet.

> I believe that the proper solution would be to have all the memory
> reserved by the firmware added to memblock.memory, like all other
> architectures do, but I'm not sure if we can easily and reliably
> determine which of the E820 reserved types are actually backed by RAM.
> So this is not feasible as a short term solution.
> 
> My patch [1], though, is an immediate improvement and I think it's
> worth trying to fix the off-by-ones that prevent your system from
> booting.
> 
> [1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org

Yes, that's the patch I applied.

Relevant points after debugging:

1) Can you try again with DEBUG_VM=y?

2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct
   page)) on all struct pages allocated, exactly what I was suggesting
   in the previous email. I wonder why we're not doing that even with
   DEBUG_VM=n; perhaps it's too slow for TiB systems. See
   page_init_poison().

3) Given 2), your patch below, by deleting the "0, 0" initialization,
   achieves the debug feature of keeping PagePoison forever on all
   uninitialized pages; think of PagePoison as really meaning
   NO_NODEID/NO_ZONEID, which doesn't need handling other than a crash.

   -		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);

4) Because of the feature in 3) I now hit the PagePoison VM_BUG_ON,
   because pfn 0 wasn't initialized when reserve_bootmem_region tries
   to call __SetPageReserved on a PagePoison page.

5) pfn 0 is the classical case where pfn 0 is in a reserved range in
   memblock.reserved that doesn't overlap any memblock.memory range.

   The memblock_start_of_DRAM moves the start of the DMA zone above
   pfn 0, so pfn 0 already ends up in the no-zone land David mentioned,
   where it's not trivial to decide whether to give it a zoneid if it's
   not even spanned by the zone.

   Not just the zoneid, there's not even a nid.

   So we have a pfn with no zoneid, and no nid, not part of the zone
   ranges, not part of the nid ranges, and not part of memblock.memory.

   We can't really skip the initialization of pfn 0: it has to get the
   zoneid 0 treatment, or if pfn 1 ends up in compaction code we'd
   crash again (although we'd crash with a nice PagePoison(page) ==
   true behavior).

The below fixes the issue and it boots fine again, and the compaction
crash should be solved with both patches applied.
DMA      zone_start_pfn 0     zone_end_pfn() 4096    contiguous 1
DMA32    zone_start_pfn 4096  zone_end_pfn() 524252  contiguous 1
Normal   zone_start_pfn 0     zone_end_pfn() 0       contiguous 0
Movable  zone_start_pfn 0     zone_end_pfn() 0       contiguous 0

500246 0x7a216000 0x1fff000000001000 reserved True pfn_valid True
500247 0x7a217000 0x1fff000000000000 reserved False pfn_valid True
500248 0x7a218000 0x1fff000000000000 reserved False pfn_valid True

==== quote from previous email ====

73a6e474cb376921a311786652782155eac2fdf0 this changed it:

DMA      zone_start_pfn 1     zone_end_pfn() 4096    contiguous 1
DMA32    zone_start_pfn 4096  zone_end_pfn() 524252  contiguous 0
Normal   zone_start_pfn 0     zone_end_pfn() 0       contiguous 0
Movable  zone_start_pfn 0     zone_end_pfn() 0       contiguous 0

500246 0x7a216000  0xfff000000001000 reserved True
500247 0x7a217000 0x1fff000000000000 reserved False
500248 0x7a218000 0x1fff000000010200 reserved False
]

==== quote from previous email ====

From 89469f063c192ae70aea0bd6e1e2a7e99894e5b6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Wed, 2 Dec 2020 23:23:26 -0500
Subject: [PATCH 1/1] mm: initialize struct pages in reserved regions outside
 of the zone ranges

pfn 0 wasn't initialized and the PagePoison remained set when
reserve_bootmem_region() called __SetPageReserved, inducing a silent
boot failure with DEBUG_VM (and correctly so, because the crash
signaled the nodeid/nid of pfn 0 would be again wrong).

Without this change, pfn 0, being part of a memblock.reserved range,
isn't in any zone spanned range nor in any nid spanned range, and when
we initialize the memblock.memory holes pfn 0 won't be included because
it has no nid and no zoneid.

There's no enforcement that all memblock.reserved ranges must overlap
memblock.memory ranges, so the memblock.reserved ranges also require
an explicit initialization, and the zone and nid ranges need to be
extended to include all memblock.reserved ranges with struct pages, or
they'll be left uninitialized with PagePoison as it happened to pfn 0.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/memblock.h | 17 +++++++++---
 mm/debug.c               |  3 ++-
 mm/memblock.c            |  4 +--
 mm/page_alloc.c          | 57 ++++++++++++++++++++++++++++++++--------
 4 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ef131255cedc..c8e30cd69564 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -251,7 +251,8 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
                            unsigned long *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
-                          unsigned long *out_end_pfn, int *out_nid);
+                          unsigned long *out_end_pfn, int *out_nid,
+                          struct memblock_type *type);
 
 /**
  * for_each_mem_pfn_range - early memory pfn range iterator
@@ -263,9 +264,17 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
  *
  * Walks over configured memory ranges.
  */
-#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid) \
-	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
-	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid) \
+	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.memory); \
+	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.memory))
+
+#define for_each_res_pfn_range(i, nid, p_start, p_end, p_nid) \
+	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.reserved); \
+	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.reserved))
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
diff --git a/mm/debug.c b/mm/debug.c
index ccca576b2899..6a1d534f5ffc 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -64,7 +64,8 @@ void __dump_page(struct page *page, const char *reason)
	 * dump_page() when detected.
	 */
	if (page_poisoned) {
-		pr_warn("page:%px is uninitialized and poisoned", page);
+		pr_warn("page:%px pfn:%ld is uninitialized and poisoned",
+			page, page_to_pfn(page));
		goto hex_only;
	}
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b68ee86788af..3964a5e8914f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1198,9 +1198,9 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid,
  */
 void __init_memblock __next_mem_pfn_range(int *idx, int nid,
				unsigned long *out_start_pfn,
-				unsigned long *out_end_pfn, int *out_nid)
+				unsigned long *out_end_pfn, int *out_nid,
+				struct memblock_type *type)
 {
-	struct memblock_type *type = &memblock.memory;
	struct memblock_region *r;
	int r_nid;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce2bdaabdf96..3eed49598d66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1458,6 +1458,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 {
	pg_data_t *pgdat;
	int nid, zid;
+	bool found = false;
 
	if (!early_page_uninitialised(pfn))
		return;
@@ -1468,10 +1469,15 @@ static void __meminit init_reserved_page(unsigned long pfn)
	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];
 
-		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) {
+			found = true;
			break;
+		}
	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	if (likely(found))
+		__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	else
+		WARN_ON_ONCE(1);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -6227,7 +6233,7 @@ void __init __weak memmap_init(unsigned long size, int nid,
			       unsigned long zone,
			       unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn, next_pfn = 0;
+	unsigned long start_pfn, end_pfn, prev_pfn = 0;
	unsigned long range_end_pfn = range_start_pfn + size;
	u64 pgcnt = 0;
	int i;
@@ -6235,7 +6241,7 @@ void __init __weak memmap_init(unsigned long size, int nid,
	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
-		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
+		prev_pfn = clamp(prev_pfn, range_start_pfn, range_end_pfn);
 
		if (end_pfn > start_pfn) {
			size = end_pfn - start_pfn;
@@ -6243,10 +6249,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
		}
 
-		if (next_pfn < start_pfn)
-			pgcnt += init_unavailable_range(next_pfn, start_pfn,
-							zone, nid);
-		next_pfn = end_pfn;
+		if (prev_pfn < start_pfn)
+			pgcnt += init_unavailable_range(prev_pfn, start_pfn,
+							zone, nid);
+		prev_pfn = end_pfn;
	}
 
	/*
@@ -6256,12 +6262,31 @@ void __init __weak memmap_init(unsigned long size, int nid,
	 * considered initialized. Make sure that memmap has a well defined
	 * state.
	 */
-	if (next_pfn < range_end_pfn)
-		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+	if (prev_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(prev_pfn, range_end_pfn,
						zone, nid);
 
+	/*
+	 * memblock.reserved isn't enforced to overlap with
+	 * memblock.memory so initialize the struct pages for
+	 * memblock.reserved too in case it wasn't overlapping.
+	 *
+	 * If any struct page associated with a memblock.reserved
+	 * range isn't overlapping with a zone range, it'll be left
+	 * uninitialized, ideally with PagePoison, and it'll be a more
+	 * easily detectable error.
+	 */
+	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+
+		if (end_pfn > start_pfn)
+			pgcnt += init_unavailable_range(start_pfn, end_pfn,
+							zone, nid);
+	}
+
	if (pgcnt)
-		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
+		pr_info("%s: pages in unavailable ranges: %lld\n",
			zone_names[zone], pgcnt);
 }
 
@@ -6499,6 +6524,10 @@ void __init get_pfn_range_for_nid(unsigned int nid,
		*start_pfn = min(*start_pfn, this_start_pfn);
		*end_pfn = max(*end_pfn, this_end_pfn);
	}
+	for_each_res_pfn_range(i, nid, &this_start_pfn, &this_end_pfn, NULL) {
+		*start_pfn = min(*start_pfn, this_start_pfn);
+		*end_pfn = max(*end_pfn, this_end_pfn);
+	}
 
	if (*start_pfn == -1UL)
		*start_pfn = 0;
@@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignment(void)
  */
 unsigned long __init find_min_pfn_with_active_regions(void)
 {
-	return PHYS_PFN(memblock_start_of_DRAM());
+	/*
+	 * reserved regions must be included so that their page
+	 * structure can be part of a zone and obtain a valid zoneid
+	 * before __SetPageReserved().
+	 */
+	return min(PHYS_PFN(memblock_start_of_DRAM()),
+		   PHYS_PFN(memblock.reserved.regions[0].base));
 }
 
 /*
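To make the find_min_pfn_with_active_regions() hunk concrete with the
layout described in this thread: if pfn 0 only shows up in
memblock.reserved (region base 0x0) while memblock_start_of_DRAM()
returns 0x1000, the unpatched code computes a minimum pfn of 1, whereas
the patched code returns min(1, 0) = 0, so the DMA zone span grows to
cover pfn 0 and reserve_bootmem_region() no longer trips over a
poisoned struct page. (The exact addresses here are inferred from the
discussion, not taken from a posted e820 dump.)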
On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote:
> > Hmm, do you have any logs? It worked on my box and with various memory
> > configurations in qemu.
> 
> It crashes in reserve_bootmem_region, so no logs at all since the
> serial console isn't active yet.
> 
> > I believe that the proper solution would be to have all the memory
> > reserved by the firmware added to memblock.memory, like all other
> > architectures do, but I'm not sure if we can easily and reliably
> > determine which of the E820 reserved types are actually backed by RAM.
> > So this is not feasible as a short term solution.
> > 
> > My patch [1], though, is an immediate improvement and I think it's
> > worth trying to fix the off-by-ones that prevent your system from
> > booting.
> > 
> > [1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org
> 
> Yes, that's the patch I applied.
> 
> Relevant points after debugging:
> 
> 1) Can you try again with DEBUG_VM=y?

Mea culpa :)

> 2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct
>    page)) on all struct pages allocated, exactly what I was suggesting
>    in the previous email. I wonder why we're not doing that even with
>    DEBUG_VM=n; perhaps it's too slow for TiB systems. See
>    page_init_poison().
> 
> 3) Given 2), your patch below, by deleting the "0, 0" initialization,
>    achieves the debug feature of keeping PagePoison forever on all
>    uninitialized pages; think of PagePoison as really meaning
>    NO_NODEID/NO_ZONEID, which doesn't need handling other than a crash.
> 
>    -		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> 
> 4) Because of the feature in 3) I now hit the PagePoison VM_BUG_ON,
>    because pfn 0 wasn't initialized when reserve_bootmem_region tries
>    to call __SetPageReserved on a PagePoison page.

So this is the off-by-one I was talking about. My patch removed the
initialization of the hole before memblock.memory.

> 5) pfn 0 is the classical case where pfn 0 is in a reserved range in
>    memblock.reserved that doesn't overlap any memblock.memory range.

And, IMHO, this is the fundamental BUG.

There is RAM at address 0, there is a DIMM there, so why on earth
should this not be a part of memblock.memory?

>    The memblock_start_of_DRAM moves the start of the DMA zone above
>    pfn 0, so pfn 0 already ends up in the no-zone land David mentioned,
>    where it's not trivial to decide whether to give it a zoneid if it's
>    not even spanned by the zone.
> 
>    Not just the zoneid, there's not even a nid.
> 
>    So we have a pfn with no zoneid, and no nid, not part of the zone
>    ranges, not part of the nid ranges, and not part of memblock.memory.

We have a pfn that should have been in memblock.memory, in ZONE_DMA and
in the first node with memory. If it was not trimmed from there

>    We can't really skip the initialization of pfn 0: it has to get the
>    zoneid 0 treatment, or if pfn 1 ends up in compaction code we'd
>    crash again (although we'd crash with a nice PagePoison(page) ==
>    true behavior).

Agree. The struct page for pfn 0 should get the same zoneid and nodeid
as pfn 1.

For me the below fixup worked (both with and without DEBUG_VM=y).
From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Thu, 3 Dec 2020 11:40:17 +0200
Subject: [PATCH] fixup for "mm: refactor initialization of stuct page for holes"

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/page_alloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce2bdaabdf96..86fde4424e87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
			       unsigned long zone,
			       unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn, next_pfn = 0;
+	static unsigned long hole_start_pfn;
+	unsigned long start_pfn, end_pfn;
	unsigned long range_end_pfn = range_start_pfn + size;
	u64 pgcnt = 0;
	int i;
@@ -6235,7 +6236,6 @@ void __init __weak memmap_init(unsigned long size, int nid,
	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
-		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
 
		if (end_pfn > start_pfn) {
			size = end_pfn - start_pfn;
@@ -6243,10 +6243,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
		}
 
-		if (next_pfn < start_pfn)
-			pgcnt += init_unavailable_range(next_pfn, start_pfn,
-							zone, nid);
-		next_pfn = end_pfn;
+		if (hole_start_pfn < start_pfn)
+			pgcnt += init_unavailable_range(hole_start_pfn,
+							start_pfn, zone, nid);
+		hole_start_pfn = end_pfn;
	}
 
	/*
@@ -6256,8 +6256,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
	 * considered initialized. Make sure that memmap has a well defined
	 * state.
	 */
-	if (next_pfn < range_end_pfn)
-		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+	if (hole_start_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
						zone, nid);
 
	if (pgcnt)
On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > 5) pfn 0 is the classical case where pfn 0 is in a reserved range in
> >    memblock.reserved that doesn't overlap any memblock.memory range.
> 
> And, IMHO, this is the fundamental BUG.

We have a dozen arches that change all the time, new efi/e820 stuff,
always new BIOS configs; the memblock code must be able to cope with
the caller being wrong or changing behavior if the e820 map changes
behaviour.

Trying to work around memblock core deficiencies in the caller is what
124049decbb121ec32742c94fb5d9d6bed8f24d8 already did, and it doesn't
work and it's not even a clean thing to do.

In fact here there's really nothing to blame the caller for (i.e. the
e820 init code), unless you declare that there must always be overlap
(which I understood would break stuff for regions that must not have a
direct mapping).

The caller here is correct and it can be summarized as below:

	if (e820 map type != system_ram)
		memblock_reserve(range)
	else
		memblock_add(range)

Then:

	zone_boundaries = [ 16M, 4G, 0 ]
	free_area_init(zone_boundaries)

It's not the caller's fault if the e820 map in the BIOS starts with:

	pfn 0-1   reserved
	pfn 1-N   system RAM

This is better fixed in a way that must not require any change in the
caller...

> There is RAM at address 0, there is a DIMM there, so why on earth
> should this not be a part of memblock.memory?

How can you blame the caller if you explicitly didn't require that
.reserved ranges must always overlap .memory ranges? In fact it was
explained as a feature to avoid the direct mapping.

Declaring that there must always be overlap would be one way to cope
with this issue, but even then memblock must detect a wrong caller,
work around it by doing a memblock_add call inside memblock_reserve
and by printing a warning to improve the caller. It shouldn't crash at
boot with the console off if the overlap is not there.

In my view the caller here is not to blame; all these issues arise
because memblock needs improvement and must cope with any caller.

> >    The memblock_start_of_DRAM moves the start of the DMA zone above
> >    pfn 0, so pfn 0 already ends up in the no-zone land David mentioned,
> >    where it's not trivial to decide whether to give it a zoneid if it's
> >    not even spanned by the zone.
> > 
> >    Not just the zoneid, there's not even a nid.
> > 
> >    So we have a pfn with no zoneid, and no nid, not part of the zone
> >    ranges, not part of the nid ranges, and not part of memblock.memory.
> 
> We have a pfn that should have been in memblock.memory, in ZONE_DMA and
> in the first node with memory. If it was not trimmed from there

My incremental patch already solves how to extend the zone spans and
the nid spans by taking memblock.reserved into account, not just
memblock.memory, but still without actually messing with the direct
mapping, otherwise I would have called memblock_add instead of walking
memblock.reserved when detecting the nid and zoneid ranges.

> >    We can't really skip the initialization of pfn 0: it has to get the
> >    zoneid 0 treatment, or if pfn 1 ends up in compaction code we'd
> >    crash again (although we'd crash with a nice PagePoison(page) ==
> >    true behavior).
> 
> Agree. The struct page for pfn 0 should get the same zoneid and nodeid
> as pfn 1.

How are you sure there's no zone ID 0 that starts at pfn 0 and ends at
pfn 1, with the DMA zone starting at pfn 1? In such a case pfn 0 would
have a zoneid different from pfn 1.

I'm not exactly sure why we should initialize pfn 0 with the zoneid of
a zone that pfn 0 is not part of, that looks wrong.

I'm not exactly sure why you don't fix the zone->zone_start_pfn and
the respective nid node_start_pfn to start from pfn 0 instead, just
like I did in my patch.

> From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@linux.ibm.com>
> Date: Thu, 3 Dec 2020 11:40:17 +0200
> Subject: [PATCH] fixup for "mm: refactor initialization of stuct page for holes"
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/page_alloc.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ce2bdaabdf96..86fde4424e87 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  			       unsigned long zone,
>  			       unsigned long range_start_pfn)
>  {
> -	unsigned long start_pfn, end_pfn, next_pfn = 0;
> +	static unsigned long hole_start_pfn;
> +	unsigned long start_pfn, end_pfn;
>  	unsigned long range_end_pfn = range_start_pfn + size;
>  	u64 pgcnt = 0;
>  	int i;
> @@ -6235,7 +6236,6 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>  		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> -		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
> 
>  		if (end_pfn > start_pfn) {
>  			size = end_pfn - start_pfn;
> @@ -6243,10 +6243,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>  		}
> 
> -		if (next_pfn < start_pfn)
> -			pgcnt += init_unavailable_range(next_pfn, start_pfn,
> -							zone, nid);
> -		next_pfn = end_pfn;
> +		if (hole_start_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(hole_start_pfn,
> +							start_pfn, zone, nid);
> +		hole_start_pfn = end_pfn;
>  	}
> 
>  	/*
> @@ -6256,8 +6256,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  	 * considered initialized. Make sure that memmap has a well defined
>  	 * state.
>  	 */
> -	if (next_pfn < range_end_pfn)
> -		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
> +	if (hole_start_pfn < range_end_pfn)
> +		pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
>  						zone, nid);
> 
>  	if (pgcnt)

1) this looks very specific to pfn 0

2) it's not optimal to call init_unavailable_range(0, ...) the first
   time at every zone initialization, that's a pretty inefficient loop
   for large RAM systems.

3) you give pfn 0 a zoneid despite it's not part of the zone, and that
   looks wrong from a zoneid initialization point of view.

The bug here is that the DMA zone starts at pfn 1, which doesn't make
any technical sense since the start of the pageblock would then fall
outside the zone (below zone_start_pfn).

I don't really see how you can fix this without ending up with an even
bigger undefined special case to keep in mind, where a pfn gets
initialized beyond the zone.

If anything, the real good alternative to changing the zone_start_pfn
to start from 0 and not from 1 is to keep pfn 0 PagePoison and to
never call __SetPageReserved on it in reserve_bootmem_region. That is
also a very clean solution where everything remains in check. The
poison is effectively a NO_NODEID/NO_ZONEID marker, because that pfn 0
really is not spanned by any nid or any zone. Although we'd need to
ensure compaction won't ever try to access it, by means of checking
zone_start_pfn so that it can't rewind to the start of the pageblock.

It'll cause an inefficiency, but that also looks cleaner than giving
pfn 0 a random zoneid beyond the zone span.

Thanks,
Andrea
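For readers without the x86 code at hand, the caller pattern summarized
in the message above corresponds roughly to a loop of this shape. This
is only a C rendering of the pseudocode given there, not the verbatim
kernel function: the real code is spread across arch/x86/kernel/e820.c
and treats several E820 types specially, and the function name below is
invented for the example.

        /*
         * Simplified sketch of the e820 -> memblock translation under
         * discussion; error handling and special E820 types are omitted.
         */
        static void __init e820_to_memblock_sketch(void)
        {
                int i;

                for (i = 0; i < e820_table->nr_entries; i++) {
                        struct e820_entry *entry = &e820_table->entries[i];

                        if (entry->type == E820_TYPE_RAM)
                                /* System RAM becomes part of memblock.memory */
                                memblock_add(entry->addr, entry->size);
                        else
                                /*
                                 * Everything else (BIOS reserved, "Unknown
                                 * E820 type", ...) only lands in
                                 * memblock.reserved, so it never enters
                                 * memblock.memory.
                                 */
                                memblock_reserve(entry->addr, entry->size);
                }
        }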
Hi Mel,

On Thu, Nov 26, 2020 at 10:47:20AM +0000, Mel Gorman wrote:
> Agreed. This thread has a lot of different directions in it at this
> point so what I'd hope for is first, a patch that initialises holes with
> zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> hole, it would be expected the pages are PageReserved. Second, a fix to
> fast_isolate that forces PFNs returned to always be within the stated
> zone boundaries.

The last two patches in
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/ should
resolve the struct page initialization, and the VM_BUG_ON_PAGE never
happened again, as expected.

So I looked back to see how accurate the "distance" logic is. I added
those checks and I get negative hits:

diff --git a/mm/compaction.c b/mm/compaction.c
index cc1a7f600a86..844a90b0acdf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
 
+	WARN_ON_ONCE((long) low_pfn < 0);
+	WARN_ON_ONCE((long) min_pfn < 0);
+	if ((long) low_pfn < 0)
+		return cc->free_pfn;
+	if ((long) min_pfn < 0)
+		return cc->free_pfn;
	if (WARN_ON_ONCE(min_pfn > low_pfn))
		low_pfn = min_pfn;
 
Both WARN_ON_ONCEs trigger, so the values do go negative. While it
appears not to crash the kernel, since pfn_valid won't succeed on
negative entries and they'll always be higher than any pfn in the
freelist, is this a sign that there's further room for improvement
here?

Thanks,
Andrea
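A standalone illustration of why those two WARNs can fire: the pfn
arithmetic is done on unsigned long, so once cc->migrate_pfn has
advanced past cc->free_pfn the distance wraps around, and the follow-up
subtraction produces a huge value that looks negative when cast to
long. The pfn values below are made up for the demo and the
pageblock_start_pfn() masking is left out for simplicity.

        #include <stdio.h>

        int main(void)
        {
                /* made-up pfns with the migrate scanner past the free scanner */
                unsigned long free_pfn = 0x39800, migrate_pfn = 0x39a00;

                unsigned long distance = free_pfn - migrate_pfn;    /* wraps */
                unsigned long min_pfn = free_pfn - (distance >> 1); /* wraps again */

                printf("distance       = %lu\n", distance);
                printf("(long) min_pfn = %ld\n", (long) min_pfn);   /* negative */
                return 0;
        }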
On Thu, Dec 03, 2020 at 12:31:44PM -0500, Andrea Arcangeli wrote:
> On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> > On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > > 5) pfn 0 is the classical case where pfn 0 is in a reserved range in
> > >    memblock.reserved that doesn't overlap any memblock.memory range.
> > 
> > And, IMHO, this is the fundamental BUG.
> 
> We have a dozen arches that change all the time,

We have ~20 arches that work just fine and only x86 runs into troubles
from time to time ;-)

> new efi/e820 stuff, always new BIOS configs; the memblock code must
> be able to cope with the caller being wrong or changing behavior if
> the e820 map changes behaviour.

> Trying to work around memblock core deficiencies in the caller is what
> 124049decbb121ec32742c94fb5d9d6bed8f24d8 already did, and it doesn't
> work and it's not even a clean thing to do.

It's not exactly memblock core deficiencies, it's rather the whole path
from memory detection to the point where we have the memmap and the
page allocator functional.

I agree that the initialization of the memmap and the page allocator
should be improved. And 73a6e474cb376921a311786652782155eac2fdf0 was
part of a large series that largely reduced the complexity of
free_area_init() and friends. Particularly, this commit removed a
couple of cycles from memmap init. And when you touch such a sensitive
and fragile area, mistakes can happen. So
73a6e474cb376921a311786652782155eac2fdf0 missed a corner case, but I
don't understand why you are so annoyed by "memblock core"...

> In fact here there's really nothing to blame the caller for (i.e. the
> e820 init code), unless you declare that there must always be overlap
> (which I understood would break stuff for regions that must not have a
> direct mapping).
> 
> The caller here is correct and it can be summarized as below:
> 
> 	if (e820 map type != system_ram)
> 		memblock_reserve(range)
> 	else
> 		memblock_add(range)

I don't agree that memblock_add(range) should be in the else. PFN 0 is
*memory*, so why don't we treat it as such?

I have a preliminary patch that actually does this and updates the x86
parts of memory initialization to keep it backward compatible.

> Then:
> 
> 	zone_boundaries = [ 16M, 4G, 0 ]
> 	free_area_init(zone_boundaries)
> 
> It's not the caller's fault if the e820 map in the BIOS starts with:
> 
> 	pfn 0-1   reserved
> 	pfn 1-N   system RAM

So in your example, is pfn 0 a part of memory or not? Is there physical
memory at address 0? I don't think that because the BIOS uses this
memory it stops being memory.

> This is better fixed in a way that must not require any change in the
> caller...

The generic code needs to be more robust, but I'm not sure that the
complexity around holes initialization vs. a relatively simple update
to the caller is the best way to solve it.

> > There is RAM at address 0, there is a DIMM there, so why on earth
> > should this not be a part of memblock.memory?
> 
> How can you blame the caller if you explicitly didn't require that
> .reserved ranges must always overlap .memory ranges? In fact it was
> explained as a feature to avoid the direct mapping.
> 
> Declaring that there must always be overlap would be one way to cope
> with this issue, but even then memblock must detect a wrong caller,
> work around it by doing a memblock_add call inside memblock_reserve
> and by printing a warning to improve the caller. It shouldn't crash at
> boot with the console off if the overlap is not there.

If memblock had done this and had an implicit memblock_add() for the
reserved areas, we would hit the same issue with regions where any
access to them crashes the system.

> In my view the caller here is not to blame; all these issues arise
> because memblock needs improvement and must cope with any caller.

The caller, which is low level architecture code, has the best
knowledge of the memory layout. It's the caller who translates that
memory layout to the generic abstraction. If that translation lacks
details, the generic code is limited in its ability to work around
those missing details.

In the case of e820 it seems that the existing abstraction of 'memory'
and 'reserved' is not sufficient and we need more details to describe
the physical memory. For instance 'memory', 'used by BIOS', 'reserved
by kernel'. Still, this requires consensus between arch and mm about
what is what.

> > >    The memblock_start_of_DRAM moves the start of the DMA zone above
> > >    pfn 0, so pfn 0 already ends up in the no-zone land David mentioned,
> > >    where it's not trivial to decide whether to give it a zoneid if it's
> > >    not even spanned by the zone.
> > > 
> > >    Not just the zoneid, there's not even a nid.
> > > 
> > >    So we have a pfn with no zoneid, and no nid, not part of the zone
> > >    ranges, not part of the nid ranges, and not part of memblock.memory.
> > 
> > We have a pfn that should have been in memblock.memory, in ZONE_DMA and
> > in the first node with memory. If it was not trimmed from there
> 
> My incremental patch already solves how to extend the zone spans and
> the nid spans by taking memblock.reserved into account, not just
> memblock.memory, but still without actually messing with the direct
> mapping, otherwise I would have called memblock_add instead of walking
> memblock.reserved when detecting the nid and zoneid ranges.

I've commented on the patch in the other thread.

> > >    We can't really skip the initialization of pfn 0: it has to get the
> > >    zoneid 0 treatment, or if pfn 1 ends up in compaction code we'd
> > >    crash again (although we'd crash with a nice PagePoison(page) ==
> > >    true behavior).
> > 
> > Agree. The struct page for pfn 0 should get the same zoneid and nodeid
> > as pfn 1.
> 
> How are you sure there's no zone ID 0 that starts at pfn 0 and ends at
> pfn 1, with the DMA zone starting at pfn 1? In such a case pfn 0 would
> have a zoneid different from pfn 1.
> 
> I'm not exactly sure why we should initialize pfn 0 with the zoneid of
> a zone that pfn 0 is not part of, that looks wrong.
> 
> I'm not exactly sure why you don't fix the zone->zone_start_pfn and
> the respective nid node_start_pfn to start from pfn 0 instead, just
> like I did in my patch.

I think the correct fix for this is to have pfn 0 in memblock.memory.
Meanwhile, having this patch on top of "mm: refactor initialization of
stuct page for holes in memory layout" is the least of all evils, see
also the comments below.

> > From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > Date: Thu, 3 Dec 2020 11:40:17 +0200
> > Subject: [PATCH] fixup for "mm: refactor initialization of stuct page for holes"
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  mm/page_alloc.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index ce2bdaabdf96..86fde4424e87 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
> >  			       unsigned long zone,
> >  			       unsigned long range_start_pfn)
> >  						zone, nid);

...

> 1) this looks very specific to pfn 0

Nope, this fix takes care of pfns "before the start of RAM" in general.
To the best of my knowledge, this only happens to be pfn 0 on x86.

> 2) it's not optimal to call init_unavailable_range(0, ...) the first
>    time at every zone initialization, that's a pretty inefficient loop
>    for large RAM systems.

This is not called every time because hole_start_pfn is static.

> 3) you give pfn 0 a zoneid despite it's not part of the zone, and that
>    looks wrong from a zoneid initialization point of view.
> 
> The bug here is that the DMA zone starts at pfn 1, which doesn't make
> any technical sense since the start of the pageblock would then fall
> outside the zone (below zone_start_pfn).

Right, the bug is that the DMA zone starts at pfn 1 because pfn 0 is
not in .memory and thus bypasses all the zone and node sizing
calculations.

As architectures only specify max_zone_pfns when they call
free_area_init(), I believe it is safe to presume that every pfn that
is smaller than max_zone_pfns[0] belongs to zone 0. And this fix, along
with the interleaving of init_unavailable_mem() inside memmap_init(),
does exactly that. For pages magically appearing in the range
[0, memblock_start_of_DRAM()] I set the zone link in struct page to
zone[0].

> Thanks,
> Andrea
> 
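In code, the presumption described above (nothing below the first
zone's max boundary can belong to a higher zone) could be expressed
along these lines. The helper name is invented for the example and is
not what the posted fixup actually adds; only max_zone_pfns[] and
MAX_NR_ZONES are existing kernel notions.

        /*
         * Illustrative helper: pick a zone index for a pfn that is not
         * covered by memblock.memory, based only on the max_zone_pfns[]
         * boundaries passed to free_area_init().
         */
        static int zone_for_hole_pfn(unsigned long pfn,
                                     const unsigned long *max_zone_pfns)
        {
                int zid;

                for (zid = 0; zid < MAX_NR_ZONES; zid++)
                        if (pfn < max_zone_pfns[zid])
                                return zid;

                return MAX_NR_ZONES - 1; /* fall back to the last zone */
        }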
On Sat, Dec 05, 2020 at 09:26:47PM -0500, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Nov 26, 2020 at 10:47:20AM +0000, Mel Gorman wrote:
> > Agreed. This thread has a lot of different directions in it at this
> > point so what I'd hope for is first, a patch that initialises holes with
> > zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> > hole, it would be expected the pages are PageReserved. Second, a fix to
> > fast_isolate that forces PFNs returned to always be within the stated
> > zone boundaries.
> 
> The last two patches in
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/ should
> resolve the struct page initialization, and the VM_BUG_ON_PAGE never
> happened again, as expected.
> 
> So I looked back to see how accurate the "distance" logic is. I added
> those checks and I get negative hits:
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc1a7f600a86..844a90b0acdf 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
> 	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
> 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
> 
> +	WARN_ON_ONCE((long) low_pfn < 0);
> +	WARN_ON_ONCE((long) min_pfn < 0);
> +	if ((long) low_pfn < 0)
> +		return cc->free_pfn;
> +	if ((long) min_pfn < 0)
> +		return cc->free_pfn;
> 	if (WARN_ON_ONCE(min_pfn > low_pfn))
> 		low_pfn = min_pfn;
> 
> Both WARN_ON_ONCEs trigger, so the values do go negative. While it
> appears not to crash the kernel, since pfn_valid won't succeed on
> negative entries and they'll always be higher than any pfn in the
> freelist, is this a sign that there's further room for improvement
> here?
> 

Possibly, checking the wrong pfns is simply risky. This is not tested
at all, just checking whether it's even in the right ballpark. The
intent is that when the free/migrate PFNs are too close or already
overlapping, no attempt is made and it returns so that the caller
detects the scanners have met.

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..208cb5857446 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1313,6 +1313,10 @@ fast_isolate_freepages(struct compact_control *cc)
	if (cc->order <= 0)
		return cc->free_pfn;
 
+	/* Ensure that migration and free scanner are not about to cross */
+	if (cc->migrate_pfn < cc->free_pfn)
+		return cc->free_pfn;
+
	/*
	 * If starting the scan, use a deeper search and use the highest
	 * PFN found if a suitable one is not found.
@@ -1324,9 +1328,12 @@ fast_isolate_freepages(struct compact_control *cc)
 
	/*
	 * Preferred point is in the top quarter of the scan space but take
-	 * a pfn from the top half if the search is problematic.
+	 * a pfn from the top half if the search is problematic. Ensure
+	 * there is enough distance to make the fast search worthwhile.
	 */
	distance = (cc->free_pfn - cc->migrate_pfn);
+	if (distance <= (pageblock_nr_pages << 2))
+		return cc->free_pfn;
	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..d17e69549d34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
				page = pageblock_pfn_to_page(min_pfn,
					pageblock_end_pfn(min_pfn),
					cc->zone);
-				cc->free_pfn = min_pfn;
+				if (likely(!PageReserved(page)))
+					cc->free_pfn = min_pfn;
+				else
+					page = NULL;
			}
		}
	}
A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

==
page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
flags: 0x1fffe000001000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix
the second issue, but I still reproduced it twice with v5.9 on two
different systems:

==
page:0000000062b3e92f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x39800
flags: 0x1000(reserved)
==
page:000000002a7114f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7a200
flags: 0x1fff000000001000(reserved)
==

I actually never reproduced it until v5.9, but it's still the same bug
as was first reported for v5.7.

Note that the page is "reserved" in all 3 cases. In the last two
crashes with the pfn:

pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:

 39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:

 7a17b000-7a216fff : Unknown E820 type

This actually seems a false positive bugcheck: the page structures are
valid and the zones are correct, it's just non-RAM, and setting
pageblock skip there would do no harm. However it's possible to solve
the crash without lifting the bugcheck, by enforcing the invariant that
the free_pfn cursor doesn't point to reserved pages (which would
otherwise be implicitly achieved through the PageBuddy check, except in
the new fast_isolate_around() path).

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)