
[1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

Message ID 20201121194506.13464-2-aarcange@redhat.com (mailing list archive)
State New, archived
Series VM_BUG_ON_PAGE(!zone_spans_pfn) in set_pfnblock_flags_mask

Commit Message

Andrea Arcangeli Nov. 21, 2020, 7:45 p.m. UTC
A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

	https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

==
page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
flags: 0x1fffe000001000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix
the second issue, but I still reproduced it twice with v5.9 on two
different systems:

==
page:0000000062b3e92f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x39800
flags: 0x1000(reserved)
==
page:000000002a7114f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7a200
flags: 0x1fff000000001000(reserved)
==

I actually never reproduced it until v5.9, but it's still the same bug
that was first reported for v5.7.

Note the page is "reserved" in all 3 cases. For the last two crashes,
the pfn resolves as follows:

pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

This actually seems to be a false positive bugcheck: the page structures
are valid and the zones are correct, it's just non-RAM, and setting
pageblock_skip should do no harm. However it's possible to solve the
crash without lifting the bugcheck, by enforcing the invariant that
the free_pfn cursor doesn't point to reserved pages (which would
otherwise be implicitly achieved through the PageBuddy check, except in
the new fast_isolate_around() path).

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Andrea Arcangeli Nov. 21, 2020, 7:53 p.m. UTC | #1
On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
> +					if (likely(!PageReserved(page)))

NOTE: this line will have to become "likely(page &&
!PageReserved(page))" to handle the case of non-contiguous zones,
since pageblock_pfn_to_page might return NULL in that case. I noticed
right after sending, but I'll wait for some feedback before resending
the correction in case it gets fixed in another way.
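
For clarity, the hunk with that correction folded in would look roughly
like this (just a sketch based on the diff quoted in the replies below,
not a resend):

	page = pageblock_pfn_to_page(min_pfn,
				     pageblock_end_pfn(min_pfn),
				     cc->zone);
	/* skip reserved pages and the NULL returned for non-contiguous zones */
	if (likely(page && !PageReserved(page)))
		cc->free_pfn = min_pfn;
	else
		page = NULL;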
David Hildenbrand Nov. 23, 2020, 11:26 a.m. UTC | #2
On 21.11.20 20:53, Andrea Arcangeli wrote:
> On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
>> +					if (likely(!PageReserved(page)))
> 
> NOTE: this line will have to become "likely(page &&
> !PageReserved(page))" to handle the case of non contiguous zones,
> since pageblock_pfn_to_page might return NULL in that case.. I noticed
> right after sending, but I'll wait some feedback before resending the
> correction in case it gets fixed in another way.

With that fixed, this looks sane to me.
Vlastimil Babka Nov. 23, 2020, 1:01 p.m. UTC | #3
On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> A corollary issue was fixed in
> e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
> v5.7:
> 
> 	https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
> 
> ==
> page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
> flags: 0x1fffe000001000(reserved)
> ==
> 
> 73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly the
> second issue, but I still reproduced it twice with v5.9 on two
> different systems:
> 
> ==
> page:0000000062b3e92f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x39800
> flags: 0x1000(reserved)
> ==
> page:000000002a7114f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7a200
> flags: 0x1fff000000001000(reserved)
> ==
> 
> I actually never reproduced it until v5.9, but it's still the same bug
> as it was reported first for v5.7.
> 
> See the page is "reserved" in all 3 cases. In the last two crashes
> with the pfn:
> 
> pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:
> 
> 39639000-39814fff : Unknown E820 type
> 
> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
> 
> 7a17b000-7a216fff : Unknown E820 type

It would be nice to also provide a /proc/zoneinfo and how exactly the
"zone_spans_pfn" check was violated. I assume we end up below the zone's
start_pfn, but is that true?

> This actually seems a false positive bugcheck, the page structures are
> valid and the zones are correct, just it's non-RAM but setting
> pageblockskip should do no harm. However it's possible to solve the
> crash without lifting the bugcheck, by enforcing the invariant that
> the free_pfn cursor doesn't point to reserved pages (which would be
> otherwise implicitly achieved through the PageBuddy check, except in
> the new fast_isolate_around() path).
> 
> Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   mm/compaction.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13cb7a961b31..d17e69549d34 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
>   					page = pageblock_pfn_to_page(min_pfn,
>   						pageblock_end_pfn(min_pfn),
>   						cc->zone);
> -					cc->free_pfn = min_pfn;
> +					if (likely(!PageReserved(page)))

The PageReserved check seems a rather awkward solution to me. Wouldn't it
be more obvious if we made sure we don't end up below the zone's start_pfn
(if my assumption is correct) in the first place?

When I check the code:

unsigned long distance;
distance = (cc->free_pfn - cc->migrate_pfn);
low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

I think what can happen is that cc->free_pfn <= cc->migrate_pfn after 
the very last isolate_migratepages(). Then compact_finished() detects 
that in compact_zone(), but only after migrate_pages() and thus 
fast_isolate_freepages() is called.

That would mean distance can be negative, or rather a large unsigned 
number and low_pfn and min_pfn end up away from the zone?

Or maybe the above doesn't happen, but cc->free_pfn gets so close to 
start of the zone, that the calculations above make min_pfn go below 
start_pfn?

In any case I would rather make sure we stay within the expected zone 
boundaries, than play tricks with PageReserved. Mel?

> +						cc->free_pfn = min_pfn;
> +					else
> +						page = NULL;
>   				}
>   			}
>   		}
>
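
To make the wraparound scenario above concrete with hypothetical numbers
(both cursors are unsigned long, so the subtraction wraps if the free
scanner is behind the migration scanner):

	unsigned long free_pfn = 0x40000;	/* hypothetical free scanner cursor */
	unsigned long migrate_pfn = 0x40200;	/* hypothetical, already past free_pfn */
	unsigned long distance = free_pfn - migrate_pfn; /* wraps to ULONG_MAX - 0x1ff */
	/* free_pfn - (distance >> 1) then wraps again, yielding a pfn far outside the zone */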
Mel Gorman Nov. 24, 2020, 1:32 p.m. UTC | #4
On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> > I actually never reproduced it until v5.9, but it's still the same bug
> > as it was reported first for v5.7.
> > 
> > See the page is "reserved" in all 3 cases. In the last two crashes
> > with the pfn:
> > 
> > pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:
> > 
> > 39639000-39814fff : Unknown E820 type
> > 
> > pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
> > 
> > 7a17b000-7a216fff : Unknown E820 type
> 
> It would be nice to also provide a /proc/zoneinfo and how exactly the
> "zone_spans_pfn" was violated. I assume we end up below zone's start_pfn,
> but is it true?
> 

It must be if zone_spans_pfn is getting triggered.

> > This actually seems a false positive bugcheck, the page structures are
> > valid and the zones are correct, just it's non-RAM but setting
> > pageblockskip should do no harm. However it's possible to solve the
> > crash without lifting the bugcheck, by enforcing the invariant that
> > the free_pfn cursor doesn't point to reserved pages (which would be
> > otherwise implicitly achieved through the PageBuddy check, except in
> > the new fast_isolate_around() path).
> > 
> > Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >   mm/compaction.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 13cb7a961b31..d17e69549d34 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
> >   					page = pageblock_pfn_to_page(min_pfn,
> >   						pageblock_end_pfn(min_pfn),
> >   						cc->zone);
> > -					cc->free_pfn = min_pfn;
> > +					if (likely(!PageReserved(page)))
> 
> PageReserved check seems rather awkward solution to me. Wouldn't it be more
> obvious if we made sure we don't end up below zone's start_pfn (if my
> assumption is correct) in the first place?
> 
> When I check the code:
> 
> unsigned long distance;
> distance = (cc->free_pfn - cc->migrate_pfn);
> low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
> min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
> 
> I think what can happen is that cc->free_pfn <= cc->migrate_pfn after the
> very last isolate_migratepages().

I would hope that is not the case because they are not meant to overlap.
However, if the beginning of the pageblock was not the start of a zone
then the pages would be valid but the pfn would still be outside the
zone boundary. If it was reserved, the struct page is valid but not
suitable for set_pfnblock_flags_mask. However, it is a concern in
general because the potential is there that pages are isolated from the
wrong zone.

> Then compact_finished() detects that in
> compact_zone(), but only after migrate_pages() and thus
> fast_isolate_freepages() is called.
> 
> That would mean distance can be negative, or rather a large unsigned number
> and low_pfn and min_pfn end up away from the zone?
> 
> Or maybe the above doesn't happen, but cc->free_pfn gets so close to start
> of the zone, that the calculations above make min_pfn go below start_pfn?
> 

I think the last part is more likely, going below start_pfn.

> In any case I would rather make sure we stay within the expected zone
> boundaries, than play tricks with PageReserved. Mel?
> 

It would be preferable because this time it's PageReserved that happened
to trip up an assumption in set_pfnblock_flags_mask, but if it was a real
zone and a real page then compaction is migrating cross-zone, which would
be surprising.

Maybe this untested patch?


diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..ef1b5dacc289 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1330,6 +1330,10 @@ fast_isolate_freepages(struct compact_control *cc)
 	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
 
+	/* Ensure the PFN is within the zone */
+	low_pfn = max(cc->zone->zone_start_pfn, low_pfn);
+	min_pfn = max(cc->zone->zone_start_pfn, min_pfn);
+
 	if (WARN_ON_ONCE(min_pfn > low_pfn))
 		low_pfn = min_pfn;
Andrea Arcangeli Nov. 24, 2020, 8:56 p.m. UTC | #5
Hello,

On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote:
> I would hope that is not the case because they are not meant to overlap.
> However, if the beginning of the pageblock was not the start of a zone
> then the pages would be valid but the pfn would still be outside the
> zone boundary. If it was reserved, the struct page is valid but not
> suitable for set_pfnblock_flags_mask. However, it is a concern in
> general because the potential is there that pages are isolated from the
> wrong zone.

I guess we have more than one issue to correct in that function
because the same BUG_ON reproduced again even with the tentative patch
I posted earlier.

So my guess is that the problematic reserved page isn't pointed to by
min_pfn, but it must have been pointed to by the "highest" variable
calculated below?

			if (pfn >= highest)
				highest = pageblock_start_pfn(pfn);

When I looked at where "highest" comes from, it lacks the
pageblock_pfn_to_page check (which was added to min_pfn around v5.7).

Is that the real bug, which may be fixed by something like this? (untested)

==
From 262671e88723b3074251189004ceae39dcd1689d Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 21 Nov 2020 12:55:58 -0500
Subject: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
 pageblock_skip on reserved pages

A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

	https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

==
page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
flags: 0x1fffe000001000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix
the second issue, but it still reproduces with v5.9 on two different
systems:

==
page:0000000062b3e92f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x39800
flags: 0x1000(reserved)
raw: 0000000000001000 fffff5b880e60008 fffff5b880e60008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==
page:000000004d32675d refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7a200
flags: 0x1fff000000001000(reserved)
raw: 1fff000000001000 ffffe6c5c1e88008 ffffe6c5c1e88008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==

The page is "reserved" in all cases. For the last two crashes, the pfn
resolves as follows:

pfn 0x39800 -> 0x39800000 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

The pageblock_pfn_to_page check was missing when rewinding pfn to the
start of the pageblock to calculate the "highest" value.

So the "highest" pfn could point to a non valid pfn or not within the
zone, checking the pageblock_pfn_to_page fixes it.

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 653862aba266..76f061af8f22 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1418,8 +1418,14 @@ fast_isolate_freepages(struct compact_control *cc)
 			nr_scanned++;
 			pfn = page_to_pfn(freepage);
 
-			if (pfn >= highest)
-				highest = pageblock_start_pfn(pfn);
+			if (pfn >= highest) {
+				unsigned long start_pfn, end_pfn;
+				start_pfn = pageblock_start_pfn(pfn);
+				end_pfn = pageblock_end_pfn(start_pfn);
+				if (pageblock_pfn_to_page(start_pfn, end_pfn,
+							  cc->zone))
+					highest = pfn;
+			}
 
 			if (pfn >= low_pfn) {
 				cc->fast_search_fail = 0;
==


Can't we also try to scan between start_pfn and "pfn" to see if
there's a pfn that passes the pageblock_pfn_to_page test, or isn't it
worth it for the fast isolate variant?

> > Then compact_finished() detects that in
> > compact_zone(), but only after migrate_pages() and thus
> > fast_isolate_freepages() is called.
> > 
> > That would mean distance can be negative, or rather a large unsigned number
> > and low_pfn and min_pfn end up away from the zone?
> > 
> > Or maybe the above doesn't happen, but cc->free_pfn gets so close to start
> > of the zone, that the calculations above make min_pfn go below start_pfn?
> > 
> 
> I think the last part is more likely, going below start_pfn

Would it help if I dumped the whole status of the zone and the pages
around those addresses on the two systems that are reproducing this in
v5.9, as an extra check?

I was going to do that right now, to validate that all zone->zone_start_pfn
and zone_end_pfn() are correct around that non-RAM reserved page's
physical address.

> > In any case I would rather make sure we stay within the expected zone
> > boundaries, than play tricks with PageReserved. Mel?
> > 
> 
> It would be preferable because this time it's PageReserved that happened
> to trip up an assumption in set_pfnblock_flags_mask but if it was a real
> zone and real page then compaction is migrating cross-zone which would
> be surprising.
> 
> Maybe this untested patch?

"highest" is not influenced by either low_pfn or min_pfn so it may
very well be needed, but for another case, I don't think this can help
this specific VM_BUG_ON_PAGE if it's caused by "highest" pfn after all?

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13cb7a961b31..ef1b5dacc289 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1330,6 +1330,10 @@ fast_isolate_freepages(struct compact_control *cc)
>  	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
>  	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
>  
> +	/* Ensure the PFN is within the zone */
> +	low_pfn = max(cc->zone->zone_start_pfn, low_pfn);
> +	min_pfn = max(cc->zone->zone_start_pfn, min_pfn);
> +
>  	if (WARN_ON_ONCE(min_pfn > low_pfn))
>  		low_pfn = min_pfn;
>  
> 
> 
> -- 
> Mel Gorman
> SUSE Labs
>
Andrea Arcangeli Nov. 25, 2020, 5:34 a.m. UTC | #6
Hello,

On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> > A corollary issue was fixed in
> > 39639000-39814fff : Unknown E820 type
> > 
> > pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
> > 
> > 7a17b000-7a216fff : Unknown E820 type
> 
> It would be nice to also provide a /proc/zoneinfo and how exactly the 
> "zone_spans_pfn" was violated. I assume we end up below zone's 
> start_pfn, but is it true?

Agreed, I was about to grab that info along with all the page structs
around pfn 0x7a200 and phys address 0x7a216fff.

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM

DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0     
Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1     
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     

500222 0x7a1fe000 0x1fff000000001000 reserved True
500223 0x7a1ff000 0x1fff000000001000 reserved True

# I suspect "highest pfn" was somewhere in the RAM range
# 0x7a217000-0x7a400000 and the pageblock_start_pfn(pfn)
# made highest point to pfn 0x7a200 physaddr 0x7a200000
# below, which is reserved indeed since it's non-RAM
# first number is pfn hex(500224) == 0x7a200

pfn    physaddr   page->flags
500224 0x7a200000 0x1fff000000001000 reserved True
500225 0x7a201000 0x1fff000000001000 reserved True
*snip*
500245 0x7a215000 0x1fff000000001000 reserved True
500246 0x7a216000 0x1fff000000001000 reserved True
500247 0x7a217000 0x3fff000000000000 reserved False
500248 0x7a218000 0x3fff000000000000 reserved False

All RAM pages non-reserved are starting at 0x7a217000 as expected.

For the non-RAM pfn, page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA
and page_zone(page) below was the DMA zone, despite the pfn 0x7a200 being
in DMA32.

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

So the patch I sent earlier should prevent the above BUG_ON by not
setting highest to 0x7a200 when pfn is in the phys RAM range
0x7a217000-0x7a400000, because pageblock_pfn_to_page will notice that
the zone is the wrong one.

	if (page_zone(start_page) != zone)
		return NULL;
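
Putting that together with the zone table above, the failing check can be
reconstructed like this (a sketch using the values from this report):

	unsigned long pfn = 0x7a200;		/* 500224, inside DMA32 [4096, 1048576) */
	struct page *page = pfn_to_page(pfn);	/* reserved, zone bits in page->flags are 0 */
	struct zone *zone = page_zone(page);	/* resolves to ZONE_DMA, not DMA32 */
	/* ZONE_DMA spans [1, 4096), so zone_spans_pfn(zone, 500224) is false */
	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);	/* fires */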

However the real bug seems to be that reserved pages have a zero zone id
in page->flags when they should have the real zone id/nid. The patch I
sent earlier to validate "highest" would then only be needed to deal with
pfn_valid.

Something must have changed more recently than v5.1 that caused the
zoneid of reserved pages to be wrong; a possible candidate for the real
cause would be this change below:

+               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
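
For context, the last two arguments of that helper are the zone index and
the node id (a rough sketch of the signature, for reference only):

	/* the call above initializes the hole's struct pages with zone 0 / nid 0 */
	static void __meminit __init_single_page(struct page *page, unsigned long pfn,
						 unsigned long zone, int nid);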

Even if this isn't it, in light of how the reserved page zoneid/nid
initialization went wrong, the above line looks too flakey to stay.

It'd be preferable if pfn_valid failed and pfn_to_section_nr(pfn)
returned an invalid section for the intermediate step. Even better,
memset 0xff over the whole page struct until the second stage comes
around.

Whenever pfn_valid is true, it's better that the zoneid/nid is correct
at all times; otherwise, if the second stage fails, we end up in a bug
with weird side effects.

Maybe it's not the above that left a zero zoneid though; I haven't
tried to bisect it yet to check how page->flags looked on an older
kernel that didn't seem to reproduce this crash, I'm just guessing.

Thanks,
Andrea
David Hildenbrand Nov. 25, 2020, 6:45 a.m. UTC | #7
> Am 25.11.2020 um 06:34 schrieb Andrea Arcangeli <aarcange@redhat.com>:
> 
> Hello,
> 
>> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
>>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
>>> A corollary issue was fixed in
>>> 39639000-39814fff : Unknown E820 type
>>> 
>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
>>> 
>>> 7a17b000-7a216fff : Unknown E820 type
>> 
>> It would be nice to also provide a /proc/zoneinfo and how exactly the 
>> "zone_spans_pfn" was violated. I assume we end up below zone's 
>> start_pfn, but is it true?
> 
> Agreed, I was about to grab that info along with all page struct
> around the pfn 0x7a200 and phys address 0x7a216fff.
> 
> # grep -A1 E820 /proc/iomem
> 7a17b000-7a216fff : Unknown E820 type
> 7a217000-7bffffff : System RAM
> 
> DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
> DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0     
> Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1     
> Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
> 
> 500222 0x7a1fe000 0x1fff000000001000 reserved True
> 500223 0x7a1ff000 0x1fff000000001000 reserved True
> 
> # I suspect "highest pfn" was somewhere in the RAM range
> # 0x7a217000-0x7a400000 and the pageblock_start_pfn(pfn)
> # made highest point to pfn 0x7a200 physaddr 0x7a200000
> # below, which is reserved indeed since it's non-RAM
> # first number is pfn hex(500224) == 0x7a200
> 
> pfn    physaddr   page->flags
> 500224 0x7a200000 0x1fff000000001000 reserved True
> 500225 0x7a201000 0x1fff000000001000 reserved True
> *snip*
> 500245 0x7a215000 0x1fff000000001000 reserved True
> 500246 0x7a216000 0x1fff000000001000 reserved True
> 500247 0x7a217000 0x3fff000000000000 reserved False
> 500248 0x7a218000 0x3fff000000000000 reserved False
> 
> All RAM pages non-reserved are starting at 0x7a217000 as expected.
> 
> The non-RAM page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA and 
> page_zone(page) below was the DMA zone despite the pfn of 0x7a200 is
> in DMA32.
> 
>    VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> So the patch I sent earlier should prevent the above BUG_ON by not
> setting highest to 0x7a200 when pfn is in the phys RAM range
> 0x7a217000-0x7a400000, because pageblock_pfn_to_page will notice that
> the zone is the wrong one.
> 
>    if (page_zone(start_page) != zone)
>        return NULL;
> 
> However the real bug seems that reserved pages have a zero zone_id in
> the page->flags when it should have the real zone id/nid. The patch I
> sent earlier to validate highest would only be needed to deal with
> pfn_valid.
> 
> Something must have changed more recently than v5.1 that caused the
> zoneid of reserved pages to be wrong, a possible candidate for the
> real would be this change below:
> 
> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> 

Before that change, the memmap of memory holes was only zeroed out. So the zone/nid was 0; however, the pages were not reserved and had a refcount of zero - resulting in other issues.

Most pfn walkers shouldn't mess with reserved pages and should simply skip them. That would be the right fix here.

> Even if it may not be it, at the light of how the reserved page
> zoneid/nid initialized went wrong, the above line like it's too flakey
> to stay.
> 
> It'd be preferable if the pfn_valid fails and the
> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> step. Even better memset 0xff over the whole page struct until the
> second stage comes around.

I recently discussed with Baoquan:
1. Using a new pagetype to mark memory holes
2. Using a special nid/zid to mark memory holes

Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.

> 
> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> all times, otherwise if the second stage fails we end up in a bug with
> weird side effects.

Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?

> 
> Maybe it's not the above that left a zero zoneid though, I haven't
> tried to bisect it yet to look how the page->flags looked like on a
> older kernel that didn't seem to reproduce this crash, I'm just
> guessing.
> 
> Thanks,
> Andrea
Mike Rapoport Nov. 25, 2020, 8:51 a.m. UTC | #8
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> 
> > Am 25.11.2020 um 06:34 schrieb Andrea Arcangeli <aarcange@redhat.com>:
> > 
> > Hello,
> > 
> >> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> >>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> >>> A corollary issue was fixed in
> >>> 39639000-39814fff : Unknown E820 type
> >>> 
> >>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
> >>> 
> >>> 7a17b000-7a216fff : Unknown E820 type
> >> 
> >> It would be nice to also provide a /proc/zoneinfo and how exactly the 
> >> "zone_spans_pfn" was violated. I assume we end up below zone's 
> >> start_pfn, but is it true?
> > 
> > Agreed, I was about to grab that info along with all page struct
> > around the pfn 0x7a200 and phys address 0x7a216fff.
> > 
> > # grep -A1 E820 /proc/iomem
> > 7a17b000-7a216fff : Unknown E820 type
> > 7a217000-7bffffff : System RAM
> > 
> > DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
> > DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0     
> > Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1     
> > Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
> > 
> > 500222 0x7a1fe000 0x1fff000000001000 reserved True
> > 500223 0x7a1ff000 0x1fff000000001000 reserved True
> > 
> > # I suspect "highest pfn" was somewhere in the RAM range
> > # 0x7a217000-0x7a400000 and the pageblock_start_pfn(pfn)
> > # made highest point to pfn 0x7a200 physaddr 0x7a200000
> > # below, which is reserved indeed since it's non-RAM
> > # first number is pfn hex(500224) == 0x7a200
> > 
> > pfn    physaddr   page->flags
> > 500224 0x7a200000 0x1fff000000001000 reserved True
> > 500225 0x7a201000 0x1fff000000001000 reserved True
> > *snip*
> > 500245 0x7a215000 0x1fff000000001000 reserved True
> > 500246 0x7a216000 0x1fff000000001000 reserved True
> > 500247 0x7a217000 0x3fff000000000000 reserved False
> > 500248 0x7a218000 0x3fff000000000000 reserved False
> > 
> > All RAM pages non-reserved are starting at 0x7a217000 as expected.
> > 
> > The non-RAM page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA and 
> > page_zone(page) below was the DMA zone despite the pfn of 0x7a200 is
> > in DMA32.
> > 
> >    VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > 
> > So the patch I sent earlier should prevent the above BUG_ON by not
> > setting highest to 0x7a200 when pfn is in the phys RAM range
> > 0x7a217000-0x7a400000, because pageblock_pfn_to_page will notice that
> > the zone is the wrong one.
> > 
> >    if (page_zone(start_page) != zone)
> >        return NULL;
> > 
> > However the real bug seems that reserved pages have a zero zone_id in
> > the page->flags when it should have the real zone id/nid. The patch I
> > sent earlier to validate highest would only be needed to deal with
> > pfn_valid.
> > 
> > Something must have changed more recently than v5.1 that caused the
> > zoneid of reserved pages to be wrong, a possible candidate for the
> > real would be this change below:
> > 
> > +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> > 
> 
> Before that change, the memmap of memory holes were only zeroed out.
> So the zones/nid was 0, however, pages were not reserved and had a
> refcount of zero - resulting in other issues.
> 
> Most pfn walkers shouldn‘t mess with reserved pages and simply skip
> them. That would be the right fix here.

My guess would be that it is me and Baoquan:
73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")

Until then, reserved pages were traversed in memmap_init_zone(); after
the change they are not, because on x86 reserved memory is not considered
memory for some reason.

Can you please check if this untested patch helps:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..be3c85a6714e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6191,7 +6191,9 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 {
 	unsigned long start_pfn, end_pfn;
 	unsigned long range_end_pfn = range_start_pfn + size;
+	phys_addr_t start, end;
 	int i;
+	u64 j;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
@@ -6203,6 +6205,19 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
 	}
+
+	__for_each_mem_range(j, &memblock.reserved, NULL, nid,	MEMBLOCK_NONE,
+			     &start, &end, NULL) {
+		start_pfn = clamp(PHYS_PFN(start), range_start_pfn,
+				  range_end_pfn);
+		end_pfn = clamp(PHYS_PFN(end), range_start_pfn, range_end_pfn);
+
+		if (end_pfn > start_pfn) {
+			size = end_pfn - start_pfn;
+			memmap_init_zone(size, nid, zone, start_pfn,
+					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+		}
+	}
 }
 
 static int zone_batchsize(struct zone *zone)

> > Even if it may not be it, at the light of how the reserved page
> > zoneid/nid initialized went wrong, the above line like it's too flakey
> > to stay.
> > 
> > It'd be preferable if the pfn_valid fails and the
> > pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> > step. Even better memset 0xff over the whole page struct until the
> > second stage comes around.
> 
> I recently discussed with Baoquan to
> 1. Using a new pagetype to mark memory holes
> 2. Using special nid/zid to mark memory holes
> 
> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.
> 
> > 
> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
> 
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
> 
> > 
> > Maybe it's not the above that left a zero zoneid though, I haven't
> > tried to bisect it yet to look how the page->flags looked like on a
> > older kernel that didn't seem to reproduce this crash, I'm just
> > guessing.
> > 
> > Thanks,
> > Andrea
>
Mel Gorman Nov. 25, 2020, 10:30 a.m. UTC | #9
On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote:
> Hello,
> 
> On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote:
> > I would hope that is not the case because they are not meant to overlap.
> > However, if the beginning of the pageblock was not the start of a zone
> > then the pages would be valid but the pfn would still be outside the
> > zone boundary. If it was reserved, the struct page is valid but not
> > suitable for set_pfnblock_flags_mask. However, it is a concern in
> > general because the potential is there that pages are isolated from the
> > wrong zone.
> 
> I guess we have more than one issue to correct in that function
> because the same BUG_ON reproduced again even with the tentative patch
> I posted earlier.
> 
> So my guess is that the problematic reserved page isn't pointed by the
> min_pfn, but it must have been pointed by the "highest" variable
> calculated below?
> 
> 			if (pfn >= highest)
> 				highest = pageblock_start_pfn(pfn);
> 
> When I looked at where "highest" comes from, it lacks
> pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).
> 
> Is that the real bug, which may be fixed by something like this? (untested)
> 

It's plausible as it is a potential source of leaking, but as you note
in another mail, it's surprising to me that valid struct pages, even if
within memory holes and reserved, would have broken node/zone information
in the page flags.
Mel Gorman Nov. 25, 2020, 10:39 a.m. UTC | #10
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> > Something must have changed more recently than v5.1 that caused the
> > zoneid of reserved pages to be wrong, a possible candidate for the
> > real would be this change below:
> > 
> > +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> > 
> 
> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
> 
> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
> 

Ordinarily yes, pfn walkers should not care about reserved pages but it's
still surprising that the node/zone linkages would be wrong for memory
holes. If they are in the middle of a zone, it means that a hole with
valid struct pages could be mistaken for overlapping nodes (if the hole
was in node 1 for example) or overlapping zones which is just broken.

> > 
> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
> 
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
> 

It would partially paper over the issue of setting the pageblock type
based on a reserved page. I agree that compaction should not be returning
pfns that are outside of the zone range because that is buggy in itself,
but valid struct pages should have valid information. I don't think we
want to paper over that with unnecessary PageReserved checks.
David Hildenbrand Nov. 25, 2020, 11:04 a.m. UTC | #11
On 25.11.20 11:39, Mel Gorman wrote:
> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>>> Something must have changed more recently than v5.1 that caused the
>>> zoneid of reserved pages to be wrong, a possible candidate for the
>>> real would be this change below:
>>>
>>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
>>>
>>
>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
>>
>> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
>>
> 
> Ordinarily yes, pfn walkers should not care about reserved pages but it's
> still surprising that the node/zone linkages would be wrong for memory
> holes. If they are in the middle of a zone, it means that a hole with
> valid struct pages could be mistaken for overlapping nodes (if the hole
> was in node 1 for example) or overlapping zones which is just broken.

I agree within zones - but AFAIU, the issue is reserved memory between
zones, right?

Assume your end of memory falls within a section - what would be the
right node/zone for such a memory hole at the end of the section? With
memory hotplug after such a hole, we can easily have multiple
nodes/zones spanning such a hole, unknown before hotplug.

IMHO, marking memory holes properly (as discussed) would be the cleanest
approach. For now, we use node/zone 0 + PageReserved - because memory
hotunplug (zone shrinking etc.) doesn't really care about ZONE_DMA.

> 
>>>
>>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
>>> all times, otherwise if the second stage fails we end up in a bug with
>>> weird side effects.
>>
>> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
>>
> 
> It would partially paper over the issue that setting the pageblock type
> based on a reserved page. I agree that compaction should not be returning
> pfns that are outside of the zone range because that is buggy in itself
> but valid struct pages should have valid information. I don't think we
> want to paper over that with unnecessary PageReserved checks.

Agreed as long as we can handle that issue using range checks.
David Hildenbrand Nov. 25, 2020, 11:41 a.m. UTC | #12
On 25.11.20 12:04, David Hildenbrand wrote:
> On 25.11.20 11:39, Mel Gorman wrote:
>> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>>>> Something must have changed more recently than v5.1 that caused the
>>>> zoneid of reserved pages to be wrong, a possible candidate for the
>>>> real would be this change below:
>>>>
>>>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
>>>>
>>>
>>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
>>>
>>> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
>>>
>>
>> Ordinarily yes, pfn walkers should not care about reserved pages but it's
>> still surprising that the node/zone linkages would be wrong for memory
>> holes. If they are in the middle of a zone, it means that a hole with
>> valid struct pages could be mistaken for overlapping nodes (if the hole
>> was in node 1 for example) or overlapping zones which is just broken.
> 
> I agree within zones - but AFAIU, the issue is reserved memory between
> zones, right?

Double checking, I was confused. This also applies to memory holes
within zones on x86.
Vlastimil Babka Nov. 25, 2020, 12:08 p.m. UTC | #13
On 11/25/20 6:34 AM, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
>> > A corollary issue was fixed in
>> > 39639000-39814fff : Unknown E820 type
>> > 
>> > pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
>> > 
>> > 7a17b000-7a216fff : Unknown E820 type
>> 
>> It would be nice to also provide a /proc/zoneinfo and how exactly the 
>> "zone_spans_pfn" was violated. I assume we end up below zone's 
>> start_pfn, but is it true?
> 
> Agreed, I was about to grab that info along with all page struct
> around the pfn 0x7a200 and phys address 0x7a216fff.
> 
> # grep -A1 E820 /proc/iomem
> 7a17b000-7a216fff : Unknown E820 type
> 7a217000-7bffffff : System RAM
> 
> DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1
> DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0
> Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1
> Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0

So the above means that around the "Unknown E820 type" we have:

pfn 499712 - start of pageblock in ZONE_DMA32
pfn 500091 - start of the "Unknown E820 type" range
pfn 500224 - start of another pageblock
pfn 500246 - end of "Unknown E820 type"

So this is indeed not a zone boundary issue, but basically a hole not
aligned to a pageblock boundary, and really unexpected.
We have CONFIG_HOLES_IN_ZONE (which x86 doesn't set) for architectures
that do this, and even that config only affects pfn_valid_within(). But
here pfn_valid() is true, yet the zone/node linkage is unexpected.
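
A quick arithmetic check of those PFNs (assuming the x86-64 default
pageblock size of 512 pages, i.e. 2MB):

	499712 % 512 = 0	-> pageblock start, aligned
	500091 % 512 = 379	-> hole start, not aligned
	500224 % 512 = 0	-> next pageblock start, aligned
	500246 % 512 = 22	-> hole end, not aligned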

> However the real bug seems that reserved pages have a zero zone_id in
> the page->flags when it should have the real zone id/nid. The patch I
> sent earlier to validate highest would only be needed to deal with
> pfn_valid.
> 
> Something must have changed more recently than v5.1 that caused the
> zoneid of reserved pages to be wrong, a possible candidate for the
> real would be this change below:
> 
> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> 
> Even if it may not be it, at the light of how the reserved page
> zoneid/nid initialized went wrong, the above line like it's too flakey
> to stay.
> 
> It'd be preferable if the pfn_valid fails and the
> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> step. Even better memset 0xff over the whole page struct until the
> second stage comes around.
> 
> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> all times, otherwise if the second stage fails we end up in a bug with
> weird side effects.

Yeah I guess it would be simpler if the zoneid/nid was correct for
pfn_valid() pfns within a zone's range, even if they are reserved due
to not being really usable memory.

I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the
chosen solution is to make this a real hole, the hole should be
extended to MAX_ORDER_NR_PAGES-aligned boundaries.

In any case, compaction code can't fix this with better range checks.

> Maybe it's not the above that left a zero zoneid though, I haven't
> tried to bisect it yet to look how the page->flags looked like on a
> older kernel that didn't seem to reproduce this crash, I'm just
> guessing.
> 
> Thanks,
> Andrea
>
David Hildenbrand Nov. 25, 2020, 1:32 p.m. UTC | #14
On 25.11.20 13:08, Vlastimil Babka wrote:
> On 11/25/20 6:34 AM, Andrea Arcangeli wrote:
>> Hello,
>>
>> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
>>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
>>>> A corollary issue was fixed in
>>>> 39639000-39814fff : Unknown E820 type
>>>>
>>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
>>>>
>>>> 7a17b000-7a216fff : Unknown E820 type
>>>
>>> It would be nice to also provide a /proc/zoneinfo and how exactly the 
>>> "zone_spans_pfn" was violated. I assume we end up below zone's 
>>> start_pfn, but is it true?
>>
>> Agreed, I was about to grab that info along with all page struct
>> around the pfn 0x7a200 and phys address 0x7a216fff.
>>
>> # grep -A1 E820 /proc/iomem
>> 7a17b000-7a216fff : Unknown E820 type
>> 7a217000-7bffffff : System RAM
>>
>> DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1
>> DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0
>> Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1
>> Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0
> 
> So the above means that around the "Unknown E820 type" we have:
> 
> pfn 499712 - start of pageblock in ZONE_DMA32
> pfn 500091 - start of the "Unknown E820 type" range
> pfn 500224 - start of another pageblock
> pfn 500246 - end of "Unknown E820 type"
> 
> So this is indeed not a zone boundary issue, but basically a hole not 
> aligned to pageblock boundary and really unexpected.
> We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures 
> that do this, and even that config only affects pfn_valid_within(). But 
> here pfn_valid() is true, but the zone/node linkage is unexpected.
> 
>> However the real bug seems that reserved pages have a zero zone_id in
>> the page->flags when it should have the real zone id/nid. The patch I
>> sent earlier to validate highest would only be needed to deal with
>> pfn_valid.
>>
>> Something must have changed more recently than v5.1 that caused the
>> zoneid of reserved pages to be wrong, a possible candidate for the
>> real would be this change below:
>>
>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
>>
>> Even if it may not be it, at the light of how the reserved page
>> zoneid/nid initialized went wrong, the above line like it's too flakey
>> to stay.
>>
>> It'd be preferable if the pfn_valid fails and the
>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
>> step. Even better memset 0xff over the whole page struct until the
>> second stage comes around.
>>
>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
>> all times, otherwise if the second stage fails we end up in a bug with
>> weird side effects.
> 
> Yeah I guess it would be simpler if zoneid/nid was correct for 
> pfn_valid() pfns within a zone's range, even if they are reserved due 
> not not being really usable memory.
> 
> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
> chosen solution is to make this to a real hole, the hole should be 
> extended to MAX_ORDER_NR_PAGES aligned boundaries.

As we don't punch out pages of the memmap on x86-64, pfn_valid() keeps
working as expected. There is a memmap that can be accessed and that was
initialized. It's really just a matter of how to handle memory holes in
this scenario.

a) Try initializing them to the covering node/zone (I gave one example
   that might be tricky with hotplug)
b) Mark such pages (either special node/zone or pagetype) and make pfn
   walkers ignore these holes. For now, this can only be done via the
   reserved flag.
Mel Gorman Nov. 25, 2020, 1:33 p.m. UTC | #15
On Wed, Nov 25, 2020 at 12:04:15PM +0100, David Hildenbrand wrote:
> On 25.11.20 11:39, Mel Gorman wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >>> Something must have changed more recently than v5.1 that caused the
> >>> zoneid of reserved pages to be wrong, a possible candidate for the
> >>> real would be this change below:
> >>>
> >>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> >>>
> >>
> >> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
> >>
> >> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
> >>
> > 
> > Ordinarily yes, pfn walkers should not care about reserved pages but it's
> > still surprising that the node/zone linkages would be wrong for memory
> > holes. If they are in the middle of a zone, it means that a hole with
> > valid struct pages could be mistaken for overlapping nodes (if the hole
> > was in node 1 for example) or overlapping zones which is just broken.
> 
> I agree within zones - but AFAIU, the issue is reserved memory between
> zones, right?
> 

It can also occur in the middle of the zone.

> Assume your end of memory falls within a section - what would be the
> right node/zone for such a memory hole at the end of the section?

Assuming a hole is not MAX_ORDER-aligned but there is real memory within
the pageblock, then the node/zone for the struct pages backing the hole
should match the real memory's node and zone.

As it stands, with the uninitialised node/zone, certain checks like
page_is_buddy(): page_zone_id(page) != page_zone_id(buddy) may only
work by coincidence. page_is_buddy() happens to work anyway because
PageBuddy(buddy) would never be true for a PageReserved page.
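
For reference, the relevant part of that check has roughly this shape
(a simplified sketch, not the exact mainline code):

	static inline bool page_is_buddy_sketch(struct page *page, struct page *buddy)
	{
		/* a PageReserved page never has PageBuddy set, so this bails out first */
		if (!PageBuddy(buddy))
			return false;
		/* the zone id comparison that only works by coincidence for such holes */
		if (page_zone_id(page) != page_zone_id(buddy))
			return false;
		return true;
	}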

> With
> memory hotplug after such a hole, we can easily have multiple
> nodes/zones spanning such a hole, unknown before hotplug.
> 

When hotplugged, the same logic would apply. Where the hole is not aligned,
the struct page linkages should match the "real" memory.

> > It would partially paper over the issue that setting the pageblock type
> > based on a reserved page. I agree that compaction should not be returning
> > pfns that are outside of the zone range because that is buggy in itself
> > but valid struct pages should have valid information. I don't think we
> > want to paper over that with unnecessary PageReserved checks.
> 
> Agreed as long as we can handle that issue using range checks.
> 

I think it'll be ok as long as the struct pages within a 1<<(MAX_ORDER-1)
range have proper linkages.
David Hildenbrand Nov. 25, 2020, 1:41 p.m. UTC | #16
On 25.11.20 14:33, Mel Gorman wrote:
> On Wed, Nov 25, 2020 at 12:04:15PM +0100, David Hildenbrand wrote:
>> On 25.11.20 11:39, Mel Gorman wrote:
>>> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>>>>> Something must have changed more recently than v5.1 that caused the
>>>>> zoneid of reserved pages to be wrong, a possible candidate for the
>>>>> real would be this change below:
>>>>>
>>>>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
>>>>>
>>>>
>>>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
>>>>
>>>> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
>>>>
>>>
>>> Ordinarily yes, pfn walkers should not care about reserved pages but it's
>>> still surprising that the node/zone linkages would be wrong for memory
>>> holes. If they are in the middle of a zone, it means that a hole with
>>> valid struct pages could be mistaken for overlapping nodes (if the hole
>>> was in node 1 for example) or overlapping zones which is just broken.
>>
>> I agree within zones - but AFAIU, the issue is reserved memory between
>> zones, right?
>>
> 
> It can also occur in the middle of the zone.
> 
>> Assume your end of memory falls within a section - what would be the
>> right node/zone for such a memory hole at the end of the section?
> 
> Assuming a hole is not MAX_ORDER-aligned but there is real memory within
> the page block, then the node/zone for the struct pages backing the hole
> should match the real memorys node and zone.
> 
> As it stands, with the uninitialised node/zone, certain checks like
> page_is_buddy(): page_zone_id(page) != page_zone_id(buddy) may only
> work by co-incidence. page_is_buddy() happens to work anyway because
> PageBuddy(buddy) would never be true for a PageReserved page.
> 
>> With
>> memory hotplug after such a hole, we can easily have multiple
>> nodes/zones spanning such a hole, unknown before hotplug.
>>
> 
> When hotplugged, the same logic would apply. Where the hole is not aligned,
> the struct page linkages should match the "real" memory".
> 
>>> It would partially paper over the issue that setting the pageblock type
>>> based on a reserved page. I agree that compaction should not be returning
>>> pfns that are outside of the zone range because that is buggy in itself
>>> but valid struct pages should have valid information. I don't think we
>>> want to paper over that with unnecessary PageReserved checks.
>>
>> Agreed as long as we can handle that issue using range checks.
>>
> 
> I think it'll be ok as long as the struct pages within a 1<<(MAX_ORDER-1)
> range have proper linkages.

Agreed.
Mike Rapoport Nov. 25, 2020, 2:13 p.m. UTC | #17
On Wed, Nov 25, 2020 at 02:32:02PM +0100, David Hildenbrand wrote:
> On 25.11.20 13:08, Vlastimil Babka wrote:
> > On 11/25/20 6:34 AM, Andrea Arcangeli wrote:
> >> Hello,
> >>
> >> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> >>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> >>>> A corollary issue was fixed in
> >>>> 39639000-39814fff : Unknown E820 type
> >>>>
> >>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
> >>>>
> >>>> 7a17b000-7a216fff : Unknown E820 type
> >>>
> >>> It would be nice to also provide a /proc/zoneinfo and how exactly the 
> >>> "zone_spans_pfn" was violated. I assume we end up below zone's 
> >>> start_pfn, but is it true?
> >>
> >> Agreed, I was about to grab that info along with all page struct
> >> around the pfn 0x7a200 and phys address 0x7a216fff.
> >>
> >> # grep -A1 E820 /proc/iomem
> >> 7a17b000-7a216fff : Unknown E820 type
> >> 7a217000-7bffffff : System RAM
> >>
> >> DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1
> >> DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0
> >> Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1
> >> Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0
> > 
> > So the above means that around the "Unknown E820 type" we have:
> > 
> > pfn 499712 - start of pageblock in ZONE_DMA32
> > pfn 500091 - start of the "Unknown E820 type" range
> > pfn 500224 - start of another pageblock
> > pfn 500246 - end of "Unknown E820 type"
> > 
> > So this is indeed not a zone boundary issue, but basically a hole not 
> > aligned to pageblock boundary and really unexpected.
> > We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures 
> > that do this, and even that config only affects pfn_valid_within(). But 
> > here pfn_valid() is true, but the zone/node linkage is unexpected.
> > 
> >> However the real bug seems that reserved pages have a zero zone_id in
> >> the page->flags when it should have the real zone id/nid. The patch I
> >> sent earlier to validate highest would only be needed to deal with
> >> pfn_valid.
> >>
> >> Something must have changed more recently than v5.1 that caused the
> >> zoneid of reserved pages to be wrong, a possible candidate for the
> >> real would be this change below:
> >>
> >> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> >>
> >> Even if it may not be it, at the light of how the reserved page
> >> zoneid/nid initialized went wrong, the above line like it's too flakey
> >> to stay.
> >>
> >> It'd be preferable if the pfn_valid fails and the
> >> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> >> step. Even better memset 0xff over the whole page struct until the
> >> second stage comes around.
> >>
> >> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> >> all times, otherwise if the second stage fails we end up in a bug with
> >> weird side effects.
> > 
> > Yeah I guess it would be simpler if zoneid/nid was correct for 
> > pfn_valid() pfns within a zone's range, even if they are reserved due 
> > not not being really usable memory.
> > 
> > I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
> > chosen solution is to make this to a real hole, the hole should be 
> > extended to MAX_ORDER_NR_PAGES aligned boundaries.
> 
> As we don't punch out pages of the memmap on x86-64, pfn_valid() keeps
> working as expected. There is a memmap that can be accessed and that was
> initialized.

I suspect that the memmap for the reserved pages is not properly initialized
after recent changes in free_area_init(). They are cleared at
init_unavailable_mem() to have zone=0 and node=0, but they seem to never
be re-initialized with proper zone and node links, which was not the
case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN").

Back then, memmap_init_zone() looped from zone_start_pfn till
zone_end_pfn and struct page for reserved pages with pfns inside the
zone would be initialized.

Now the loop is over the intersection of [zone_start_pfn, zone_end_pfn] with
memblock.memory, and for x86 the reserved ranges are not in memblock.memory,
so the memmap for them remains semi-initialized.
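
For reference, the loop in question (visible in the context of the patch
earlier in the thread) only walks memblock.memory, roughly:

	/* simplified: pfns that only exist in memblock.reserved are never passed
	 * to memmap_init_zone(), so they keep the zone=0/node=0 state set by
	 * init_unavailable_mem() */
	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
		if (end_pfn > start_pfn)
			memmap_init_zone(end_pfn - start_pfn, nid, zone, start_pfn,
					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
	}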

> It's really just a matter of how to handle memory holes in
> this scenario.
> 
> a) Try initializing them to the covering node/zone (I gave one example
>    that might be tricky with hotplug)
> b) Mark such pages (either special node/zone or pagetype) and make pfn
>    walkers ignore these holes. For now, this can only be done via the
>    reserved flag.
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand Nov. 25, 2020, 2:42 p.m. UTC | #18
On 25.11.20 15:13, Mike Rapoport wrote:
> On Wed, Nov 25, 2020 at 02:32:02PM +0100, David Hildenbrand wrote:
>> On 25.11.20 13:08, Vlastimil Babka wrote:
>>> On 11/25/20 6:34 AM, Andrea Arcangeli wrote:
>>>> Hello,
>>>>
>>>> On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
>>>>> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
>>>>>> A corollary issue was fixed in
>>>>>> 39639000-39814fff : Unknown E820 type
>>>>>>
>>>>>> pfn 0x7a200 -> 0x7a200000 min_pfn hit non-RAM:
>>>>>>
>>>>>> 7a17b000-7a216fff : Unknown E820 type
>>>>>
>>>>> It would be nice to also provide a /proc/zoneinfo and how exactly the 
>>>>> "zone_spans_pfn" was violated. I assume we end up below zone's 
>>>>> start_pfn, but is it true?
>>>>
>>>> Agreed, I was about to grab that info along with all page struct
>>>> around the pfn 0x7a200 and phys address 0x7a216fff.
>>>>
>>>> # grep -A1 E820 /proc/iomem
>>>> 7a17b000-7a216fff : Unknown E820 type
>>>> 7a217000-7bffffff : System RAM
>>>>
>>>> DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1
>>>> DMA32    zone_start_pfn 4096         zone_end_pfn() 1048576      contiguous 0
>>>> Normal   zone_start_pfn 1048576      zone_end_pfn() 4715392      contiguous 1
>>>> Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0
>>>
>>> So the above means that around the "Unknown E820 type" we have:
>>>
>>> pfn 499712 - start of pageblock in ZONE_DMA32
>>> pfn 500091 - start of the "Unknown E820 type" range
>>> pfn 500224 - start of another pageblock
>>> pfn 500246 - end of "Unknown E820 type"
>>>
>>> So this is indeed not a zone boundary issue, but basically a hole not 
>>> aligned to pageblock boundary and really unexpected.
>>> We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures 
>>> that do this, and even that config only affects pfn_valid_within(). But 
>>> here pfn_valid() is true, but the zone/node linkage is unexpected.
>>>
>>>> However the real bug seems that reserved pages have a zero zone_id in
>>>> the page->flags when it should have the real zone id/nid. The patch I
>>>> sent earlier to validate highest would only be needed to deal with
>>>> pfn_valid.
>>>>
>>>> Something must have changed more recently than v5.1 that caused the
>>>> zoneid of reserved pages to be wrong, a possible candidate for the
>>>> real would be this change below:
>>>>
>>>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
>>>>
>>>> Even if it may not be it, at the light of how the reserved page
>>>> zoneid/nid initialized went wrong, the above line like it's too flakey
>>>> to stay.
>>>>
>>>> It'd be preferable if the pfn_valid fails and the
>>>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
>>>> step. Even better memset 0xff over the whole page struct until the
>>>> second stage comes around.
>>>>
>>>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
>>>> all times, otherwise if the second stage fails we end up in a bug with
>>>> weird side effects.
>>>
>>> Yeah I guess it would be simpler if zoneid/nid was correct for 
>>> pfn_valid() pfns within a zone's range, even if they are reserved due 
>>> not not being really usable memory.
>>>
>>> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
>>> chosen solution is to make this to a real hole, the hole should be 
>>> extended to MAX_ORDER_NR_PAGES aligned boundaries.
>>
>> As we don't punch out pages of the memmap on x86-64, pfn_valid() keeps
>> working as expected. There is a memmap that can be accessed and that was
>> initialized.
> 
> I suspect that memmap for the reserved pages is not properly initialized
> after recent changes in free_area_init(). They are cleared at
> init_unavailable_mem() to have zone=0 and node=0, but they seem to be
> never re-initialized with proper zone and node links which was not the
> case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock
> regions rather that check each PFN").
> 
> Back then, memmap_init_zone() looped from zone_start_pfn till
> zone_end_pfn and struct page for reserved pages with pfns inside the
> zone would be initialized.
> 
> Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with
> memblock.memory and for x86 reserved ranges are not in memblock.memory,
> so the memmap for them remains semi-initialized.

As far as I understood Mel, rounding these ranges up/down to cover full
MAX_ORDER blocks/pageblocks might work.
Andrea Arcangeli Nov. 25, 2020, 5:59 p.m. UTC | #19
On Wed, Nov 25, 2020 at 10:30:53AM +0000, Mel Gorman wrote:
> On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote:
> > Hello,
> > 
> > On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote:
> > > I would hope that is not the case because they are not meant to overlap.
> > > However, if the beginning of the pageblock was not the start of a zone
> > > then the pages would be valid but the pfn would still be outside the
> > > zone boundary. If it was reserved, the struct page is valid but not
> > > suitable for set_pfnblock_flags_mask. However, it is a concern in
> > > general because the potential is there that pages are isolated from the
> > > wrong zone.
> > 
> > I guess we have more than one issue to correct in that function
> > because the same BUG_ON reproduced again even with the tentative patch
> > I posted earlier.
> > 
> > So my guess is that the problematic reserved page isn't pointed by the
> > min_pfn, but it must have been pointed by the "highest" variable
> > calculated below?
> > 
> > 			if (pfn >= highest)
> > 				highest = pageblock_start_pfn(pfn);
> > 
> > When I looked at where "highest" comes from, it lacks
> > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).
> > 
> > Is that the real bug, which may be fixed by something like this? (untested)
> > 
> 
> It's plausible as it is a potential source of leaking but as you note
> in another mail, it's surprising to me that valid struct pages, even if
> within memory holes and reserved would have broken node/zone information
> in the page flags.

I think the patch to add pageblock_pfn_to_page is still needed to cope
with !pfn_valid or a pageblock in between zones, but !pfn_valid or a
pageblock in between zones is not what happens here.

So the patch adding pageblock_pfn_to_page would have had the undesired
side effect of hiding the bug so it's best to deal with the other bug
first.
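
For reference, this is roughly the shape of that untested idea (a sketch
only, not the exact hunk posted in the thread; it assumes the
surrounding loop in fast_isolate_freepages() where cc->zone is
available): only advance "highest" once the pageblock is confirmed to
belong to the zone being compacted:

	if (pfn >= highest &&
	    pageblock_pfn_to_page(pageblock_start_pfn(pfn),
				  pageblock_end_pfn(pfn), cc->zone))
		highest = pageblock_start_pfn(pfn);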
Andrea Arcangeli Nov. 25, 2020, 6:28 p.m. UTC | #20
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> Before that change, the memmap of memory holes were only zeroed
> out. So the zones/nid was 0, however, pages were not reserved and
> had a refcount of zero - resulting in other issues.

So maybe that "0,0" zoneid/nid was not actually the thing that
introduced the regression? Note: I didn't bisect anything yet, it was
just a guess.

In fact, the first thing I need to do is boot an old v5.3 kernel and
dump the page->flags around the "Unknown E820 type" region, to be 100%
certain that the older kernels had correct page->flags for reserved
pages. I will clear this up before the end of the day.

> Most pfn walkers shouldn‘t mess with reserved pages and simply skip
> them. That would be the right fix here.

What would then be the advantage of keeping wrong page->flags in
reserved pages compared to actually setting them correctly?

How do you plan to enforce that every bit of VM code will call
PageReserved (it would have to be introduced in pageblock_pfn_to_page in
that case) if it's not possible to disambiguate a real DMA zone and nid
0 from the uninitialized value? How can we patch page_zonenum/page_to_nid
to enforce they won't read an uninitialized value because a PageReserved
check was missing in the caller?

I don't see this as easily enforceable by code review; it'd be pretty
flaky to leave a wrong 0,0 value in there with no way to verify whether
a PageReserved check in the caller was missing.
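
For context, this is roughly how the zoneid/nid are read back out of
page->flags (simplified from include/linux/mm.h; the real page_to_nid()
has an extra poison check): with the memmap left at 0,0 these return
node 0 / ZONE_DMA, indistinguishable from a legitimately initialized
page.

	static inline enum zone_type page_zonenum(const struct page *page)
	{
		return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
	}

	static inline int page_to_nid(const struct page *page)
	{
		return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
	}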

> > It'd be preferable if the pfn_valid fails and the
> > pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> > step. Even better memset 0xff over the whole page struct until the
> > second stage comes around.
> 
> I recently discussed with Baoquan to
> 1. Using a new pagetype to mark memory holes
> 2. Using special nid/zid to mark memory holes
> 
> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.

What would need to call pfn_zone in between first and second stage?

If something calls pfn_zone in between first and second stage isn't it
a feature if it crashes the kernel at boot?

Note: I suggested 0xff kernel crashing "until the second stage comes
around" during meminit at boot, not permanently.

		/*
		 * Use a fake node/zone (0) for now. Some of these pages
		 * (in memblock.reserved but not in memblock.memory) will
		 * get re-initialized via reserve_bootmem_region() later.
		 */

Specifically I relied on the comment "get re-initialized via
reserve_bootmem_region() later".

I assumed the second stage overwrites the 0,0 with the real zoneid/nid
value, which is clearly not happening, hence it'd be preferable to get
a crash at boot reliably.

Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
calling init_reserved_page(start_pfn) won't do much with
CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
page->flags were still wrong for reserved pages in the "Unknown E820
type" region.

> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
> 
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?

The zoneid is always a purely abstract concept: not just the RAM holes,
even the RAM itself doesn't have a zoneid at all in hardware.

The nid is hardware-defined for RAM, but for reserved RAM holes it
becomes a pure software construct just like the zoneid.

So while it's right that they have no zone/nid in the hardware, they
still have a very well defined zoneid/nid in the software implementation.

The page struct belongs to one and only one mem_map array, which has a
specific zoneid and nid. So it can always be initialized right, even
for non-RAM.

Only if the pfn is !pfn_valid is there no page struct and hence no
zone/nid association, but in that case there's no reason to even worry
about it, since nobody can possibly get a wrong value out of the page
struct because there is no page struct in that case.

Last but not least, RAM pages can be marked reserved and assigned to
hardware, so it'd be really messy if real reserved RAM pages given to
hw behaved differently from non-RAM that accidentally got a struct page
because of section alignment issues.

Thanks,
Andrea
Andrea Arcangeli Nov. 25, 2020, 6:47 p.m. UTC | #21
On Wed, Nov 25, 2020 at 12:41:55PM +0100, David Hildenbrand wrote:
> On 25.11.20 12:04, David Hildenbrand wrote:
> > On 25.11.20 11:39, Mel Gorman wrote:
> >> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >>>> Something must have changed more recently than v5.1 that caused the
> >>>> zoneid of reserved pages to be wrong, a possible candidate for the
> >>>> real would be this change below:
> >>>>
> >>>> +               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> >>>>
> >>>
> >>> Before that change, the memmap of memory holes were only zeroed out. So the zones/nid was 0, however, pages were not reserved and had a refcount of zero - resulting in other issues.
> >>>
> >>> Most pfn walkers shouldn't mess with reserved pages and simply skip them. That would be the right fix here.
> >>>
> >>
> >> Ordinarily yes, pfn walkers should not care about reserved pages but it's
> >> still surprising that the node/zone linkages would be wrong for memory
> >> holes. If they are in the middle of a zone, it means that a hole with
> >> valid struct pages could be mistaken for overlapping nodes (if the hole
> >> was in node 1 for example) or overlapping zones which is just broken.
> > 
> > I agree within zones - but AFAIU, the issue is reserved memory between
> > zones, right?
> 
> Double checking, I was confused. This applies also to memory holes
> within zones in x86.

Yes this is a memory hole within the DMA32 zone.

Still, why should there be any difference?

As long as a page struct exists, it's in a well-defined mem_map array
which corresponds to one and only one zoneid/nid combination.

So what would be the benefit of treating memory holes within zones or
in between zones differently and leaving one or the other with an
uninitialized zoneid/nid?
Andrea Arcangeli Nov. 25, 2020, 7:01 p.m. UTC | #22
On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote:
> Yeah I guess it would be simpler if zoneid/nid was correct for 
> pfn_valid() pfns within a zone's range, even if they are reserved due 
> not not being really usable memory.
> 
> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
> chosen solution is to make this to a real hole, the hole should be 
> extended to MAX_ORDER_NR_PAGES aligned boundaries.

The way pfn_valid works, it's not possible to render all non-RAM pfns
as !pfn_valid; CONFIG_HOLES_IN_ZONE would not achieve it 100% either. So
I don't think we can rely on that to eliminate all non-RAM reserved
pages from the mem_map and avoid having to initialize them in the
first place. Some could remain, as in this case, since in the same
pageblock there's non-RAM followed by RAM and all pfns are valid.

> In any case, compaction code can't fix this with better range checks.

David's correct that it can, by adding enough PageReserved checks (I'm
running all systems reproducing this with plenty of PageReserved
checks in all places to work around it until we do a proper fix).

My problem with that is that 1) it's simply not enforceable at runtime
that no PageReserved check is missing, and 2) what benefit would it
provide to leave a wrong zoneid in reserved pages while having to
add extra PageReserved checks?
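
For illustration, the workaround boils down to something like this in
any pfn walker (a sketch only; low_pfn/high_pfn are just assumed loop
bounds, not a specific upstream function):

	for (pfn = low_pfn; pfn < high_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		/* zoneid/nid in page->flags can't be trusted here */
		if (PageReserved(page))
			continue;
		/* ... normal processing ... */
	}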

A struct page has a deterministic zoneid/nid; if it's pointed to by a
valid pfn (as in pfn_valid()), the simplest is that the zoneid/nid in
the page remains correct no matter whether it's reserved at boot or was
marked reserved by a driver that swaps the page somewhere else with the
GART or EFI or something else. All reserved pages should work the
same, RAM and non-RAM, since the non-RAM status can basically change
at runtime if a driver assigns the page to hw somehow.

NOTE: on the compaction side, we still need to add the
pageblock_pfn_to_page check to validate the "highest" pfn, because the
pfn_valid() check is missing on the first pfn of the pageblock, as is
the check for a pageblock that spans two different zones.

Thanks,
Andrea
Andrea Arcangeli Nov. 25, 2020, 7:14 p.m. UTC | #23
On Wed, Nov 25, 2020 at 04:13:25PM +0200, Mike Rapoport wrote:
> I suspect that memmap for the reserved pages is not properly initialized
> after recent changes in free_area_init(). They are cleared at
> init_unavailable_mem() to have zone=0 and node=0, but they seem to be

I'd really like it if we did not leave those at 0,0 and instead set the
whole struct page to 0xff whenever we miss the second stage that
corrects the uninitialized value. The hope is that it'll crash faster
and more reproducibly that way.

> never re-initialized with proper zone and node links which was not the
> case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock
> regions rather that check each PFN").

What's strange is that 73a6e474cb37 was suggested as fix for this
bug...

	https://lkml.kernel.org/r/20200505124314.GA5029@MiWiFi-R3L-srv

The addition of "pageblock_pfn_to_page" to validate min_pfn was added
in commit 73a6e474cb37, so I assumed that the first report below
didn't have commit 73a6e474cb37 already applied.

	https://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

However if you're correct perhaps the patch was already applied in
5.7.0-rc2-next-20200423+, it landed upstream in v5.8 after all.

> Back then, memmap_init_zone() looped from zone_start_pfn till
> zone_end_pfn and struct page for reserved pages with pfns inside the
> zone would be initialized.
> 
> Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with
> memblock.memory and for x86 reserved ranges are not in memblock.memory,
> so the memmap for them remains semi-initialized.

That would match the symptoms. I'll test it first thing after
confirming that older kernels had the right zoneid/nid on the reserved
pages.

Thanks,
Andrea
David Hildenbrand Nov. 25, 2020, 7:27 p.m. UTC | #24
On 25.11.20 19:28, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>> Before that change, the memmap of memory holes were only zeroed
>> out. So the zones/nid was 0, however, pages were not reserved and
>> had a refcount of zero - resulting in other issues.
> 
> So maybe that "0,0" zoneid/nid was not actually the thing that
> introduced the regression? Note: I didn't bisect anything yet, it was
> just a guess.

I guess 0/0 is the issue, but that existed before when we had a simple
memset(0). The root issue should be what Mike said:

73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
that check each PFN")


>> Most pfn walkers shouldn‘t mess with reserved pages and simply skip
>> them. That would be the right fix here.
> 
> What would then be the advantage of keeping wrong page->flags in
> reserved pages compared to actually set it correct?

"correct" is problematic. If you have an actual memory hole, there is
not always a right answer - unless I am missing something important.


Assume you have a layout like this

[  zone X ] [ hole ] [ zone Y ]

If either X and or Y starts within a memory section, you have a valid
memmap for X - but what would be the right node/zone?


Assume you have a layout like this

[ zone X ]

whereby X ends inside a memory section. Then you hotplug memory. Assume
it goes to X

[ zone X ][ hole in X ][ zone X]

or it goes to Y

[ zone X ][ hole ][ zone Y ]

This can easily be reproduced by starting a VM in qemu with a memory
size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory.

> 
> How do you plan to enforce that every VM bit will call PageReserved
> (as it should be introduced in pageblock_pfn_to_page in that case) if
> it's not possible to disambiguate a real DMA zone and nid 0 from the
> uninitialized value? How can we patch page_nodenum to enforce it won't
> read an uninitialized value because a PageReserved check was missing
> in the caller?

We can't. The general rule is (as I was once told by Michal IIRC) that
there is no trusting the memmap content when the reserved bit is set -
unless you know "why it was set" - e.g., to distinguish early boot
allocations from ordinary allocations.

See below for pointer to documentation.

> 
> I don't see this easily enforceable by code review, it'd be pretty
> flakey to leave a 0,0 wrong value in there with no way to verify if a
> PageResered check in the caller was missing.

I'm not rooting for "keep this at 0/0" - I'm saying that I think there
are corner cases where it might not be that easy.

> 
>>> It'd be preferable if the pfn_valid fails and the
>>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
>>> step. Even better memset 0xff over the whole page struct until the
>>> second stage comes around.
>>
>> I recently discussed with Baoquan to
>> 1. Using a new pagetype to mark memory holes
>> 2. Using special nid/zid to mark memory holes
>>
>> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.
> 
> What would need to call pfn_zone in between first and second stage?
> 
> If something calls pfn_zone in between first and second stage isn't it
> a feature if it crashes the kernel at boot?
> 
> Note: I suggested 0xff kernel crashing "until the second stage comes
> around" during meminit at boot, not permanently.

Yes, then it makes sense - if we're able to come up with a way to
initialize any memmap we might have - including actual memory holes that
have a memmap.

> 
> 		/*
> 		 * Use a fake node/zone (0) for now. Some of these pages
> 		 * (in memblock.reserved but not in memblock.memory) will
> 		 * get re-initialized via reserve_bootmem_region() later.
> 		 */
> 
> Specifically I relied on the comment "get re-initialized via
> reserve_bootmem_region() later".

Yes, but there is a "Some of these" :)

Boot a VM with "-M 4000" and observe the memmap in the last section -
they won't get initialized a second time.

> 
> I assumed the second stage overwrites the 0,0 to the real zoneid/nid
> value, which is clearly not happening, hence it'd be preferable to get
> a crash at boot reliably.
> 
> Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
> calling init_reserved_page(start_pfn) won't do much with
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
> page->flags were still wrong for reserved pages in the "Unknown E820
> type" region.
> 
>>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
>>> all times, otherwise if the second stage fails we end up in a bug with
>>> weird side effects.
>>
>> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
> 
> zoneid is always a pure abstract concept, not just the RAM-holes but
> the RAM itself doesn't have a zoneid at all.
> 
> Nid is hardware defined for RAM, but for reserved RAM holes it becomes
> a pure software construct as the zoneid.
> 
> So while it's right that they have no zone/nid in the hardware, they
> still have a very well defined zoneid/nid in the software implementation.
> 
> The page struct belongs to one and only one mem_map array, that has a
> specific nodeid and nid. So it can be always initialized right even
> for non-RAM.

AFAIK, the mem_map array might have multiple NIDs - and it's set when
initializing the zones.

> 
> Only if the pfn is !pfn_valid, then it has no page struct and then
> there's no zone/nid association, but in that case there's no reason to
> even worry about it since nobody can possibly get a wrong value out of
> the page struct because there's no page struct in the case.
> 
> Last but not the least, RAM pages can be marked reserved and assigned
> to hardware and so it'd be really messy if real reserved RAM pages
> given to hw, behaved different than non-RAM that accidentally got a
> struct page because of section alignment issues.

Well, "reserved" is not a good indication "what" something actually is.

I documented that a while ago in include/linux/page-flags.h

"PG_reserved is set for special pages. The "struct page" of such a page
 should in general not be touched (e.g. set dirty) except by its owner.
 Pages marked as PG_reserved include:."

I suggest looking at that.

AFAIR, we have been setting *most* memmap in memory holes/non-ram
reserved for a long time - long before I added the __init_single_page -
see init_reserved_page() for example.
David Hildenbrand Nov. 25, 2020, 7:33 p.m. UTC | #25
On 25.11.20 20:01, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote:
>> Yeah I guess it would be simpler if zoneid/nid was correct for 
>> pfn_valid() pfns within a zone's range, even if they are reserved due 
>> not not being really usable memory.
>>
>> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
>> chosen solution is to make this to a real hole, the hole should be 
>> extended to MAX_ORDER_NR_PAGES aligned boundaries.
> 
> The way pfn_valid works it's not possible to render all non-RAM pfn as
> !pfn_valid, CONFIG_HOLES_IN_ZONE would not achieve it 100% either. So

Well, we could do it the arm64 way and provide a custom pfn_valid() and
check memblock for RAM - please don't! :D
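
(For reference, that arm64 approach is roughly the following, heavily
simplified from arch/arm64/mm/init.c of that era, with the SPARSEMEM
section checks omitted:)

	int pfn_valid(unsigned long pfn)
	{
		phys_addr_t addr = pfn << PAGE_SHIFT;

		if ((addr >> PAGE_SHIFT) != pfn)
			return 0;
		return memblock_is_map_memory(addr);
	}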

> I don't think we can rely on that to eliminate all non-RAM reserved
> pages from the mem_map and avoid having to initialize them in the
> first place. Some could remain as in this case since in the same
> pageblock there's non-RAM followed by RAM and all pfn are valid.
> 
>> In any case, compaction code can't fix this with better range checks.
> 
> David's correct that it can, by adding enough PageReserved (I'm
> running all systems reproducing this with plenty of PageReserved
> checks in all places to work around it until we do a proper fix).
> 
> My problem with that is that 1) it's simply non enforceable at runtime
> that there is not missing PageReserved check and 2) what benefit it
> would provide to leave a wrong zoneid in reserved pages and having to
> add extra PageReserved checks?

See my other mail. If we have a clean way to set *any* memmap (non-RAM,
memory holes at any place) to a proper nid/zid, then we won't need
reserved checks. I raised some cases that need more thought than a
simple "hole in zone".
Andrea Arcangeli Nov. 25, 2020, 8:41 p.m. UTC | #26
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >> Before that change, the memmap of memory holes were only zeroed
> >> out. So the zones/nid was 0, however, pages were not reserved and
> >> had a refcount of zero - resulting in other issues.
> > 
> > So maybe that "0,0" zoneid/nid was not actually the thing that
> > introduced the regression? Note: I didn't bisect anything yet, it was
> > just a guess.
> 
> I guess 0/0 is the issue, but that existed before when we had a simple
> memmset(0). The root issue should be what Mike said:

Yes, the second stage must have stopped running somehow.

Is there anything we can do to induce a deterministically reproducible
kernel crashing behavior if the second stage doesn't run?

Why did we start doing a more graceful initialization in the first
stage, instead of making it less graceful by setting it to 0xff instead
of 0x00?

> 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")

So if that's not intentional, are you suggesting the zoneid/nid was a
bug if it was set to 0,0 for a non-RAM valid pfn?

> "correct" is problematic. If you have an actual memory hole, there is
> not always a right answer - unless I am missing something important.
> 
> 
> Assume you have a layout like this
> 
> [  zone X ] [ hole ] [ zone Y ]
> 
> If either X and or Y starts within a memory section, you have a valid
> memmap for X - but what would be the right node/zone?
> 
> 
> Assume you have a layout like this
> 
> [ zone X ]
> 
> whereby X ends inside a memory section. The you hotplug memory. Assume
> it goes to X
> 
> [ zone X ][ hole in X ][ zone X]
> 
> or it goes to y
> 
> [ zone X ][ hole ][ zone Y ]
> 
> This can easily be reproduced by starting a VM in qemu with a memory
> size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory.

I don't get what the problem is sorry.

You have a pfn, if pfn_valid() is true, pfn_to_page returns a page
deterministically.

It's up to the kernel to decide which page structure belongs to any pfn
in the pfn_to_page function.

Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id
in page->flags points to a node/zone whose zone_start_pfn -
zone_end_pfn range doesn't contain "pfn", that is a bug in
page_alloc.c.

I don't see how it is not possible to deterministically enforce that
the above never happens. Only then would it be true that there's not
always a right answer.

Zones can overlap, but it can't be that you do pfn_to_page of a
valid pfn and obtain a page whose zone doesn't contain that
pfn. Which is what is currently crashing compaction.
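
Spelled out as a sketch, the invariant being violated is the following
(zone_spans_pfn() and VM_BUG_ON_PAGE() are existing helpers; the
wrapping check itself is hypothetical):

	/*
	 * For every valid pfn, the zone recorded in page->flags must
	 * span that pfn.
	 */
	if (pfn_valid(pfn)) {
		struct page *page = pfn_to_page(pfn);

		VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
	}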

I don't see how this is an unsolvable problem and why we should accept
to live with a bogus page->flags for reserved pages.

> We can't. The general rule is (as I was once told by Michal IIRC) that

The fact that we can't crash the kernel reliably when somebody uses the
wrong, uninitialized 0,0 value because an explicit PageReserved check is
missing is my primary concern with keeping that zoneid/nid uninitialized
but non-kernel-crashing, since it already created this unreproducible bug.

> I'm not rooting for "keep this at 0/0" - I'm saying that I think there
> are corner cases where it might not be that easy.

I'm not saying it's easy. What I don't see is how you don't always
have the right answer and why it would be an unsolvable problem.

It is certainly problematic and difficult to solve in the mem_map
initialization logic, but to me having pfn_valid() &&
page_zone(pfn_to_page(pfn)) randomly returning the DMA zone on the first
node also looks problematic and difficult to handle across all VM
code, so overall it looks preferable to keep the complexity of the
mem_map initialization self-contained and not have it spill over into
the rest of the VM.

> Yes, but there is a "Some of these" :)
> 
> Boot a VM with "-M 4000" and observe the memmap in the last section -
> they won't get initialized a second time.

Is the memmap beyond the end of the zone yet another case? I guess
that's less likely to give us problems because it's beyond the end of
the zone. Would pfn_valid return true for those pfns? If pfn_valid is
not true it's not really a concern, but then again I'd prefer it if
those struct pages beyond the end of the zone were set to the
kernel-crashing 0xff.

In other words I just don't see why we should ever prefer to leave
some pages at a graceful but erroneous zoneid 0 / nid 0 that wouldn't
easily induce a crash if used.

> AFAIK, the mem_map array might have multiple NIDs - and it's set when
> initializing the zones.

Well because there's no mem_map array with SPARSEMEM, but it's not
conceptually too different than if there was one. Even with flatmem
there could be multiple page struct for each pfn, the disambiguation
has to be handled by pfn_to_page regardless of SPARSEMEM or not.

The point is that if page_zone(pfn_to_page(pfn)) points to the DMA zone
of the first node, and the pfn isn't part of the DMA zone of the first
node, that looks like a bug, and it can be enforced that it doesn't
happen.

> Well, "reserved" is not a good indication "what" something actually is.
> 
> I documented that a while ago in include/linux/page-flags.h
> 
> "PG_reserved is set for special pages. The "struct page" of such a page
>  should in general not be touched (e.g. set dirty) except by its owner.
>  Pages marked as PG_reserved include:."
> 
> I suggest looking at that.
> 
> AFAIR, we have been setting *most* memmap in memory holes/non-ram
> reserved for a long time - long before I added the __init_single_page -
> see init_reserved_page() for example.

Sure, non-RAM with valid page struct always has been marked
PG_reserved. I wasn't suggesting that it shouldn't be PG_reserved.

I was pointing out that RAM can also be marked PG_reserved later by
the kernel, long after boot, as you mentioned for all other cases of
PG_reserved; the most notable are drivers marking pages PG_reserved
after allocating RAM, e.g. via vmalloc, or the GART swapping RAM around
at another aliased physical address.

That is all born as RAM at boot, it gets page->flags done right, with
the right zoneid, and it becomes PG_reserved later.

So I was suggesting physical ranges "pfn" of non-RAM (be those holes
withtin zones, or in between zones doesn't matter) with a pfn_valid
returning true and a pfn_to_page pointing deterministically to one and
only one struct page, should have such struct page initialized exactly
the same as if it was RAM.

Either that or we can define a new NO_ZONE NO_ID id and crash in
page_zonenum or page_to_nid if it is ever called on such a page
struct.
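
A minimal sketch of that second option, with purely hypothetical names
(neither NID_POISON nor the checked helper exist upstream):

	/* Hypothetical sentinel: a node id no real node can have. */
	#define NID_POISON	(MAX_NUMNODES - 1)

	static inline int page_to_nid_checked(const struct page *page)
	{
		int nid = page_to_nid(page);

		/* crash loudly if an uninitialized memmap entry is read */
		VM_BUG_ON_PAGE(nid == NID_POISON, page);
		return nid;
	}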

Thanks,
Andrea
Mike Rapoport Nov. 25, 2020, 9:04 p.m. UTC | #27
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >
> > What would need to call pfn_zone in between first and second stage?
> > 
> > If something calls pfn_zone in between first and second stage isn't it
> > a feature if it crashes the kernel at boot?
> > 
> > Note: I suggested 0xff kernel crashing "until the second stage comes
> > around" during meminit at boot, not permanently.
> 
> Yes, then it makes sense - if we're able to come up with a way to
> initialize any memmap we might have - including actual memory holes that
> have a memmap.
> 
> > 
> > 		/*
> > 		 * Use a fake node/zone (0) for now. Some of these pages
> > 		 * (in memblock.reserved but not in memblock.memory) will
> > 		 * get re-initialized via reserve_bootmem_region() later.
> > 		 */
> > 
> > Specifically I relied on the comment "get re-initialized via
> > reserve_bootmem_region() later".
> 
> Yes, but there is a "Some of these" :)
> 
> Boot a VM with "-M 4000" and observe the memmap in the last section -
> they won't get initialized a second time.
> 
> > 
> > I assumed the second stage overwrites the 0,0 to the real zoneid/nid
> > value, which is clearly not happening, hence it'd be preferable to get
> > a crash at boot reliably.
> > 
> > Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
> > calling init_reserved_page(start_pfn) won't do much with
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
> > page->flags were still wrong for reserved pages in the "Unknown E820
> > type" region.

I think the very root cause is how e820__memblock_setup() registers
memory with memblock:

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);

From that point on the system has an inconsistent view of RAM in both
memblock.memory and memblock.reserved, which is then translated into
the memmap etc.

Unfortunately, simply adding all RAM to memblock is not possible, as
there are systems for which "the addresses listed in the reserved
range must never be accessed, or (as we discovered) even be reachable by
an active page table entry" [1].

[1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/
David Hildenbrand Nov. 25, 2020, 9:13 p.m. UTC | #28
> Am 25.11.2020 um 21:41 schrieb Andrea Arcangeli <aarcange@redhat.com>:
> 
> On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
>>> On 25.11.20 19:28, Andrea Arcangeli wrote:
>>> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>>>> Before that change, the memmap of memory holes were only zeroed
>>>> out. So the zones/nid was 0, however, pages were not reserved and
>>>> had a refcount of zero - resulting in other issues.
>>> 
>>> So maybe that "0,0" zoneid/nid was not actually the thing that
>>> introduced the regression? Note: I didn't bisect anything yet, it was
>>> just a guess.
>> 
>> I guess 0/0 is the issue, but that existed before when we had a simple
>> memmset(0). The root issue should be what Mike said:
> 
> Yes, the second stage must have stopped running somehow.
> 
> Is there anything we can do to induce a deterministically reproducible
> kernel crashing behavior if the second stage doesn't run?
> 
> Why did we start doing a more graceful initialization in the first
> stage, instead of making a less graceful by setting it to 0xff instead
> of 0x00?

I guess because we weren‘t aware of the issues we have :)

> 
>> 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
>> that check each PFN")
> 
> So if that's not intentional, are you suggesting nodeid/nid was a bug
> if it was set to 0,0 for a non-RAM valid pfn?
> 

Depends on how we think checks for reserved pages should be performed. I am more in favor of indicating „this memmap is just garbage, skip it". If the reserved flag is not good enough, then via a special node/zone - as you also suggest below.

>> "correct" is problematic. If you have an actual memory hole, there is
>> not always a right answer - unless I am missing something important.
>> 
>> 
>> Assume you have a layout like this
>> 
>> [  zone X ] [ hole ] [ zone Y ]
>> 
>> If either X and or Y starts within a memory section, you have a valid
>> memmap for X - but what would be the right node/zone?
>> 
>> 
>> Assume you have a layout like this
>> 
>> [ zone X ]
>> 
>> whereby X ends inside a memory section. The you hotplug memory. Assume
>> it goes to X
>> 
>> [ zone X ][ hole in X ][ zone X]
>> 
>> or it goes to y
>> 
>> [ zone X ][ hole ][ zone Y ]
>> 
>> This can easily be reproduced by starting a VM in qemu with a memory
>> size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory.
> 
> I don't get what the problem is sorry.
> 
> You have a pfn, if pfn_valid() is true, pfn_to_page returns a page
> deterministically.
> 
> It's up to the kernel to decide which page structure blongs to any pfn
> in the pfn_to_page function.
> 
> Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id
> in page->flags points to a node->zone whose zone_start_pfn -
> end_zone_pfn range doesn't contain "pfn" that is a bug in
> page_alloc.c.
> 
> I don't see how is it not possible to deterministically enforce the
> above never happens. Only then it would be true that there's not
> always a right answer.
> 
> zone can overlap, but it can't be that you do pfn_to_page of a
> pfn_valid and you obtain a page whose zone doesn't contain that
> pfn. Which is what is currently crashing compaction.
> 
> I don't see how this is an unsolvable problem and why we should accept
> to live with a bogus page->flags for reserved pages.
> 

I said it‘s problematic, not unsolvable. Using a special zone/node is certainly easier - but might reveal some issues we have to fix - I guess? Fair enough.

>> We can't. The general rule is (as I was once told by Michal IIRC) that
> 
> The fact we can't kernel crash reliably when somebody uses the wrong
> 0,0 uninitialized value by not adding an explicit PageReserved check,
> is my primary concern in keeping those nodeid/nid uninitialized, but
> non-kernel-crashing, since it already created this unreproducible bug.

Agreed.

> 
>> I'm not rooting for "keep this at 0/0" - I'm saying that I think there
>> are corner cases where it might not be that easy.
> 
> I'm not saying it's easy. What I don't see is how you don't always
> have the right answer and why it would be an unsolvable problem.

„Problematic“ does not imply unsolvable.

> 
> It is certainly problematic and difficult to solve in the mem_map
> iniitalization logic, but to me having pfn_valid() &&
> page_zone(pfn_to_page(pfn)) randomly returning the DMA zone on first
> node also looks problematic and difficult to handle across all VM
> code, so overall it looks preferable to keep the complexity of the
> mem_map initialization self contained and not spilling over the rest
> of the VM.
> 
>> Yes, but there is a "Some of these" :)
>> 
>> Boot a VM with "-M 4000" and observe the memmap in the last section -
>> they won't get initialized a second time.
> 
> Is the beyond the end of the zone yet another case? I guess that's
> less likely to give us problems because it's beyond the end of the
> zone. Would pfn_valid return true for those pfn? If pfn_valid is not

Yes. Especially, exposed after memory hotplug when zone/nid span changes.

> true it's not really a concern but the again I'd rather prefer if
> those struct pages beyond the end of the zone were kernel crashing set
> to 0xff.
> 
> In other words I just don't see why we should ever prefer to leave
> some pages at a graceful and erroneous nid 0 nodeid 0 that wouldn't
> easily induce a crash if used.

I agree.

> 
>> AFAIK, the mem_map array might have multiple NIDs - and it's set when
>> initializing the zones.
> 
> Well because there's no mem_map array with SPARSEMEM, but it's not
> conceptually too different than if there was one. Even with flatmem
> there could be multiple page struct for each pfn, the disambiguation
> has to be handled by pfn_to_page regardless of SPARSEMEM or not.
> 
> The point is that if zone_page(pfn_to_page(pfn)) points to DMA zone of
> first node, and the pfn isn't part of the DMA of first node that looks
> a bug and it can be enforced it doesn't happen.
> 
>> Well, "reserved" is not a good indication "what" something actually is.
>> 
>> I documented that a while ago in include/linux/page-flags.h
>> 
>> "PG_reserved is set for special pages. The "struct page" of such a page
>> should in general not be touched (e.g. set dirty) except by its owner.
>> Pages marked as PG_reserved include:."
>> 
>> I suggest looking at that.
>> 
>> AFAIR, we have been setting *most* memmap in memory holes/non-ram
>> reserved for a long time - long before I added the __init_single_page -
>> see init_reserved_page() for example.
> 
> Sure, non-RAM with valid page struct always has been marked
> PG_reserved. I wasn't suggesting that it shouldn't be PG_reserved.
> 
> I was pointing out that RAM can also be marked PG_reserved later by
> the kernel, long after boot, as you mentioned for all other cases of
> PG_reserved, the most notable are drivers doing PG_reserved after
> allocating RAM either vmalloc or GART swapping RAM around at other
> alias physical address.
> 
> That is all born as RAM at boot, it gets page->flags done right, with
> the right zoneid, and it becomes PG_reserved later.
> 
> So I was suggesting physical ranges "pfn" of non-RAM (be those holes
> withtin zones, or in between zones doesn't matter) with a pfn_valid
> returning true and a pfn_to_page pointing deterministically to one and
> only one struct page, should have such struct page initialized exactly
> the same as if it was RAM.
> 
> Either that or we can define a new NO_ZONE NO_ID id and crash in
> page_zonenum or page_to_nid if it is ever called on such a page
> struct.

I feel like that is easier and maybe cleaner. Mark memmaps that exist but should be completely ignored. Could even check that in pfn_valid() and return „false“ - might be expensive, though.

Anyhow, I do agree that properly catching these problematic pages, bailing out and fixing them (however we decide) is the right approach.

> 
> Thanks,
> Andrea
Andrea Arcangeli Nov. 25, 2020, 9:38 p.m. UTC | #29
On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
> I think the very root cause is how e820__memblock_setup() registers
> memory with memblock:
> 
> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
> 
> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
> 
> 		memblock_add(entry->addr, entry->size);
> 
> From that point the system has inconsistent view of RAM in both
> memblock.memory and memblock.reserved and, which is then translated to
> memmap etc.
> 
> Unfortunately, simply adding all RAM to memblock is not possible as
> there are systems that for them "the addresses listed in the reserved
> range must never be accessed, or (as we discovered) even be reachable by
> an active page table entry" [1].
> 
> [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/

It looks like what's missing is a memblock_reserve call, which I don't
think would interfere at all with the issue above since it won't create
a direct mapping and it'll simply invoke the second stage that wasn't
invoked here.

I guess this would give the second initialization stage a better chance
to run in reserve_bootmem_region and it would likely solve the problem
without breaking E820_TYPE_RESERVED, which is known by the kernel:

> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
> 

+		if (entry->type == 20)
+			memblock_reserve(entry->addr, entry->size);

> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
> 

This is however just to show the problem, I didn't check what type 20
is.

To me it doesn't look like the root cause though: the root cause is that
if you don't call memblock_reserve the page->flags remain uninitialized.

I think page_alloc.c needs to be more robust and detect at least whether
holes within zones (but ideally all pfn_valid struct pages in the
system, even beyond the end of the zone) aren't being initialized
in the second stage, without relying on the arch code to remember to
call memblock_reserve.

In fact it's not clear why memblock_reserve even exists: that
information can be calculated reliably by page_alloc as a function of
memblock.memory alone, by walking all nodes and all zones. It doesn't
even seem to help in destroying the direct mapping;
reserve_bootmem_region just initializes the struct pages, so it doesn't
need a special memblock.reserved to find those ranges.
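
As a sketch of what "calculated from memblock.memory alone" could look
like, here is a hypothetical helper built only on existing primitives
(for_each_mem_pfn_range(), pfn_valid(), __init_single_page()); marking
the holes PG_reserved would be a further step on top:

	static void __init init_zone_holes(int nid, unsigned long zone,
					   unsigned long zone_start_pfn,
					   unsigned long zone_end_pfn)
	{
		unsigned long pfn, start_pfn, end_pfn;
		unsigned long next = zone_start_pfn;
		int i;

		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
			start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
			end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
			/* [next, start_pfn) is a hole inside the zone */
			for (pfn = next; pfn < start_pfn; pfn++)
				if (pfn_valid(pfn))
					__init_single_page(pfn_to_page(pfn),
							   pfn, zone, nid);
			next = max(next, end_pfn);
		}
		/* tail hole up to the end of the zone, if any */
		for (pfn = next; pfn < zone_end_pfn; pfn++)
			if (pfn_valid(pfn))
				__init_single_page(pfn_to_page(pfn), pfn,
						   zone, nid);
	}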

In fact it's scary that code then does stuff like this, trusting that
memblock.reserved is nearly complete information (which it obviously
isn't, given type 20 doesn't get queued, and I got that type 20 in all
my systems):

	for_each_reserved_mem_region(i, &start, &end) {
		if (addr >= start && addr_end <= end)
			return true;
	}

That code in irq-gic-v3-its.c should stop using
for_each_reserved_mem_region and instead check
pfn_valid(addr>>PAGE_SHIFT) and
PageReserved(pfn_to_page(addr>>PAGE_SHIFT)).
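
I.e. something along these lines (a sketch only, not a tested patch for
the its driver; the helper name is made up):

	/*
	 * Hypothetical replacement for the for_each_reserved_mem_region()
	 * walk: trust the struct page instead of memblock.reserved.
	 */
	static bool its_range_is_reserved(phys_addr_t addr, unsigned long size)
	{
		unsigned long pfn = PHYS_PFN(addr);
		unsigned long last_pfn = PHYS_PFN(addr + size - 1);

		for (; pfn <= last_pfn; pfn++)
			if (!pfn_valid(pfn) ||
			    !PageReserved(pfn_to_page(pfn)))
				return false;
		return true;
	}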

At best memblock.reserved should be calculated automatically by
page_alloc.c based on zone_start_pfn/zone_end_pfn and not passed in by
the e820 caller; instead of adding the memblock_reserve call for type
20 we should delete the memblock_reserve function.

Thanks,
Andrea
Andrea Arcangeli Nov. 26, 2020, 3:40 a.m. UTC | #30
On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote:
> pfn    physaddr   page->flags
> 500224 0x7a200000 0x1fff000000001000 reserved True
> 500225 0x7a201000 0x1fff000000001000 reserved True
> *snip*
> 500245 0x7a215000 0x1fff000000001000 reserved True
> 500246 0x7a216000 0x1fff000000001000 reserved True
> 500247 0x7a217000 0x3fff000000000000 reserved False
> 500248 0x7a218000 0x3fff000000000000 reserved False

The range is still this:

7a17b000-7a216fff : Unknown E820 type
7a217000-7ffdbfff : System RAM

This is with v5.3:

Interestingly the pages below 0x7a217000 weren't even marked reserved.

DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
DMA32    zone_start_pfn 4096         zone_end_pfn() 524252       contiguous 1     
Normal   zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     

500246 0x7a216000 0x1fff000000000000 reserved False
500247 0x7a217000 0x1fff000000022036 reserved False
500248 0x7a218000 0x1fff000000002032 reserved False

4b094b7851bf4bf551ad456195d3f26e1c03bd74 no change (because the
unknown e820 type 20 was still initialized by memmap_init_zone even
though it was not part of memblock.memory):

DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
DMA32    zone_start_pfn 4096         zone_end_pfn() 524252       contiguous 1     
Normal   zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     

500246 0x7a216000 0x1fff000000000000 reserved False
500247 0x7a217000 0x1fff000000022036 reserved False
500248 0x7a218000 0x1fff000000002032 reserved False

73a6e474cb376921a311786652782155eac2fdf0 this changed it:

DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
DMA32    zone_start_pfn 4096         zone_end_pfn() 524252       contiguous 0     
Normal   zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     

500246 0x7a216000 0xfff000000001000 reserved True
500247 0x7a217000 0x1fff000000000000 reserved False
500248 0x7a218000 0x1fff000000010200 reserved False

da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9
(73a6e474cb376921a311786652782155eac2fdf0^) no change:

DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1     
DMA32    zone_start_pfn 4096         zone_end_pfn() 524252       contiguous 1     
Normal   zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     

500246 0x7a216000 0x1fff000000000000 reserved False
500247 0x7a217000 0x1fff000000002032 reserved False
500248 0x7a218000 0x1fff000000002032 reserved False

So like Mike suggested this started in
73a6e474cb376921a311786652782155eac2fdf0, although the previous code
looked buggy too by not setting PageReserved when it should have.

It appears the pfns in the e820 type 20 range were initialized by
memmap_init_zone before commit
73a6e474cb376921a311786652782155eac2fdf0, and they're not initialized
anymore after 73a6e474cb376921a311786652782155eac2fdf0 because
memblock.memory no longer includes the e820 type 20 range, so
memmap_init_zone can't be called for it anymore.

So after 4b094b7851bf4bf551ad456195d3f26e1c03bd74 uninitialized pfns
get the 0,0 assignment (but even before
4b094b7851bf4bf551ad456195d3f26e1c03bd74 they would have gotten
completely zero page->flags; 4b094b7851bf4bf551ad456195d3f26e1c03bd74
is completely unrelated to this issue).

Mike, your patch alone doesn't help because nothing calls
memblock_reserve on the range starting at 0x7a17b000. I appended at
the end the output of memblock=debug.

Only after I apply the below hack on top of your patch are all signs
of the bug finally gone: PG_reserved is finally set (the old code was
wrong in not setting it) and the zoneid/nid is right (the new code
leaves it uninitialized and so it's wrong):

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 983cd53ed4c9..bbb5b4d7a117 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1303,6 +1303,8 @@ void __init e820__memblock_setup(void)
 
 		if (entry->type == E820_TYPE_SOFT_RESERVED)
 			memblock_reserve(entry->addr, entry->size);
+		if (entry->type == 20)
+			memblock_reserve(entry->addr, entry->size);
 
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
 			continue;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fb35fe6a9e4..2e2047282ff3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6145,7 +6145,9 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 {
 	unsigned long start_pfn, end_pfn;
 	unsigned long range_end_pfn = range_start_pfn + size;
+	phys_addr_t start, end;
 	int i;
+	u64 j;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
@@ -6157,6 +6159,19 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 					 MEMINIT_EARLY, NULL);
 		}
 	}
+
+	for_each_mem_range(j, &memblock.reserved, NULL, nid,  MEMBLOCK_NONE,
+			     &start, &end, NULL) {
+		start_pfn = clamp(PHYS_PFN(start), range_start_pfn,
+				  range_end_pfn);
+		end_pfn = clamp(PHYS_PFN(end), range_start_pfn, range_end_pfn);
+
+		if (end_pfn > start_pfn) {
+			size = end_pfn - start_pfn;
+			memmap_init_zone(size, nid, zone, start_pfn,
+					 MEMINIT_EARLY, NULL);
+		}
+	}
 }
 
 static int zone_batchsize(struct zone *zone)

Result with above patch all good:

500246 0x7a216000 0x1fff000000001000 reserved True pfn_valid True
500247 0x7a217000 0x1fff000000002032 reserved False pfn_valid True
500248 0x7a218000 0x1fff000000020036 reserved False pfn_valid True

Summary: both the old code (missing PG_reserved) and the current code
(leaving the page struct uninitialized and with a wrong zoneid/nid)
look wrong.

Overall this brings more questions:

- why memblock.reserved exists and isn't automatically calculated as
  inversion of memblock.memory? (or if you prefer: why is the above
  call to memblock_reserve needed to initialize the memory holes?)

- why there's no initialization of the memblock.reserved regions given
  they exist; was it just an oversight? (this one is fixed by Mike's
  patch, although I wish it were possible to drop the function
  memblock_reserve instead)

- what can we do, instead of setting the uninitialized zoneid/nid to
  0,0, to reliably detect at boot whether some page structure within
  zones (but ideally also outside the zone boundary, for any pfn where
  pfn_valid returns true) is left uninitialized, as is currently
  happening on the e820 type 20 range?

Thanks,
Andrea

[    0.006587] memblock_reserve: [0x00000000000f6bf0-0x00000000000f6bff] smp_scan_config+0xaa/0xee
[    0.006590] memblock_reserve: [0x00000000000f6c00-0x00000000000f6d7b] smp_scan_config+0xc1/0xee
[    0.006595] memblock_reserve: [0x0000000003400000-0x0000000003406fff] setup_arch+0x51b/0xb4e
[    0.006599] memblock_add: [0x0000000000001000-0x000000000009fbff] e820__memblock_setup+0x63/0x89
[    0.006614] memblock_add: [0x0000000000100000-0x000000007a17afff] e820__memblock_setup+0x63/0x89
[    0.006615] memblock_add: [0x000000007a217000-0x000000007ffdbfff] e820__memblock_setup+0x63/0x89
[    0.006627] memblock_reserve: [0x000000000009f000-0x00000000000fffff] setup_arch+0x53f/0xb4e
[    0.006634] memblock_reserve: [0x0000000000099000-0x000000000009efff] reserve_real_mode+0x6f/0x8d
[    0.006665] memblock_reserve: [0x0000000000000000-0x000000000000ffff] setup_arch+0x56b/0xb4e
[    0.006907] memblock_reserve: [0x000000007fdff000-0x000000007fdfffff] alloc_low_pages+0x16b/0x180
[    0.007219] memblock_reserve: [0x000000007ffd9000-0x000000007ffdbfff] memblock_alloc_range_nid+0xb1/0x11f
[    0.007245] memblock_alloc_try_nid: 16384 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 sparse_init+0x8c/0x2c9
[    0.007246] memblock_reserve: [0x000000007ffd5000-0x000000007ffd8fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.007250] memblock_alloc_try_nid: 4096 bytes align=0x40 nid=0 from=0x0000000000000000 max_addr=0x0000000000000000 sparse_index_alloc+0x44/0x67
[    0.007252] memblock_reserve: [0x000000007ffd4000-0x000000007ffd4fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.007260] memblock_alloc_try_nid: 640 bytes align=0x40 nid=0 from=0x0000000000000000 max_addr=0x0000000000000000 sparse_init_nid+0x3d/0x224
[    0.007262] memblock_reserve: [0x000000007ffd3d80-0x000000007ffd3fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.007265] memblock_alloc_exact_nid_raw: 33554432 bytes align=0x200000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 sparse_init_nid+0x8a/0x224
[    0.007266] memblock_reserve: [0x000000007dc00000-0x000000007fbfffff] memblock_alloc_range_nid+0xb1/0x11f
[    0.012443] memblock_alloc_try_nid_raw: 4096 bytes align=0x1000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 vmemmap_alloc_block_zero.constprop.0+0xc/0x24
[    0.012450] memblock_reserve: [0x000000007ffd2000-0x000000007ffd2fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.012453] memblock_alloc_try_nid_raw: 4096 bytes align=0x1000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 vmemmap_alloc_block_zero.constprop.0+0xc/0x24
[    0.012454] memblock_reserve: [0x000000007ffd1000-0x000000007ffd1fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.016975] memblock_alloc_try_nid: 73 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 acpi_parse_hpet+0xca/0x132
[    0.016983] memblock_reserve: [0x000000007ffd3d00-0x000000007ffd3d48] memblock_alloc_range_nid+0xb1/0x11f
[    0.016992] memblock_alloc_try_nid: 75 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 io_apic_init_mappings+0x38/0x1c0
[    0.016993] memblock_reserve: [0x000000007ffd3c80-0x000000007ffd3cca] memblock_alloc_range_nid+0xb1/0x11f
[    0.017012] memblock_alloc_try_nid: 768 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 e820__reserve_resources+0x2c/0x1be
[    0.017013] memblock_reserve: [0x000000007ffd3980-0x000000007ffd3c7f] memblock_alloc_range_nid+0xb1/0x11f
[    0.017017] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017019] memblock_reserve: [0x000000007ffd3900-0x000000007ffd3967] memblock_alloc_range_nid+0xb1/0x11f
[    0.017023] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017025] memblock_reserve: [0x000000007ffd3880-0x000000007ffd38e7] memblock_alloc_range_nid+0xb1/0x11f
[    0.017026] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017028] memblock_reserve: [0x000000007ffd3800-0x000000007ffd3867] memblock_alloc_range_nid+0xb1/0x11f
[    0.017029] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017031] memblock_reserve: [0x000000007ffd3780-0x000000007ffd37e7] memblock_alloc_range_nid+0xb1/0x11f
[    0.017032] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017033] memblock_reserve: [0x000000007ffd3700-0x000000007ffd3767] memblock_alloc_range_nid+0xb1/0x11f
[    0.017035] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017036] memblock_reserve: [0x000000007ffd3680-0x000000007ffd36e7] memblock_alloc_range_nid+0xb1/0x11f
[    0.017038] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017039] memblock_reserve: [0x000000007ffd3600-0x000000007ffd3667] memblock_alloc_range_nid+0xb1/0x11f
[    0.017041] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017042] memblock_reserve: [0x000000007ffd3580-0x000000007ffd35e7] memblock_alloc_range_nid+0xb1/0x11f
[    0.017044] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017045] memblock_reserve: [0x000000007ffd3500-0x000000007ffd3567] memblock_alloc_range_nid+0xb1/0x11f
[    0.017047] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017048] memblock_reserve: [0x000000007ffd3480-0x000000007ffd34e7] memblock_alloc_range_nid+0xb1/0x11f
[    0.017050] memblock_alloc_try_nid: 104 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 firmware_map_add_early+0x2a/0x52
[    0.017051] memblock_reserve: [0x000000007ffd3400-0x000000007ffd3467] memblock_alloc_range_nid+0xb1/0x11f
[    0.017069] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x7a/0xf7
[    0.017071] memblock_reserve: [0x000000007ffd33c0-0x000000007ffd33df] memblock_alloc_range_nid+0xb1/0x11f
[    0.017073] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x7a/0xf7
[    0.017075] memblock_reserve: [0x000000007ffd3380-0x000000007ffd339f] memblock_alloc_range_nid+0xb1/0x11f
[    0.017099] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x7a/0xf7
[    0.017101] memblock_reserve: [0x000000007ffd3340-0x000000007ffd335f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020380] memblock_alloc_try_nid: 70 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 start_kernel+0x110/0x57c
[    0.020382] memblock_reserve: [0x000000007ffd32c0-0x000000007ffd3305] memblock_alloc_range_nid+0xb1/0x11f
[    0.020383] memblock_alloc_try_nid: 70 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 start_kernel+0x124/0x57c
[    0.020385] memblock_reserve: [0x000000007ffd3240-0x000000007ffd3285] memblock_alloc_range_nid+0xb1/0x11f
[    0.020391] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_alloc_info+0x5e/0x8f
[    0.020392] memblock_reserve: [0x000000007ffd0000-0x000000007ffd0fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020394] memblock_alloc_try_nid: 4096 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_embed_first_chunk+0x6a/0x299
[    0.020396] memblock_reserve: [0x000000007ffcf000-0x000000007ffcffff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020397] memblock_alloc_try_nid: 2097152 bytes align=0x200000 nid=0 from=0x0000000001000000 max_addr=0x0000000000000000 pcpu_embed_first_chunk+0xd0/0x299
[    0.020399] memblock_reserve: [0x000000007da00000-0x000000007dbfffff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020584] memblock_free: [0x000000007da37000-0x000000007da3ffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020590] memblock_free: [0x000000007da77000-0x000000007da7ffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020595] memblock_free: [0x000000007dab7000-0x000000007dabffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020600] memblock_free: [0x000000007daf7000-0x000000007dafffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020606] memblock_free: [0x000000007db37000-0x000000007db3ffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020611] memblock_free: [0x000000007db77000-0x000000007db7ffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020616] memblock_free: [0x000000007dbb7000-0x000000007dbbffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020621] memblock_free: [0x000000007dbf7000-0x000000007dbfffff] pcpu_embed_first_chunk+0x1b8/0x299
[    0.020624] memblock_alloc_try_nid: 8 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x2f2/0x7ae
[    0.020625] memblock_reserve: [0x000000007ffd3200-0x000000007ffd3207] memblock_alloc_range_nid+0xb1/0x11f
[    0.020626] memblock_alloc_try_nid: 8 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x30f/0x7ae
[    0.020628] memblock_reserve: [0x000000007ffd31c0-0x000000007ffd31c7] memblock_alloc_range_nid+0xb1/0x11f
[    0.020629] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x345/0x7ae
[    0.020630] memblock_reserve: [0x000000007ffd3180-0x000000007ffd319f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020632] memblock_alloc_try_nid: 64 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x37f/0x7ae
[    0.020633] memblock_reserve: [0x000000007ffd3140-0x000000007ffd317f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020639] memblock_alloc_try_nid: 576 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_setup_first_chunk+0x638/0x7ae
[    0.020640] memblock_reserve: [0x000000007ffcedc0-0x000000007ffcefff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020645] memblock_alloc_try_nid: 144 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x7a/0x259
[    0.020646] memblock_reserve: [0x000000007ffd3080-0x000000007ffd310f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020648] memblock_alloc_try_nid: 384 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0xd0/0x259
[    0.020649] memblock_reserve: [0x000000007ffcec40-0x000000007ffcedbf] memblock_alloc_range_nid+0xb1/0x11f
[    0.020650] memblock_alloc_try_nid: 392 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x10f/0x259
[    0.020652] memblock_reserve: [0x000000007ffcea80-0x000000007ffcec07] memblock_alloc_range_nid+0xb1/0x11f
[    0.020653] memblock_alloc_try_nid: 96 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x12f/0x259
[    0.020654] memblock_reserve: [0x000000007ffd3000-0x000000007ffd305f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020657] memblock_alloc_try_nid: 144 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x7a/0x259
[    0.020658] memblock_reserve: [0x000000007ffce9c0-0x000000007ffcea4f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020660] memblock_alloc_try_nid: 1024 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0xd0/0x259
[    0.020661] memblock_reserve: [0x000000007ffce5c0-0x000000007ffce9bf] memblock_alloc_range_nid+0xb1/0x11f
[    0.020662] memblock_alloc_try_nid: 1032 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x10f/0x259
[    0.020664] memblock_reserve: [0x000000007ffce180-0x000000007ffce587] memblock_alloc_range_nid+0xb1/0x11f
[    0.020665] memblock_alloc_try_nid: 256 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 pcpu_alloc_first_chunk+0x12f/0x259
[    0.020667] memblock_reserve: [0x000000007ffce080-0x000000007ffce17f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020668] memblock_free: [0x000000007ffd0000-0x000000007ffd0fff] pcpu_embed_first_chunk+0x25a/0x299
[    0.020670] memblock_free: [0x000000007ffcf000-0x000000007ffcffff] pcpu_embed_first_chunk+0x287/0x299
[    0.020681] memblock_alloc_try_nid: 4096 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_large_system_hash+0x16b/0x25a
[    0.020682] memblock_reserve: [0x000000007ffd0000-0x000000007ffd0fff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020741] memblock_alloc_try_nid: 2097152 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_large_system_hash+0x16b/0x25a
[    0.020742] memblock_reserve: [0x000000007d800000-0x000000007d9fffff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020854] memblock_alloc_try_nid: 1048576 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_large_system_hash+0x16b/0x25a
[    0.020856] memblock_reserve: [0x000000007fece080-0x000000007ffce07f] memblock_alloc_range_nid+0xb1/0x11f
[    0.020891] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 spp_getpage+0x3d/0x70
[    0.020893] memblock_reserve: [0x000000007ffcf000-0x000000007ffcffff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020895] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 spp_getpage+0x3d/0x70
[    0.020896] memblock_reserve: [0x000000007fecd000-0x000000007fecdfff] memblock_alloc_range_nid+0xb1/0x11f
[    0.020898] memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 spp_getpage+0x3d/0x70
[    0.020900] memblock_reserve: [0x000000007fecc000-0x000000007feccfff] memblock_alloc_range_nid+0xb1/0x11f
Mike Rapoport Nov. 26, 2020, 9:36 a.m. UTC | #31
On Wed, Nov 25, 2020 at 04:38:16PM -0500, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
> > I think the very root cause is how e820__memblock_setup() registers
> > memory with memblock:
> > 
> > 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > 			memblock_reserve(entry->addr, entry->size);
> > 
> > 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > 			continue;
> > 
> > 		memblock_add(entry->addr, entry->size);
> > 
> > From that point the system has inconsistent view of RAM in both
> > memblock.memory and memblock.reserved and, which is then translated to
> > memmap etc.
> > 
> > Unfortunately, simply adding all RAM to memblock is not possible as
> > there are systems that for them "the addresses listed in the reserved
> > range must never be accessed, or (as we discovered) even be reachable by
> > an active page table entry" [1].
> > 
> > [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/
> 
> It looks like what's missing is a blockmem_reserve which I don't think
> would interfere at all with the issue above since it won't create
> direct mapping and it'll simply invoke the second stage that wasn't
> invoked here.
> 
> I guess this would have a better chance to have the second
> initialization stage run in reserve_bootmem_region and it would likely
> solve the problem without breaking E820_TYPE_RESERVED which is known
> by the kernel:
> 
> > 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > 			memblock_reserve(entry->addr, entry->size);
> > 
> 
> +		if (entry->type == 20)
> +			memblock_reserve(entry->addr, entry->size);
> 
> > 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > 			continue;
> > 
> 
> This is however just to show the problem, I didn't check what type 20
> is.

I think it's invented by your BIOS vendor :)

> To me it doesn't look the root cause though, the root cause is that if
> you don't call memblock_reserve the page->flags remains uninitialized.

I didn't mean that the root cause is that we don't call memblock_reserve().
I meant that the root cause is the inconsistency in the memory representation.

On most architectures, memblock.memory represents the entire RAM in the
system and memblock.reserved represents memory regions that were
reserved either by the firmware or by the kernel during early boot.

On x86 the memory that firmware reserved for its use is never considered
memory and some of the reserved memory types are never registered with
memblock at all.

As memblock data is used to initialize the memory map, we end up with
some page structs not being properly initialized.

> I think the page_alloc.c need to be more robust and detect at least if
> if holes within zones (but ideally all pfn_valid of all struct pages
> in system even if beyond the end of the zone) aren't being initialized
> in the second stage without relying on the arch code to remember to
> call memblock_reserve.

I agree that page_alloc.c needs to be more robust, but it anyway needs
to rely on some data supplied by the arch to know where valid memory is.
With SPARSEMEM, pfn_valid() only says where the memmap exists; there is
not necessarily an actual page frame behind a valid pfn.

> In fact it's not clear why memblock_reserve even exists, that
> information can be calculated reliably by page_alloc in function of
> memblock.memory alone by walking all nodes and all zones. It doesn't
> even seem to help in destroying the direct mapping,
> reserve_bootmem_region just initializes the struct pages so it doesn't
> need a special memeblock_reserved to find those ranges.

memblock_reserve() is there to allow architectures to mark memory
regions as busy so this memory won't be used by buddy as free pages. It
could be memory that firmware reported as reserved, memory occupied by
the kernel image and initrd, or the early memory allocations kernel does
before page allocator is up.
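
Just to illustrate the kind of thing memblock.reserved ends up tracking
(a rough, arch-agnostic sketch; the function name and the exact call
sites are made up for illustration only):

	/* sketch of typical early boot reservations */
	void __init early_reservations_sketch(void)
	{
		void *buf;

		/* keep the kernel image away from the buddy allocator */
		memblock_reserve(__pa_symbol(_text), _end - _text);

		/* early allocations implicitly reserve what they return */
		buf = memblock_alloc(PAGE_SIZE, SMP_CACHE_BYTES);
		if (!buf)
			panic("%s: early allocation failed\n", __func__);
	}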

> In fact it's scary that codes then does stuff like this trusting the
> memblock_reserve is nearly complete information (which obviously isn't
> given type 20 doesn't get queued and I got that type 20 in all my systems):
> 
> 	for_each_reserved_mem_region(i, &start, &end) {
> 		if (addr >= start && addr_end <= end)
> 			return true;
> 	}
> 
> That code in irq-gic-v3-its.c should stop using
> for_each_reserved_mem_region and start doing
> pfn_valid(addr>>PAGE_SHIFT) if
> PageReserved(pfn_to_page(addr>>PAGE_SHIFT)) instead.

I think that for coldplugged CPUs this code runs before the memmap is
set up, so pfn_valid() or PageReserved() are not yet available then.

> At best memory.reserved should be calculated automatically by the
> page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed
> by the e820 caller, instead of adding the memory_reserve call for type
> 20 we should delete the memory_reserve function.
 
memory.reserved cannot be calculated automatically. It represents all
the memory allocations made before the page allocator is up. And as
memblock_reserve() is the most basic way to allocate memory early at
boot, we cannot really delete it ;-)

As for e820 and type 20, unless it is in memblock, page_alloc.c has no
way to properly initialize the memmap for it. It can continue to guess,
like it does with init_unavailable_mem().

> Thanks,
> Andrea
>
David Hildenbrand Nov. 26, 2020, 10:05 a.m. UTC | #32
On 26.11.20 10:36, Mike Rapoport wrote:
> On Wed, Nov 25, 2020 at 04:38:16PM -0500, Andrea Arcangeli wrote:
>> On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
>>> I think the very root cause is how e820__memblock_setup() registers
>>> memory with memblock:
>>>
>>> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
>>> 			memblock_reserve(entry->addr, entry->size);
>>>
>>> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>>> 			continue;
>>>
>>> 		memblock_add(entry->addr, entry->size);
>>>
>>> From that point the system has inconsistent view of RAM in both
>>> memblock.memory and memblock.reserved and, which is then translated to
>>> memmap etc.
>>>
>>> Unfortunately, simply adding all RAM to memblock is not possible as
>>> there are systems that for them "the addresses listed in the reserved
>>> range must never be accessed, or (as we discovered) even be reachable by
>>> an active page table entry" [1].
>>>
>>> [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/
>>
>> It looks like what's missing is a blockmem_reserve which I don't think
>> would interfere at all with the issue above since it won't create
>> direct mapping and it'll simply invoke the second stage that wasn't
>> invoked here.
>>
>> I guess this would have a better chance to have the second
>> initialization stage run in reserve_bootmem_region and it would likely
>> solve the problem without breaking E820_TYPE_RESERVED which is known
>> by the kernel:
>>
>>> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
>>> 			memblock_reserve(entry->addr, entry->size);
>>>
>>
>> +		if (entry->type == 20)
>> +			memblock_reserve(entry->addr, entry->size);
>>
>>> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>>> 			continue;
>>>
>>
>> This is however just to show the problem, I didn't check what type 20
>> is.
> 
> I think it's inveneted by your BIOS vendor :)
> 
>> To me it doesn't look the root cause though, the root cause is that if
>> you don't call memblock_reserve the page->flags remains uninitialized.
> 
> I didn't mean that root cause is that we don't call memblock_reserve().
> I meant that the root cause is inconsitency in memory representation.
> 
> On most architectures, memblock.memory represents the entire RAM in the
> system and memblock.reserved represents memory regions that were
> reserved either by the firmware or by the kernel during early boot.
> 
> On x86 the memory that firmware reserved for its use is never considered
> memory and some of the reserved memory types are never registered with
> memblock at all.
> 
> As memblock data is used to initialize the memory map, we end up with
> some page structs not being properly initialized.
> 
>> I think the page_alloc.c need to be more robust and detect at least if
>> if holes within zones (but ideally all pfn_valid of all struct pages
>> in system even if beyond the end of the zone) aren't being initialized
>> in the second stage without relying on the arch code to remember to
>> call memblock_reserve.
> 
> I agree that page_alloc.c needs to be more robust, but it anyway needs
> to rely on some data supplied by arch to know where valid memory is.
> With SPARSMEM, pfn_valid() only says where memmap exists, it's not
> necessary there is an actual page frame behind a valid pfn.
> 
>> In fact it's not clear why memblock_reserve even exists, that
>> information can be calculated reliably by page_alloc in function of
>> memblock.memory alone by walking all nodes and all zones. It doesn't
>> even seem to help in destroying the direct mapping,
>> reserve_bootmem_region just initializes the struct pages so it doesn't
>> need a special memeblock_reserved to find those ranges.
> 
> memblock_reserve() is there to allow architectures to mark memory
> regions as busy so this memory won't be used by buddy as free pages. It
> could be memory that firmware reported as reserved, memory occupied by
> the kernel image and initrd, or the early memory allocations kernel does
> before page allocator is up.
> 
>> In fact it's scary that codes then does stuff like this trusting the
>> memblock_reserve is nearly complete information (which obviously isn't
>> given type 20 doesn't get queued and I got that type 20 in all my systems):
>>
>> 	for_each_reserved_mem_region(i, &start, &end) {
>> 		if (addr >= start && addr_end <= end)
>> 			return true;
>> 	}
>>
>> That code in irq-gic-v3-its.c should stop using
>> for_each_reserved_mem_region and start doing
>> pfn_valid(addr>>PAGE_SHIFT) if
>> PageReserved(pfn_to_page(addr>>PAGE_SHIFT)) instead.
> 
> I think that for coldpluged CPUs this code runs before memmap us set up,
> so pfn_valid() or PageReserved() are not yet available then.
> 
>> At best memory.reserved should be calculated automatically by the
>> page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed
>> by the e820 caller, instead of adding the memory_reserve call for type
>> 20 we should delete the memory_reserve function.
>  
> memory.reserved cannot be calculated automatically. It represents all
> the memory allocations made before page allocator is up. And as
> memblock_reserve() is the most basic to allocate memory early at boot we
> cannot really delete it ;-)
> 
> As for e820 and type 20, unless it is in memblock, page_alloc.c has no
> way to properly initialize memmap for it. It can continue to guess, like
> it does with init_unavailable_memory().

AFAIS, init_unavailable_mem() has the logic to set up *any* such struct
page properly - except the node/zid, because that's unknown at that
time. It is documented (big comment above init_unavailable_mem())
which pages won't get initialized again. They *are* initialized, just
the node/zone is "fake" - that's the real problem we seem to have.

I agree that this is sub-optimal, as such pages are impossible to detect
(PageReserved is just not clear as discussed with Andrea). The basic
question is how we want to proceed:

a) Make sure any online struct page has a valid nid/zid, and is spanned
by the nid/zid.
b) Use a fake nid that will bail out when used for page_zone() and
page_pgdat(), and make pfn walkers detect that.

AFAIU, Andrea seems to prefer a). I think b) might be easier in corner
cases. Thoughts?

---

The tricky thing for b) is to fix all call places that do a
page_zone()/page_pgdat() without checking if they are allowed to do so.
We would have to audit all callers of page_zone() / page_zone_id() /
page_pgdat() ...


E.g., using a pattern like
mm/memory_hotplug.c:find_biggest_section_pfn() is fine


if (unlikely(!pfn_to_online_page(pfn)))
	continue;
if (unlikely(pfn_to_nid(pfn) != nid))
	continue;


e.g., mm/page_alloc.c:__pageblock_pfn_to_page() would have to be adjusted

start_page = pfn_to_online_page(start_pfn);
if (!start_page)
	return NULL;

// necessary to avoid ...
if (page_to_nid(page) == NODE_INVALID)
	return NULL;

// page_zone() from bailing out
if (page_zone(start_page) != zone)
	return NULL;
Mike Rapoport Nov. 26, 2020, 10:43 a.m. UTC | #33
On Wed, Nov 25, 2020 at 10:40:20PM -0500, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote:
> 
> Summary: both old code (missing PG_reserved) and the current code
> (leaving the page struct uninitialized and with wrong nodeid/nid) look
> wrong.
> 
> Overall this brings more questions:
> 
> - why memblock.reserved exists and isn't automatically calculated as
>   inversion of memblock.memory? (or if you prefer: why is the above
>   call to memblock_reserve needed to initialize the memory holes?)

memblock.reserved represents memory areas that were allocated before the
page allocator is up. This could be memory reported by the firmware as
reserved, the kernel image, the initrd, and any memory the kernel
allocates before the page allocator is ready.

> - why there's no initialization of the memblock.reserved regions given
>   they exists, was it just an oversight? (this one is fixed by Mike's
>   patch, although I wish it was possible to drop the function
>   memblock_reserve instead)

It was an oversight when I posted
73a6e474cb376921a311786652782155eac2fdf0. I overlooked the fact that x86
does not consider memory allocated by the firmware as memory and that
only parts of it are considered reserved.

> - what can we do instead of setting the uninitialized nodeid/nid to
>   0,0 and to reliably detect at boot if some page structure within
>   zones (but ideally also outside the zone boundary for any pfn where
>   pfn_valid returns true) is left uninitialized, as it is happening
>   currently on the e820 type 20 range?

I think that we need to fix the e820 and memblock interaction in the
first place. The "non-RAM" holes reported as various types other than
E820_TYPE_RAM actually seem to be "RAM that the firmware grabbed for
itself", so they should be seen by the system as such, like on all other
architectures.
For the regions that cannot be mapped, like those on HPE UV (was it
SGI?) systems, we have MEMBLOCK_NOMAP exactly for that purpose.
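
(The usual pattern for such regions, just as a sketch, is to still
register them as memory but flag them - roughly how the arm64 EFI code
treats firmware regions the kernel must never touch - rather than
leaving them out of memblock.memory entirely:

	/* register the range as RAM but keep it out of the linear map */
	memblock_add(base, size);
	memblock_mark_nomap(base, size);

with base/size being the firmware-owned range.)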

Once this is done, we can straighten the logic around memmap
initialization in page_alloc, and I feel it will get simpler than today
in the end.

> Thanks,
> Andrea
Mel Gorman Nov. 26, 2020, 10:47 a.m. UTC | #34
On Wed, Nov 25, 2020 at 12:59:58PM -0500, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 10:30:53AM +0000, Mel Gorman wrote:
> > On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote:
> > > Hello,
> > > 
> > > On Tue, Nov 24, 2020 at 01:32:05PM +0000, Mel Gorman wrote:
> > > > I would hope that is not the case because they are not meant to overlap.
> > > > However, if the beginning of the pageblock was not the start of a zone
> > > > then the pages would be valid but the pfn would still be outside the
> > > > zone boundary. If it was reserved, the struct page is valid but not
> > > > suitable for set_pfnblock_flags_mask. However, it is a concern in
> > > > general because the potential is there that pages are isolated from the
> > > > wrong zone.
> > > 
> > > I guess we have more than one issue to correct in that function
> > > because the same BUG_ON reproduced again even with the tentative patch
> > > I posted earlier.
> > > 
> > > So my guess is that the problematic reserved page isn't pointed by the
> > > min_pfn, but it must have been pointed by the "highest" variable
> > > calculated below?
> > > 
> > > 			if (pfn >= highest)
> > > 				highest = pageblock_start_pfn(pfn);
> > > 
> > > When I looked at where "highest" comes from, it lacks
> > > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).
> > > 
> > > Is that the real bug, which may be fixed by something like this? (untested)
> > > 
> > 
> > It's plausible as it is a potential source of leaking but as you note
> > in another mail, it's surprising to me that valid struct pages, even if
> > within memory holes and reserved would have broken node/zone information
> > in the page flags.
> 
> I think the patch to add pageblock_pfn_to_page is still needed to cope
> with !pfn_valid or a pageblock in between zones, but pfn_valid or
> pageblock in between zones is not what happens here.
> 
> So the patch adding pageblock_pfn_to_page would have had the undesired
> side effect of hiding the bug so it's best to deal with the other bug
> first.
> 

Agreed. This thread has a lot of different directions in it at this
point so what I'd hope for is first, a patch that initialises holes with
zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
hole, it would be expected the pages are PageReserved. Second, a fix to
fast_isolate that forces PFNs returned to always be within the stated
zone boundaries.

The first is because there are assumptions that without HOLES_IN_ZONE, a
true pfn_valid within 1<<(MAX_ORDER-1) means pfn_valid would be true for
any PFN within that range. That assumption is relaxed in many cases --
e.g. the page allocator may not care at the moment because of how it
orders checks but compaction assumes that pfn_valid within a pageblock
means that all PFNs within that pageblock are valid.
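
For the second part, what I have in mind is along these lines (untested
sketch only, reusing the local variable names from fast_isolate_freepages
for illustration, not necessarily the final form of the fix):

	/* clamp the pageblock-aligned PFNs used by the fast search so
	 * they never fall below the zone being compacted */
	min_pfn = max(pageblock_start_pfn(min_pfn),
		      cc->zone->zone_start_pfn);
	highest = max(pageblock_start_pfn(highest),
		      cc->zone->zone_start_pfn);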
Mel Gorman Nov. 26, 2020, 10:51 a.m. UTC | #35
On Wed, Nov 25, 2020 at 03:42:37PM +0100, David Hildenbrand wrote:
> > Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with
> > memblock.memory and for x86 reserved ranges are not in memblock.memory,
> > so the memmap for them remains semi-initialized.
> 
> As far as I understood Mel, rounding these ranges up/down to cover full
> MAX_ORDER blocks/pageblocks might work.
> 

Yes, round down the lower end of the hole and round up the higher end to
the MAX_ORDER boundary for the purposes of having valid zone/node
linkages even if the underlying page is PageReserved.
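
i.e. roughly the following sketch, where hole_start_pfn/hole_end_pfn are
just placeholder names for the hole being initialised:

	/*
	 * Widen the hole to full MAX_ORDER blocks so every struct page
	 * in those blocks gets a consistent zone/node link, even if the
	 * pages themselves stay PageReserved.
	 */
	unsigned long nr = 1UL << (MAX_ORDER - 1);
	unsigned long spfn = round_down(hole_start_pfn, nr);
	unsigned long epfn = round_up(hole_end_pfn, nr);

	init_unavailable_range(spfn, epfn);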
Mike Rapoport Nov. 26, 2020, 5:46 p.m. UTC | #36
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> On 26.11.20 10:36, Mike Rapoport wrote:
> > On Wed, Nov 25, 2020 at 04:38:16PM -0500, Andrea Arcangeli wrote:
> > 
> >> At best memory.reserved should be calculated automatically by the
> >> page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed
> >> by the e820 caller, instead of adding the memory_reserve call for type
> >> 20 we should delete the memory_reserve function.
> >  
> > memory.reserved cannot be calculated automatically. It represents all
> > the memory allocations made before page allocator is up. And as
> > memblock_reserve() is the most basic to allocate memory early at boot we
> > cannot really delete it ;-)
> > 
> > As for e820 and type 20, unless it is in memblock, page_alloc.c has no
> > way to properly initialize memmap for it. It can continue to guess, like
> > it does with init_unavailable_memory().
> 
> AFAIS, init_unavailable_mem() has to logic to setup *any* such struct
> page properly up - except the node/zid, because that's unknown at that
> time. It is documented that (big comment above init_unavailable_mem())

// The comment is a bit inaccurate, but that's another story :)

> which pages won't get initialized again. They *are* initialized, just
> the node/zone is "fake" - that's the real problem we seem to have.

Let's try to merge init_unavailable_mem() into memmap_init().
Then it'll be able to set zone/nid for those nasty pfns that the BIOS
decided to keep to itself, like in Andrea's case, and it will also take
care of struct pages that do not really have a frame in DRAM, but are
there because of the arbitrary section size.

Something like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..072e94042a11 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6185,24 +6185,84 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	}
 }
 
-void __meminit __weak memmap_init(unsigned long size, int nid,
-				  unsigned long zone,
-				  unsigned long range_start_pfn)
+#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+/*
+ * Only struct pages that are backed by physical memory available to the
+ * kernel are zeroed and initialized by memmap_init_zone().
+ * But, there are some struct pages that are either reserved by firmware or
+ * do not correspond to physical page frames because actual memory bank is
+ * not a multiple of SECTION_SIZE. Fields of those struct pages may be
+ * accessed (for example page_to_pfn() on some configuration accesses
+ * flags) so we must explicitly initialize those struct pages.
+ */
+static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int node)
 {
-	unsigned long start_pfn, end_pfn;
+	unsigned long pfn;
+	u64 pgcnt = 0;
+
+	for (pfn = spfn; pfn < epfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
+			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+				+ pageblock_nr_pages - 1;
+			continue;
+		}
+		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
+		__SetPageReserved(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
+	return pgcnt;
+}
+#else
+static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int node)
+{
+	return 0;
+}
+#endif
+
+void __init __weak memmap_init(unsigned long size, int nid,
+			       unsigned long zone,
+			       unsigned long range_start_pfn)
+{
+	unsigned long start_pfn, end_pfn, next_pfn = 0;
 	unsigned long range_end_pfn = range_start_pfn + size;
+	u64 pgcnt = 0;
 	int i;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
 
 		if (end_pfn > start_pfn) {
 			size = end_pfn - start_pfn;
 			memmap_init_zone(size, nid, zone, start_pfn,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
+
+		if (next_pfn < start_pfn)
+			pgcnt += init_unavailable_range(next_pfn, start_pfn,
+							zone, nid);
+		next_pfn = end_pfn;
 	}
+
+	/*
+	 * Early sections always have a fully populated memmap for the whole
+	 * section - see pfn_valid(). If the last section has holes at the
+	 * end and that section is marked "online", the memmap will be
+	 * considered initialized. Make sure that memmap has a well defined
+	 * state.
+	 */
+	if (next_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+						zone, nid);
+
+	if (pgcnt)
+		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
+			zone_names[zone], pgcnt);
+
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -6995,88 +7055,6 @@ void __init free_area_init_memoryless_node(int nid)
 	free_area_init_node(nid);
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
-/*
- * Initialize all valid struct pages in the range [spfn, epfn) and mark them
- * PageReserved(). Return the number of struct pages that were initialized.
- */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
-{
-	unsigned long pfn;
-	u64 pgcnt = 0;
-
-	for (pfn = spfn; pfn < epfn; pfn++) {
-		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-				+ pageblock_nr_pages - 1;
-			continue;
-		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
-		__SetPageReserved(pfn_to_page(pfn));
-		pgcnt++;
-	}
-
-	return pgcnt;
-}
-
-/*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
- *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
- */
-static void __init init_unavailable_mem(void)
-{
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
-
-	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
-	 */
-	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
-		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
-		next = end;
-	}
-
-	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
-	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
-
-	/*
-	 * Struct pages that do not have backing memory. This could be because
-	 * firmware is using some of this memory, or for some other reasons.
-	 */
-	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
-}
-#else
-static inline void __init init_unavailable_mem(void)
-{
-}
-#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
-
 #if MAX_NUMNODES > 1
 /*
  * Figure out the number of possible node ids.
@@ -7500,7 +7478,6 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);


> I agree that this is sub-optimal, as such pages are impossible to detect
> (PageReserved is just not clear as discussed with Andrea). The basic
> question is how we want to proceed:
> 
> a) Make sure any online struct page has a valid nid/zid, and is spanned
> by the nid/zid.
> b) Use a fake nid that will bail out when used for page_zone() and
> page_pgdat(), and make pfn walkers detect that.
> 
> AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner
> cases. Thoughts?

I'd also prefer (a). 

The hardware defines what physical addresses correspond to which node,
so for any populated DIMM (or soldered DRAM for that matter) we can
detect the page <-> node relationship.

As for the struct pages that just "hang" at the end of a section (your
example with 4000M), the addresses of these pages still obey the same
rules, so again, we have a page <-> node correspondence.

The zones are a software construct but they also correspond to some
hardware-defined restrictions - each zone has a maximal PFN that the HW
allows. Here again we can always know which zone spans any given
page.

We'd have to fix a couple of things to get there, though :)

> ---
> 
> The tricky thing for b) is to fix all call places that do a
> page_zone()/page_pgdat() without checking if they are allowed to do so.
> We would have to audit all callers of page_zone() / page_zone_id() /
> page_pgdat() ...
> 
> 
> E.g., using a pattern like
> mm/memory_hotplug.c:find_biggest_section_pfn() is fine
> 
> 
> if (unlikely(!pfn_to_online_page(pfn)))
> 	continue;
> if (unlikely(pfn_to_nid(pfn) != nid))
> 	continue;
> 
> 
> e.g., mm/page_alloc.c:__pageblock_pfn_to_page() would have to be adjusted
> 
> start_page = pfn_to_online_page(start_pfn);
> if (!start_page)
> 	return NULL;
> 
> // necessary to avoid ...
> if (page_to_nid(page) == NODE_INVALID)
> 	return NULL;
> 
> // page_zone() from bailing out
> if (page_zone(start_page) != zone)
> 	return NULL;
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
Andrea Arcangeli Nov. 26, 2020, 6:15 p.m. UTC | #37
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> I agree that this is sub-optimal, as such pages are impossible to detect
> (PageReserved is just not clear as discussed with Andrea). The basic
> question is how we want to proceed:
> 
> a) Make sure any online struct page has a valid nid/zid, and is spanned
> by the nid/zid.
> b) Use a fake nid that will bail out when used for page_zone() and
> page_pgdat(), and make pfn walkers detect that.
> 
> AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner
> cases. Thoughts?

Yes that's a good summary.

The reason I slightly prefer a) is that it will perform faster at
runtime.

I also saw the proposed patch that embeds a pfn_to_page in pfn_valid to
report those pfns as invalid; that is especially slow and I don't like
it for that reason. Additional bugchecks for NO_NODEID will slow things
down too, so they should be justified by some benefit. It concerns me if
pfn_valid becomes slower.
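
To be explicit, this is the kind of per-call overhead I mean (just a
sketch, borrowing the NODE_INVALID name from David's earlier example):

	/* every caller would pay for an extra pfn_to_page + flags read */
	static inline bool pfn_valid_and_initialized(unsigned long pfn)
	{
		if (!pfn_valid(pfn))
			return false;
		return page_to_nid(pfn_to_page(pfn)) != NODE_INVALID;
	}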

b) will also make compaction bail out on that pageblock when it doesn't
need to, so it'll provide a worse runtime for compaction as well.

a) is what the code already does if only the e820 map range was reserved
with memblock_reserve, after applying Mike's patch to initialize
reserved memblock regions too.

It turns out the VM_BUG_ON_PAGE, as far as compaction is concerned, is
just a false positive: it detected a broken VM invariant, but it was
fine to proceed in the DEBUG_VM=n case, which wouldn't crash.

Before da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9 the struct page
corresponding to the e820 unknown type range wouldn't be marked
PageReserved; that also looked wrong, but it was also basically
harmless.

Neither of the two defects is too bad: it could be ignored if we just
removed the bugcheck, but it's just nice if the bugcheck turns out to be
correct for the pageblock.

If we initialize all RAM and non-RAM ranges in the same way, then
there's no discrepancy. Mike's patch seem to do just that by walking
the memblock.reserved memblocks in the same way as the
memblock.memory.

NOTE: I would also much prefer b) if only it guaranteed zero runtime
slowdown. (Which is why I especially dislike pfn_valid internally
calling pfn_to_page.)

Thanks,
Andrea
Andrea Arcangeli Nov. 26, 2020, 6:29 p.m. UTC | #38
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> memory.reserved cannot be calculated automatically. It represents all
> the memory allocations made before page allocator is up. And as
> memblock_reserve() is the most basic to allocate memory early at boot we
> cannot really delete it ;-)

Well this explanation totally covers "memory allocated early at
boot" that overlaps with memblock.memory.

Does the E820_TYPE_SOFT_RESERVED range added to memblock.reserved
qualify as "memory allocated early at boot"?

Does it overlap with any RAM range added to memblock.memory?

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);

To me the above looks like it's being used for something completely
different from reserving "memory allocated early at boot".

Why is there no warning at boot if there's no overlap between
memblock.reserved and memblock.memory?

My question about memblock.reserved is really about the non-overlapping
ranges: why are ranges that don't overlap with any memblock.memory
region added to memblock.reserved at all, and why aren't those
calculated automatically as the inverse of memblock.memory?
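
(i.e. the holes could be derived with something as simple as this
untested sketch, where init_nonram_range() is a made up name for
whatever ends up initializing those struct pages:

	phys_addr_t start, end, next = 0;
	u64 i;

	for_each_mem_range(i, &start, &end) {
		if (next < start)
			/* [next, start) isn't covered by memblock.memory */
			init_nonram_range(PFN_DOWN(next), PFN_UP(start));
		next = end;
	}

without any arch code having to pass the same information again through
memblock_reserve.)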

It's easy to see that when memblock.reserved overlaps fully, it makes
perfect sense and it has to stay for that. I was really only thinking of
usages of memblock_reserve like the above, which look like they should
be turned into a noop and deleted.

Thanks,
Andrea
Andrea Arcangeli Nov. 26, 2020, 7:21 p.m. UTC | #39
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> I think it's inveneted by your BIOS vendor :)

BTW, all systems I use on a daily basis have that type 20... Only two
of them are reproducing the VM_BUG_ON on a weekly basis on v5.9.

If you search 'E820 "type 20"' you'll get plenty of hits, so it's not
just me at least :).. In fact my guess is there are probably more
workstations/laptops with that type 20 than without. Maybe it only
shows up if booting with EFI?

Easy to check with `dmesg | grep "type 20"` after boot.

One guess why this wasn't frequently reproduced is that some desktop
distros are making the mistake of keeping THP enabled = madvise by
default, which reduces the overall compaction testing. Or maybe they're
not all setting DEBUG_VM=y (but some other distro I'm sure ships v5.9
with DEBUG_VM=y). Often I hit this bug in kcompactd0 for example, which
wouldn't happen with THP enabled=madvise.

The two bpf tracing tools below can show how the current defrag=madvise
default only increases the allocation latency from a few usec to a dozen
usec. Only with defrag=always does the latency go up to single digit
milliseconds, because of the cost of direct compaction, which is only
worth paying for MADV_HUGEPAGE ranges doing long-lived allocations (we
know by now that defrag=always was a suboptimal default).

https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp-comm.bp
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp.bp

Since 3917c80280c93a7123f1a3a6dcdb10a3ea19737d even apps like Redis
using fork for snapshotting purposes should prefer THP enabled (besides,
it would be better if they used uffd-wp as an alternative to fork).

3917c80280c93a7123f1a3a6dcdb10a3ea19737d also resolved another concern,
because the decade-old "fork() vs gup/O_DIRECT vs thread" race was
supposed to be unnoticeable from userland if the O_DIRECT min I/O
granularity was enforced to be >=PAGE_SIZE. However, with THP-backed
anon memory that minimum granularity requirement increases to
HPAGE_PMD_SIZE. Recent kernels are going in the direction of solving
that race by doing cow during fork, as was originally proposed a long
time ago
(https://lkml.kernel.org/r/20090311165833.GI27823@random.random), which
I suppose will solve the race with sub-PAGE_SIZE granularity too, but
3917c80280c93a7123f1a3a6dcdb10a3ea19737d alone is enough to reduce the
minimum I/O granularity requirement from HPAGE_PMD_SIZE to PAGE_SIZE,
as some userland may have expected. The best of course is to fully
prevent that race condition by setting MADV_DONTFORK on the regions
under O_DIRECT (as qemu does for example).

Overall the only tangible concern left is potential higher memory
usage for servers handling tiny object storage freed at PAGE_SIZE
granularity with MADV_DONTNEED (instead of having a way to copy and
defrag the virtual space of small objects through a callback that
updates the pointer to the object...).

Small object storage relying on jemalloc/tcmalloc for tiny object
management simply needs to selectively disable THP to avoid wasting
memory, either with MADV_NOHUGEPAGE or with the prctl
PR_SET_THP_DISABLE. Flipping a switch in the OCI schema also allows
disabling THP for those object storage apps making heavy use of
MADV_DONTNEED; not even a single line of code needs changing in the app
for it if deployed through the OCI container runtime.

Recent jemalloc uses MADV_NOHUGEPAGE. I didn't check exactly how it's
being used, but I hope it already does the right thing and separates the
small object arena, zapped with MADV_DONTNEED at PAGE_SIZE granularity,
from the large object arena where THP shall remain enabled. glibc also
should learn to separate small objects and big objects into different
arenas. This has to be handled by the app, like it is already handled
optimally in scylladb, which in fact invokes MADV_HUGEPAGE, or in plenty
of other databases that use not just THP but also hugetlbfs, which
certainly won't fly if MADV_DONTNEED is attempted at PAGE_SIZE
granularity.. or elasticsearch, which also gets a significant boost from
THP, etc.

Thanks,
Andrea
Mike Rapoport Nov. 26, 2020, 7:44 p.m. UTC | #40
On Thu, Nov 26, 2020 at 01:29:30PM -0500, Andrea Arcangeli wrote:
> On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> > memory.reserved cannot be calculated automatically. It represents all
> > the memory allocations made before page allocator is up. And as
> > memblock_reserve() is the most basic to allocate memory early at boot we
> > cannot really delete it ;-)
> 
> Well this explanation totally covers "memory allocated early at
> boot" that overlaps with memblock.memory.
> 
> Does the E820_TYPE_SOFT_RESERVED range added to memblock.reserve
> define as "memory allocated early at boot"?
> 
> Does it overlap ranges added with any RAM added to memblock.memory?
> 
> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
> 
> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
> 
> 		memblock_add(entry->addr, entry->size);
> 
> To me the above looks it's being used for something completely
> different than from reserving "memory allocated early at boot".
> 
> Why there is no warning at boot if there's no overlap between
> memblock.resereve and memblock.memory?

> My question about memblock.reserve is really about the non overlapping
> ranges: why are ranges non overlapping with memblock.memory regions,
> added to memblock.reserve, and why aren't those calculated
> automatically as reverse of memblock.memory?

Once there was this comment in arch/x86/kernel/e820.c:

	/*
	 * all !E820_TYPE_RAM ranges (including gap ranges) are put
	 * into memblock.reserved to make sure that struct pages in
	 * such regions are not left uninitialized after bootup.
	 */

I presume there were struct pages that corresponded to some unusable
memory and they were not initialized, so the solution was to add them to
memblock.reserved.

> It's easy to see that when memblock.reserve overlaps fully, it makes
> perfect sense and it has to stay for it. I was really only thinking at
> the usage like above of memblock_reserve that looks like it should be
> turned into a noop and deleted.

TBH, the whole interaction between e820 and memblock keeps me puzzled
and I can only make educated guesses why some ranges here are
memblock_reserve()'d and some memblock_add()ed.

I think what should be there is that e820 entries that are essentially
RAM, used by BIOS or not, should be listed in memblock.memory. Then
using memblock_reserve() for parts that BIOS claimed for itself would
have the same semantics as for memory allocated by kernel.

I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a
range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be
in memblock.reserved.

Then in page_alloc.c we'll know that we have a physical memory bank from
0 to 512M, but that there are some ranges that we cannot use.
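
In (untested, simplified) code the direction for e820__memblock_setup()
would be something like the below, where e820_entry_is_ram_backed() is a
made up helper for "this range is backed by actual DRAM":

	int i;

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		if (!e820_entry_is_ram_backed(entry))
			continue;

		/* all RAM, whether the firmware claimed it or not */
		memblock_add(entry->addr, entry->size);

		/* but only plain RAM may be handed to the kernel */
		if (entry->type != E820_TYPE_RAM &&
		    entry->type != E820_TYPE_RESERVED_KERN)
			memblock_reserve(entry->addr, entry->size);
	}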

I suggested it back when the issue with compaction was reported the
first time, but Baoquan mentioned that there are systems that cannot
even tolerate having BIOS reserved areas in the page tables, so I didn't
continue to pursue this.

Now I'm thinking of resurrecting this patch with some additions so that
init_mem_mapping() could skip such regions.

[1] https://lore.kernel.org/lkml/20200528090731.GI20045@MiWiFi-R3L-srv/#t


> Thanks,
> Andrea
>
Andrea Arcangeli Nov. 26, 2020, 8:30 p.m. UTC | #41
On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote:
> TBH, the whole interaction between e820 and memblock keeps me puzzled
> and I can only make educated guesses why some ranges here are
> memblock_reserve()'d and some memblock_add()ed.

The mixed usage in that interaction between memblock.reserved and
memblock.memory, where sometimes it's used to reserve overlapping
memblock.memory ranges (clearly ok), and sometimes it is used on ranges
with no overlap (it's not even clear why it's being called), is what
makes the current code messy.

We're basically passing down the exact same information (inverted),
through two different APIs, if there is no overlap.

> I think what should be there is that e820 entries that are essentially
> RAM, used by BIOS or not, should be listed in memblock.memory. Then
> using memblock_reserve() for parts that BIOS claimed for itself would
> have the same semantics as for memory allocated by kernel.
>
> I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a
> range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be
> in memblock.reserved.
> 
> Than in page_alloc.c we'll know that we have a physical memory bank from
> 0 to 512M but there are some ranges that we cannot use.
>
> I suggested it back then when the issue with compaction was reported at
> the first time, but Baoquan mentioned that there are systems that cannot
> even tolerate having BIOS reserved areas in the page tables and I didn't
> continue to pursue this.

That explains why we can't add the e820 non-RAM regions with
memblock_add; what's not clear is why it should be required to call
memblock_reserve on a region that has no overlap with any memblock_add.

Instead of the patch that adds a walk of memblock.reserved and that
requires adding even more memblock_reserve calls to e820__memblock_setup
for type 20, we can add a walk of the memblock.memory holes and then we
can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too.

Thanks,
Andrea
Mike Rapoport Nov. 26, 2020, 9:03 p.m. UTC | #42
On Thu, Nov 26, 2020 at 03:30:01PM -0500, Andrea Arcangeli wrote:
> On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote:
> > TBH, the whole interaction between e820 and memblock keeps me puzzled
> > and I can only make educated guesses why some ranges here are
> > memblock_reserve()'d and some memblock_add()ed.
> 
> The mixed usage in that interaction between memblock.reserve and
> memblock.memory where sometime it's used to reserve overlapping
> memblock.memory ranges (clearly ok), and sometimes is used on ranges
> with no overlap (not clear even why it's being called), is what makes
> the current code messy.
> 
> We're basically passing down the exact same information (inverted),
> through two different APIs, if there is no overlap.
> 
> > I think what should be there is that e820 entries that are essentially
> > RAM, used by BIOS or not, should be listed in memblock.memory. Then
> > using memblock_reserve() for parts that BIOS claimed for itself would
> > have the same semantics as for memory allocated by kernel.
> >
> > I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a
> > range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9ffff will be
> > in memblock.reserved.
> > 
> > Than in page_alloc.c we'll know that we have a physical memory bank from
> > 0 to 512M but there are some ranges that we cannot use.
> >
> > I suggested it back then when the issue with compaction was reported at
> > the first time, but Baoquan mentioned that there are systems that cannot
> > even tolerate having BIOS reserved areas in the page tables and I didn't
> > continue to pursue this.
> 
> That explains why we can't add the e820 non-RAM regions to
> memblock_add, what's not clear is why it should be required to call
> memblock_reserve on a region that has no overlap with any memblock_add.
> 
> Instead of the patch that adds a walk to the memblock.reserve and that
> requires adding even more memblock_reserve to e820__memblock_setup for
> type 20, we can add a walk for the memblock.memory holes and then we
> can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too.

This is more or less what I have done here:

https://lore.kernel.org/lkml/20201126174601.GT123287@linux.ibm.com/

just without the removal of the call to memblock_reserve() for
E820_TYPE_SOFT_RESERVED.


> Thanks,
> Andrea
>
Mike Rapoport Nov. 29, 2020, 12:32 p.m. UTC | #43
Hello Andrea,

On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> 
> Let's try to merge init_unavailable_memory() into memmap_init().
> Than it'll be able to set zone/nid for those nasty pfns that BIOS
> decided to keep to itself, like in Andrea's case and will also take care
> of struct pages that do not really have a frame in DRAM, but are there
> because of arbitrary section size.

Did you have a chance to check if this solves your issue?
If yes, I'll resend this as a formal patch.

> Something like this:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eaa227a479e4..072e94042a11 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6185,24 +6185,84 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  	}
>  }
>  
> -void __meminit __weak memmap_init(unsigned long size, int nid,
> -				  unsigned long zone,
> -				  unsigned long range_start_pfn)
> +#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> +/*
> + * Only struct pages that are backed by physical memory available to the
> + * kernel are zeroed and initialized by memmap_init_zone().
> + * But, there are some struct pages that are either reserved by firmware or
> + * do not correspond to physical page frames because actual memory bank is
> + * not a multiple of SECTION_SIZE. Fields of those struct pages may be
> + * accessed (for example page_to_pfn() on some configuration accesses
> + * flags) so we must explicitly initialize those struct pages.
> + */
> +static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
> +					 int zone, int node)
>  {
> -	unsigned long start_pfn, end_pfn;
> +	unsigned long pfn;
> +	u64 pgcnt = 0;
> +
> +	for (pfn = spfn; pfn < epfn; pfn++) {
> +		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> +			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> +				+ pageblock_nr_pages - 1;
> +			continue;
> +		}
> +		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
> +		__SetPageReserved(pfn_to_page(pfn));
> +		pgcnt++;
> +	}
> +
> +	return pgcnt;
> +}
> +#else
> +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
> +					 int zone, int node)
> +{
> +	return 0;
> +}
> +#endif
> +
> +void __init __weak memmap_init(unsigned long size, int nid,
> +			       unsigned long zone,
> +			       unsigned long range_start_pfn)
> +{
> +	unsigned long start_pfn, end_pfn, next_pfn = 0;
>  	unsigned long range_end_pfn = range_start_pfn + size;
> +	u64 pgcnt = 0;
>  	int i;
>  
>  	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>  		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> +		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
>  
>  		if (end_pfn > start_pfn) {
>  			size = end_pfn - start_pfn;
>  			memmap_init_zone(size, nid, zone, start_pfn,
>  					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>  		}
> +
> +		if (next_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(next_pfn, start_pfn,
> +							zone, nid);
> +		next_pfn = end_pfn;
>  	}
> +
> +	/*
> +	 * Early sections always have a fully populated memmap for the whole
> +	 * section - see pfn_valid(). If the last section has holes at the
> +	 * end and that section is marked "online", the memmap will be
> +	 * considered initialized. Make sure that memmap has a well defined
> +	 * state.
> +	 */
> +	if (next_pfn < range_end_pfn)
> +		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
> +						zone, nid);
> +
> +	if (pgcnt)
> +		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
> +			zone_names[zone], pgcnt);
> +
>  }
>  
>  static int zone_batchsize(struct zone *zone)
> @@ -6995,88 +7055,6 @@ void __init free_area_init_memoryless_node(int nid)
>  	free_area_init_node(nid);
>  }
>  
> -#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> -/*
> - * Initialize all valid struct pages in the range [spfn, epfn) and mark them
> - * PageReserved(). Return the number of struct pages that were initialized.
> - */
> -static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
> -{
> -	unsigned long pfn;
> -	u64 pgcnt = 0;
> -
> -	for (pfn = spfn; pfn < epfn; pfn++) {
> -		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> -			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> -				+ pageblock_nr_pages - 1;
> -			continue;
> -		}
> -		/*
> -		 * Use a fake node/zone (0) for now. Some of these pages
> -		 * (in memblock.reserved but not in memblock.memory) will
> -		 * get re-initialized via reserve_bootmem_region() later.
> -		 */
> -		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> -		__SetPageReserved(pfn_to_page(pfn));
> -		pgcnt++;
> -	}
> -
> -	return pgcnt;
> -}
> -
> -/*
> - * Only struct pages that are backed by physical memory are zeroed and
> - * initialized by going through __init_single_page(). But, there are some
> - * struct pages which are reserved in memblock allocator and their fields
> - * may be accessed (for example page_to_pfn() on some configuration accesses
> - * flags). We must explicitly initialize those struct pages.
> - *
> - * This function also addresses a similar issue where struct pages are left
> - * uninitialized because the physical address range is not covered by
> - * memblock.memory or memblock.reserved. That could happen when memblock
> - * layout is manually configured via memmap=, or when the highest physical
> - * address (max_pfn) does not end on a section boundary.
> - */
> -static void __init init_unavailable_mem(void)
> -{
> -	phys_addr_t start, end;
> -	u64 i, pgcnt;
> -	phys_addr_t next = 0;
> -
> -	/*
> -	 * Loop through unavailable ranges not covered by memblock.memory.
> -	 */
> -	pgcnt = 0;
> -	for_each_mem_range(i, &start, &end) {
> -		if (next < start)
> -			pgcnt += init_unavailable_range(PFN_DOWN(next),
> -							PFN_UP(start));
> -		next = end;
> -	}
> -
> -	/*
> -	 * Early sections always have a fully populated memmap for the whole
> -	 * section - see pfn_valid(). If the last section has holes at the
> -	 * end and that section is marked "online", the memmap will be
> -	 * considered initialized. Make sure that memmap has a well defined
> -	 * state.
> -	 */
> -	pgcnt += init_unavailable_range(PFN_DOWN(next),
> -					round_up(max_pfn, PAGES_PER_SECTION));
> -
> -	/*
> -	 * Struct pages that do not have backing memory. This could be because
> -	 * firmware is using some of this memory, or for some other reasons.
> -	 */
> -	if (pgcnt)
> -		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
> -}
> -#else
> -static inline void __init init_unavailable_mem(void)
> -{
> -}
> -#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
> -
>  #if MAX_NUMNODES > 1
>  /*
>   * Figure out the number of possible node ids.
> @@ -7500,7 +7478,6 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	init_unavailable_mem();
>  	for_each_online_node(nid) {
>  		pg_data_t *pgdat = NODE_DATA(nid);
>  		free_area_init_node(nid);
> 
> 
> > I agree that this is sub-optimal, as such pages are impossible to detect
> > (PageReserved is just not clear as discussed with Andrea). The basic
> > question is how we want to proceed:
> > 
> > a) Make sure any online struct page has a valid nid/zid, and is spanned
> > by the nid/zid.
> > b) Use a fake nid that will bail out when used for page_zone() and
> > page_pgdat(), and make pfn walkers detect that.
> > 
> > AFAIU, Andrea seems to prefer a). I think b) might be easier in corner
> > cases. Thoughts?
> 
> I'd also prefer (a). 
> 
> The hardware defines what physical addresses correspond to which node,
> so for any populated DIMM (or soldered DRAM for that matter) we can
> detect page <-> node relationship.
> 
> As for the struct pages that just "hang" in the end of a section (your
> example with 4000M), the addresses of these pages still obey the same
> rules, so again, we have page <-> node correspondence.
> 
> The zones are a software construct but they also correspond to some
> hardware defined restrictions - each zone has a maximal PFN that HW
> allows. Here again we can always know which zone spans a given
> page.
> 
> We'd have to fix a couple of things to get there, though :)
> 
> > ---
> > 
> > The tricky thing for b) is to fix all call places that do a
> > page_zone()/page_pgdat() without checking if they are allowed to do so.
> > We would have to audit all callers of page_zone() / page_zone_id() /
> > page_pgdat() ...
> > 
> > 
> > E.g., using a pattern like
> > mm/memory_hotplug.c:find_biggest_section_pfn() is fine
> > 
> > 
> > if (unlikely(!pfn_to_online_page(pfn)))
> > 	continue;
> > if (unlikely(pfn_to_nid(pfn) != nid))
> > 	continue;
> > 
> > 
> > e.g., mm/page_alloc.c:__pageblock_pfn_to_page() would have to be adjusted
> > 
> > start_page = pfn_to_online_page(start_pfn);
> > if (!start_page)
> > 	return NULL;
> > 
> > // necessary to avoid ...
> > if (page_to_nid(page) == NODE_INVALID)
> > 	return NULL;
> > 
> > // page_zone() from bailing out
> > if (page_zone(start_page) != zone)
> > 	return NULL;
> > 
> > -- 
> > Thanks,
> > 
> > David / dhildenb
> > 
> 
> -- 
> Sincerely yours,
> Mike.
Andrea Arcangeli Dec. 2, 2020, 12:44 a.m. UTC | #44
Hello Mike,

On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> Hello Andrea,
> 
> On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> > 
> > Let's try to merge init_unavailable_memory() into memmap_init().
> > Then it'll be able to set zone/nid for those nasty pfns that BIOS
> > decided to keep to itself, like in Andrea's case and will also take care
> > of struct pages that do not really have a frame in DRAM, but are there
> > because of arbitrary section size.
> 
> Did you have a chance to check if this solves your issue?
> If yes, I'll resend this as a formal patch.

I tested the patch you sent, but it doesn't seem to boot. Reverting it
booted.

Also I noticed leaving pages uninitialized already happened before and
was fixed in 124049decbb121ec32742c94fb5d9d6bed8f24d8 where literally
all holes were registered in memblock_reserve() by hand to work around
this very same defect in the memblock callee that we're fixing again.

Then it was lifted in 9fd61bc95130d4971568b89c9548b5e0a4e18e0e since
the memblock was supposed to be able to initialize all pages.

Since this seems the second time this happens, I'd suggest to change
the graceful pfn 0,0 initialization to memset(page, 0xff,
sizeof(struct page)) too, like we mentioned earlier, and then have at
least a DEBUG_SOMETHING=y option that searches for struct pages with
all 1s in page->flags, to ensure the boot stage worked right, so
there's a graceful notification at boot before a VM_BUG_ON triggers
later. The page struct validation could be done with DEBUG_VM=y too
since it won't cause any runtime cost post boot.
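
Just to sketch the idea (the function name and the DEBUG_SOMETHING
Kconfig option are made up here, this is not existing code, only an
illustration of the kind of scan I mean):

static void __init memmap_check_poison(void)
{
	unsigned long pfn;

	for (pfn = 0; pfn < max_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		/* PagePoisoned() is true while page->flags is still all 1s */
		if (PagePoisoned(page))
			pr_warn("pfn %lu: struct page left uninitialized\n",
				pfn);
	}
}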

Thanks,
Andrea
Mike Rapoport Dec. 2, 2020, 5:39 p.m. UTC | #45
Hello Andrea,

On Tue, Dec 01, 2020 at 07:44:51PM -0500, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> > Hello Andrea,
> > 
> > On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> > > 
> > > Let's try to merge init_unavailable_memory() into memmap_init().
> > > Then it'll be able to set zone/nid for those nasty pfns that BIOS
> > > decided to keep to itself, like in Andrea's case and will also take care
> > > of struct pages that do not really have a frame in DRAM, but are there
> > > because of arbitrary section size.
> > 
> > Did you have a chance to check if this solves your issue?
> > If yes, I'll resend this as a formal patch.
> 
> I tested the patch you sent, but it doesn't seem to boot. Reverting it
> booted.

Hmm, do you have any logs? It worked on my box and with various memory
configurations in qemu.

> Also I noticed leaving pages uninitialized already happened before and
> was fixed in 124049decbb121ec32742c94fb5d9d6bed8f24d8 where literally
> all holes were registered in memblock_reserve() by hand to work around
> this very same defect in the memblock callee that we're fixing again.
>
> Then it was lifted in 9fd61bc95130d4971568b89c9548b5e0a4e18e0e since
> the memblock was supposed to be able to initialize all pages.

I still think that both 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved") and 9fd61bc95130 ("Revert "x86/e820:
put !E820_TYPE_RAM regions into memblock.reserved") are band aids and
they do not address the root cause but rather try to work around it.

The problem is that we have no proper way to initialize struct pages
corresponding to holes. The holes can appear in case when the memory is
not a multiple of SECTION_SIZE and, at least on x86, in case BIOS
reserves pages and they are not E820_TYPE_RAM.

The first case is not much of a problem, as I doubt there are still
systems whose memory banks are not MAX_ORDER aligned, so the "trailing"
pages in a semi-empty section will never be used by the page allocator.

The BIOS case, however is more problematic as it creates holes inside
MAX_ORDER aligned pageblocks.

Pushing the pages in such holes in and out of memblock.reserved won't
really help because they are not in memblock.memory while most
calculations during initialization of page allocator and memory map use
memblock.memory and iterate over it.

As these holes do not seem to appear on zone/node boundaries, we do not
see reports about wrong zone/node sizing because of the holes.

I believe that the proper solution would be to have all the memory
reserved by the firmware added to memblock.memory, like all other
architectures do, but I'm not sure if we can easily and reliably
determine what of E820 reserved types are actually backed by RAM.
So this is not feasible as a short term solution.

My patch [1], though, is an immediate improvement and I think it's worth
trying to fix off-by-one's that prevent your system from booting.

[1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org

> Since this seems the second time this happens, I'd suggest to change
> the graceful pfn, 0,0 initialization to memset(page, 0xff,
> sizeof(struct page)) too like we mentioned earlier and then have at
> least a DEBUG_SOMETHING=y to search for struct pages with all 1 in
> page->flags to ensure the boot stage worked right so perhaps there's a
> graceful notification at boot before a VM_BUG_ON triggers later. The
> page struct validation could be done based on DEBUG_VM=y too since it
> won't cause any runtime cost post boot.
>
> Thanks,
> Andrea
>
Andrea Arcangeli Dec. 3, 2020, 6:23 a.m. UTC | #46
On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote:
> Hmm, do you have any logs? It worked on my box and with various memory
> configurations in qemu.

It crashes in reserve_bootmem_region so no logs at all since the
serial console isn't active yet.

> I believe that the proper solution would be to have all the memory
> reserved by the firmware added to memblock.memory, like all other
> architectures do, but I'm not sure if we can easily and reliably
> determine what of E820 reserved types are actually backed by RAM.
> So this is not feasible as a short term solution.
> 
> My patch [1], though, is an immediate improvement and I think it's worth
> trying to fix off-by-one's that prevent your system from booting.
> 
> [1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org

Yes that's the patch I applied.

Relevant points after debugging:

1) can you try again with DEBUG_VM=y?

2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct
   page)) on all struct pages allocated, exactly what I was suggesting to
   do in the previous email. I wonder why we're not doing that even
   with DEBUG_VM=n, perhaps it's too slow for TiB systems. See
   page_init_poison().

3) given 2), your patch below by deleting "0,0" initialization
   achieves the debug feature to keep PagePoison forever on all
   uninitialized pages, imagine PagePoison is really
   NO_NODEID/NO_ZONEID and doesn't need handling other than a crash.
  
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);

4) because of the feature in 3) I now hit the PagePoison VM_BUG_ON
   because pfn 0 wasn't initialized when reserve_bootmem_region tries
   to call __SetPageReserved on a PagePoison page.

5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
   memblock.reserve that doesn't overlap any memblock.memory zone.

   The memblock_start_of_DRAM moves the start of the DMA zone above
   the pfn 0, so then pfn 0 already ends up in the no zone land David
   mentioned where it's not trivial to decide to give it a zoneid if
   it's not even spanned in the zone.

   Not just the zoneid, there's not even a nid.

   So we have a pfn with no zoneid, and no nid, and not part of the
   zone ranges, not part of the nid ranges not part of the
   memory.memblock.

   We can't really skip the initialization of the pfn 0, it has to
   get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
   we'd crash again. (although we'd crash with a nice PagePoison(page)
   == true behavior)

The below fixes the issue and it boots fine again and the compaction
crash should be solved with both patches applied.

DMA      zone_start_pfn 0            zone_end_pfn() 4096         contiguous 1     
DMA32    zone_start_pfn 4096         zone_end_pfn() 524252       contiguous 1     
Normal   zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0     

500246 0x7a216000 0x1fff000000001000 reserved True pfn_valid True
500247 0x7a217000 0x1fff000000000000 reserved False pfn_valid True
500248 0x7a218000 0x1fff000000000000 reserved False pfn_valid True

==== quote from previous email ====
73a6e474cb376921a311786652782155eac2fdf0 this changed it:

DMA      zone_start_pfn 1            zone_end_pfn() 4096         contiguous 1
DMA32    zone_start_pfn 4096         zone_end_pfn() 524252       contiguous 0
Normal   zone_start_pfn 0            zone_end_pfn() 0            contiguous 0
Movable  zone_start_pfn 0            zone_end_pfn() 0            contiguous 0

500246 0x7a216000 0xfff000000001000 reserved True
500247 0x7a217000 0x1fff000000000000 reserved False
500248 0x7a218000 0x1fff000000010200 reserved False ]
==== quote from previous email ====

From 89469f063c192ae70aea0bd6e1e2a7e99894e5b6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Wed, 2 Dec 2020 23:23:26 -0500
Subject: [PATCH 1/1] mm: initialize struct pages in reserved regions outside
 of the zone ranges

pfn 0 wasn't initialized and the PagePoison remained set when
reserve_bootmem_region() called __SetPageReserved, inducing a silent
boot failure with DEBUG_VM (and correctly so, because the crash
signaled the zoneid/nid of pfn 0 would be again wrong).

Without this change, pfn 0, being part of a memblock.reserved range,
isn't in any zone spanned range, nor in any nid spanned range; when we
initialize the memblock.memory holes, pfn 0 won't be included because
it has no nid and no zoneid.

There's no enforcement that all memblock.reserved ranges must overlap
memblock.memory ranges, so the memblock.reserved ranges also require
an explicit initialization and the zones and nid ranges need to be
extended to include all memblock.reserved ranges with struct pages, or
they'll be left uninitialized with PagePoison as it happened to pfn 0.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/memblock.h | 17 +++++++++---
 mm/debug.c               |  3 ++-
 mm/memblock.c            |  4 +--
 mm/page_alloc.c          | 57 ++++++++++++++++++++++++++++++++--------
 4 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ef131255cedc..c8e30cd69564 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -251,7 +251,8 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
-			  unsigned long *out_end_pfn, int *out_nid);
+			  unsigned long *out_end_pfn, int *out_nid,
+			  struct memblock_type *type);
 
 /**
  * for_each_mem_pfn_range - early memory pfn range iterator
@@ -263,9 +264,17 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
  *
  * Walks over configured memory ranges.
  */
-#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
-	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
-	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		  \
+	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.memory);		  \
+	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.memory))
+
+#define for_each_res_pfn_range(i, nid, p_start, p_end, p_nid)		  \
+	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.reserved);		  \
+	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.reserved))
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
diff --git a/mm/debug.c b/mm/debug.c
index ccca576b2899..6a1d534f5ffc 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -64,7 +64,8 @@ void __dump_page(struct page *page, const char *reason)
 	 * dump_page() when detected.
 	 */
 	if (page_poisoned) {
-		pr_warn("page:%px is uninitialized and poisoned", page);
+		pr_warn("page:%px pfn:%ld is uninitialized and poisoned",
+			page, page_to_pfn(page));
 		goto hex_only;
 	}
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b68ee86788af..3964a5e8914f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1198,9 +1198,9 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid,
  */
 void __init_memblock __next_mem_pfn_range(int *idx, int nid,
 				unsigned long *out_start_pfn,
-				unsigned long *out_end_pfn, int *out_nid)
+				unsigned long *out_end_pfn, int *out_nid,
+				struct memblock_type *type)
 {
-	struct memblock_type *type = &memblock.memory;
 	struct memblock_region *r;
 	int r_nid;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce2bdaabdf96..3eed49598d66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1458,6 +1458,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 {
 	pg_data_t *pgdat;
 	int nid, zid;
+	bool found = false;
 
 	if (!early_page_uninitialised(pfn))
 		return;
@@ -1468,10 +1469,15 @@ static void __meminit init_reserved_page(unsigned long pfn)
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];
 
-		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) {
+			found = true;
 			break;
+		}
 	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	if (likely(found))
+		__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	else
+		WARN_ON_ONCE(1);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -6227,7 +6233,7 @@ void __init __weak memmap_init(unsigned long size, int nid,
 			       unsigned long zone,
 			       unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn, next_pfn = 0;
+	unsigned long start_pfn, end_pfn, prev_pfn = 0;
 	unsigned long range_end_pfn = range_start_pfn + size;
 	u64 pgcnt = 0;
 	int i;
@@ -6235,7 +6241,7 @@ void __init __weak memmap_init(unsigned long size, int nid,
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
-		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
+		prev_pfn = clamp(prev_pfn, range_start_pfn, range_end_pfn);
 
 		if (end_pfn > start_pfn) {
 			size = end_pfn - start_pfn;
@@ -6243,10 +6249,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
 
-		if (next_pfn < start_pfn)
-			pgcnt += init_unavailable_range(next_pfn, start_pfn,
+		if (prev_pfn < start_pfn)
+			pgcnt += init_unavailable_range(prev_pfn, start_pfn,
 							zone, nid);
-		next_pfn = end_pfn;
+		prev_pfn = end_pfn;
 	}
 
 	/*
@@ -6256,12 +6262,31 @@ void __init __weak memmap_init(unsigned long size, int nid,
 	 * considered initialized. Make sure that memmap has a well defined
 	 * state.
 	 */
-	if (next_pfn < range_end_pfn)
-		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+	if (prev_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(prev_pfn, range_end_pfn,
 						zone, nid);
 
+	/*
+	 * memblock.reserved isn't enforced to overlap with
+	 * memblock.memory so initialize the struct pages for
+	 * memblock.reserved too in case it wasn't overlapping.
+	 *
+	 * If any struct page associated with a memblock.reserved
+	 * range isn't overlapping with a zone range, it'll be left
+	 * uninitialized, ideally with PagePoison, and it'll be a more
+	 * easily detectable error.
+	 */
+	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+
+		if (end_pfn > start_pfn)
+			pgcnt += init_unavailable_range(start_pfn, end_pfn,
+							zone, nid);
+	}
+
 	if (pgcnt)
-		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
+		pr_info("%s: pages in unavailable ranges: %lld\n",
 			zone_names[zone], pgcnt);
 }
 
@@ -6499,6 +6524,10 @@ void __init get_pfn_range_for_nid(unsigned int nid,
 		*start_pfn = min(*start_pfn, this_start_pfn);
 		*end_pfn = max(*end_pfn, this_end_pfn);
 	}
+	for_each_res_pfn_range(i, nid, &this_start_pfn, &this_end_pfn, NULL) {
+		*start_pfn = min(*start_pfn, this_start_pfn);
+		*end_pfn = max(*end_pfn, this_end_pfn);
+	}
 
 	if (*start_pfn == -1UL)
 		*start_pfn = 0;
@@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignment(void)
  */
 unsigned long __init find_min_pfn_with_active_regions(void)
 {
-	return PHYS_PFN(memblock_start_of_DRAM());
+	/*
+	 * reserved regions must be included so that their page
+	 * structure can be part of a zone and obtain a valid zoneid
+	 * before __SetPageReserved().
+	 */
+	return min(PHYS_PFN(memblock_start_of_DRAM()),
+		   PHYS_PFN(memblock.reserved.regions[0].base));
 }
 
 /*
Mike Rapoport Dec. 3, 2020, 10:51 a.m. UTC | #47
On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote:
> > Hmm, do you have any logs? It worked on my box and with various memory
> > configurations in qemu.
> 
> It crashes in reserve_bootmem_region so no logs at all since the
> serial console isn't active yet.
> 
> > I believe that the proper solution would be to have all the memory
> > reserved by the firmware added to memblock.memory, like all other
> > architectures do, but I'm not sure if we can easily and reliably
> > determine what of E820 reserved types are actually backed by RAM.
> > So this is not feasible as a short term solution.
> > 
> > My patch [1], though, is an immediate improvement and I think it's worth
> > trying to fix off-by-one's that prevent your system from booting.
> > 
> > [1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org
> 
> Yes that's the patch I applied.
> 
> Relevant points after debugging:
> 
> 1) can you try again with DEBUG_VM=y?

Mea culpa :)

> 2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct
>    page)) on all struct pages allocated, exactly what I was suggesting to
>    do in the previous email. I wonder why we're not doing that even
>    with DEBUG_VM=n, perhaps it's too slow for TiB systems. See
>    page_init_poison().
> 
> 3) given 2), your patch below by deleting "0,0" initialization
>    achieves the debug feature to keep PagePoison forever on all
>    uninitialized pages, imagine PagePoison is really
>    NO_NODEID/NO_ZONEID and doesn't need handling other than a crash.
>   
> -		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> 
> 4) because of the feature in 3) I now hit the PagePoison VM_BUG_ON
>    because pfn 0 wasn't initialized when reserve_bootmem_region tries
>    to call __SetPageReserved on a PagePoison page.

So this is the off-by-one I was talking about. My patch removed
initialization of the hole before the start of memblock.memory.

> 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
>    memblock.reserve that doesn't overlap any memblock.memory zone.

And, IMHO, this is the fundamental BUG.

There is RAM at address 0, there is a DIMM there, so why on earth this
should not be a part of memblock.memory?

>    The memblock_start_of_DRAM moves the start of the DMA zone above
>    the pfn 0, so then pfn 0 already ends up in the no zone land David
>    mentioned where it's not trivial to decide to give it a zoneid if
>    it's not even spanned in the zone.
> 
>    Not just the zoneid, there's not even a nid.
> 
>    So we have a pfn with no zoneid, and no nid, and not part of the
>    zone ranges, not part of the nid ranges not part of the
>    memory.memblock.

We have a pfn that should have been in memblock.memory, in ZONE_DMA and
in the first node with memory. If it was not trimmed from there 

> >    We can't really skip the initialization of the pfn 0, it has to
>    get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
>    we'd crash again. (although we'd crash with a nice PagePoison(page)
>    == true behavior)

Agree. Struct page for pfn 0 should get the same zoneid and nodeid as pfn 1.

For me the below fixup worked (both with and without DEBUG_VM=y).

From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Thu, 3 Dec 2020 11:40:17 +0200
Subject: [PATCH] fixup for "mm: refactor initialization of struct page for holes"

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/page_alloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce2bdaabdf96..86fde4424e87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
 			       unsigned long zone,
 			       unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn, next_pfn = 0;
+	static unsigned long hole_start_pfn;
+	unsigned long start_pfn, end_pfn;
 	unsigned long range_end_pfn = range_start_pfn + size;
 	u64 pgcnt = 0;
 	int i;
@@ -6235,7 +6236,6 @@ void __init __weak memmap_init(unsigned long size, int nid,
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
-		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
 
 		if (end_pfn > start_pfn) {
 			size = end_pfn - start_pfn;
@@ -6243,10 +6243,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
 
-		if (next_pfn < start_pfn)
-			pgcnt += init_unavailable_range(next_pfn, start_pfn,
-							zone, nid);
-		next_pfn = end_pfn;
+		if (hole_start_pfn < start_pfn)
+			pgcnt += init_unavailable_range(hole_start_pfn,
+							start_pfn, zone, nid);
+		hole_start_pfn = end_pfn;
 	}
 
 	/*
@@ -6256,8 +6256,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
 	 * considered initialized. Make sure that memmap has a well defined
 	 * state.
 	 */
-	if (next_pfn < range_end_pfn)
-		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+	if (hole_start_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
 						zone, nid);
 
 	if (pgcnt)
Andrea Arcangeli Dec. 3, 2020, 5:31 p.m. UTC | #48
On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
> >    memblock.reserve that doesn't overlap any memblock.memory zone.
> 
> And, IMHO, this is the fundamental BUG.

We have a dozen arches that change all the time, new efi/e820 stuff,
always new bios configs; the memblock code must be able to cope with
the caller being wrong or changing behavior if the e820 map changes
behaviour.

Trying to work around memblock core deficiencies in the caller is what
124049decbb121ec32742c94fb5d9d6bed8f24d8 already did, and it doesn't
work and it's not even a clean thing to do.

In fact here there's really nothing to blame the caller for (i.e. the
e820 init code), unless you declare that there must always be overlap
(which I understood would break stuff for regions that must not have a
direct mapping).

The caller here is correct and it can be summarized as below:

    if (e820 map type != system_ram)
       memblock_reserve(range)
    else
       memblock_add(range)

Then:

   zone_boundaries = [ 16M, 4G, 0 ]
   free_area_init(zone_boundaries)
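
For clarity, a simplified sketch of what that flow boils down to (this
is not the literal arch/x86 code, just an illustration of the split
between .memory and .reserved and of the zone boundaries passed to
free_area_init()):

static void __init e820_to_memblock_sketch(void)
{
	int i;

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		if (entry->type == E820_TYPE_RAM)
			memblock_add(entry->addr, entry->size);
		else
			memblock_reserve(entry->addr, entry->size);
	}
}

static void __init zone_boundaries_sketch(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES] = {
		[ZONE_DMA]   = MAX_DMA_PFN,	/* 16M */
		[ZONE_DMA32] = MAX_DMA32_PFN,	/* 4G  */
	};

	free_area_init(max_zone_pfns);
}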

It's not the caller's fault if the e820 map in the BIOS starts with:

pfn 0-1 reserved
pfn 1-N system RAM

This is better fixed in a way that must not require any change in the
caller...

> There is RAM at address 0, there is a DIMM there, so why on earth this
> should not be a part of memblock.memory?

How can you blame the caller if you explicitly didn't require that
.reserved ranges must always overlap .memory ranges? In fact it was
explained as a feature to avoid the direct mapping.

Declaring that there must always be overlap would be one way to cope
with this issue, but even then memblock must detect a wrong caller,
work around it by doing a memblock_add call inside
memblock_reserve and by printing a warning to improve the caller. It
shouldn't crash at boot with the console off if the overlap is not there.
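
Something like the below, purely to sketch the idea (illustrative
only, not an actual proposal for memblock):

static int __init memblock_reserve_checked(phys_addr_t base,
					    phys_addr_t size)
{
	/*
	 * If a reserved range isn't fully covered by .memory, add it
	 * there too and warn so the caller can be improved, instead
	 * of leaving the struct pages uninitialized.
	 */
	if (!memblock_is_region_memory(base, size)) {
		pr_warn("memblock: reserving base %pa size %pa outside .memory\n",
			&base, &size);
		memblock_add(base, size);
	}
	return memblock_reserve(base, size);
}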

In my view the caller here is not to blame, all these issues are
because memblock needs improvement and must cope with any caller.

> >    The memblock_start_of_DRAM moves the start of the DMA zone above
> >    the pfn 0, so then pfn 0 already ends up in the no zone land David
> >    mentioned where it's not trivial to decide to give it a zoneid if
> >    it's not even spanned in the zone.
> > 
> >    Not just the zoneid, there's not even a nid.
> > 
> >    So we have a pfn with no zoneid, and no nid, and not part of the
> >    zone ranges, not part of the nid ranges not part of the
> >    memory.memblock.
> 
> We have a pfn that should have been in memblock.memory, in ZONE_DMA and
> in the first node with memory. If it was not trimmed from there 

My incremental patch already solves how to extend the zones spans and
the nid spans by taking memblock.reserved into account, not just
memblock.memory, but still without actually messing with the direct
mapping, otherwise I would have called memblock_add instead of walking
memblock.reserved when detecting the nid and zoneid ranges.

> 
> >    We can't really skip the initialization of the the pfn 0, it has to
> >    get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
> >    we'd crash again. (although we'd crash with a nice PagePoison(page)
> >    == true behavior)
> 
> Agree. Struct page for pfn should get the same zoneid and nodeid as pfn 1.

How are you sure there isn't a zone ID 0 that starts at pfn 0 and ends
at pfn 1 and the zone dma starts at pfn 1? In such case the pfn 0
would have a zoneid different from pfn 1.

I'm not exactly sure why we should initialize pfn 0 with the zoneid of
a zone that pfn 0 is not part of; that looks wrong.

I'm not exactly sure why you don't fix the zone->zone_start_pfn and
the respective nid node_start_pfn to start from pfn 0 instead, just
like I did in my patch.

> From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@linux.ibm.com>
> Date: Thu, 3 Dec 2020 11:40:17 +0200
> Subject: [PATCH] fixup for "mm: refactor initialization of struct page for holes"
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/page_alloc.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ce2bdaabdf96..86fde4424e87 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  			       unsigned long zone,
>  			       unsigned long range_start_pfn)
>  {
> -	unsigned long start_pfn, end_pfn, next_pfn = 0;
> +	static unsigned long hole_start_pfn;
> +	unsigned long start_pfn, end_pfn;
>  	unsigned long range_end_pfn = range_start_pfn + size;
>  	u64 pgcnt = 0;
>  	int i;
> @@ -6235,7 +6236,6 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>  		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> -		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
>  
>  		if (end_pfn > start_pfn) {
>  			size = end_pfn - start_pfn;
> @@ -6243,10 +6243,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>  		}
>  
> -		if (next_pfn < start_pfn)
> -			pgcnt += init_unavailable_range(next_pfn, start_pfn,
> -							zone, nid);
> -		next_pfn = end_pfn;
> +		if (hole_start_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(hole_start_pfn,
> +							start_pfn, zone, nid);
> +		hole_start_pfn = end_pfn;
>  	}
>  
>  	/*
> @@ -6256,8 +6256,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  	 * considered initialized. Make sure that memmap has a well defined
>  	 * state.
>  	 */
> -	if (next_pfn < range_end_pfn)
> -		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
> +	if (hole_start_pfn < range_end_pfn)
> +		pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
>  						zone, nid);
>  

1) this looks very specific for pfn 0

2) it's not optimal to call init_unavailable_range(0,) the first time
   at every zone initialization, that's a pretty inefficient loop for
   large RAM systems.

3) you give pfn 0 a zoneid despite it's not part of the zone, that
   looks wrong from a zoneid initialization point of view.


The bug here is that the DMA zone starts in pfn 1 which doesn't make
any technical sense since the start of the pageblock would then be
beyond the start_zone_pfn.

I don't really see how you can fix this, without ending with an even
bigger undefined special case to keep in mind where a pfn gets
initialized beyond the zone.

If anything, the real good alternative to changing the zone_start_pfn
to start from 0 instead of 1 is to keep the pfn 0 PagePoison and to
never call __SetPageReserved on it in reserve_bootmem_region. That is
also a very clean solution where everything remains in check. The
poison is effectively a NO_NODEID/NO_ZONEID because that pfn 0 really
is not spanned in any nid or any zone. Although we'd need to ensure
compaction won't ever try to access it, by checking zone_start_pfn so
that it can't rewind to the start of the pageblock. It'll cause an
inefficiency but that also looks cleaner than giving pfn 0 a random
zoneid beyond the zone span.
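
As a sketch of that last alternative (untested, just to show the kind
of clamping I mean in the freepage scanner):

static unsigned long clamped_pageblock_start_pfn(struct compact_control *cc,
						 unsigned long pfn)
{
	/*
	 * Never rewind below the zone start, so a leftover PagePoison
	 * pfn outside the zone span (like pfn 0 here) is never touched.
	 */
	return max(pageblock_start_pfn(pfn), cc->zone->zone_start_pfn);
}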

Thanks,
Andrea
Andrea Arcangeli Dec. 6, 2020, 2:26 a.m. UTC | #49
Hi Mel,

On Thu, Nov 26, 2020 at 10:47:20AM +0000, Mel Gorman wrote:
> Agreed. This thread has a lot of different directions in it at this
> point so what I'd hope for is first, a patch that initialises holes with
> zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> hole, it would be expected the pages are PageReserved. Second, a fix to
> fast_isolate that forces PFNs returned to always be within the stated
> zone boundaries.

The last two patches in
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/ should
resolve the struct page initialization, and the VM_BUG_ON_PAGE never
happened again as expected.

So I looked back to see whether the "distance" logic is accurate. I
added those checks and I get negative hits:

diff --git a/mm/compaction.c b/mm/compaction.c
index cc1a7f600a86..844a90b0acdf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
 	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
 
+	WARN_ON_ONCE((long) low_pfn < 0);
+	WARN_ON_ONCE((long) min_pfn < 0);
+	if ((long) low_pfn < 0)
+		return cc->free_pfn;
+	if ((long) min_pfn < 0)
+		return cc->free_pfn;
 	if (WARN_ON_ONCE(min_pfn > low_pfn))
 		low_pfn = min_pfn;

Both WARN_ON_ONCEs trigger, so it goes negative. While it doesn't
appear to crash the kernel, since pfn_valid won't succeed on negative
entries and they'll always be higher than any pfn in the freelist, is
this a sign that there's further room for improvement here? Presumably
it goes negative once the scanners have crossed: with cc->migrate_pfn
above cc->free_pfn the unsigned "distance" wraps around, and
subtracting distance >> 2 from cc->free_pfn underflows to a value that
is negative when cast to long.

Thanks,
Andrea
Mike Rapoport Dec. 6, 2020, 8:09 a.m. UTC | #50
On Thu, Dec 03, 2020 at 12:31:44PM -0500, Andrea Arcangeli wrote:
> On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> > On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > > 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
> > >    memblock.reserve that doesn't overlap any memblock.memory zone.
> > 
> > And, IMHO, this is the fundamental BUG.
> 
> We have a dozen arches that change all the time,

We have ~20 arches that work just fine and only x86 runs into troubles
from time to time ;-)

> new efi/e820 stuff, always new bios configs; the memblock code must
> be able to cope with the caller being wrong or changing behavior if
> the e820 map changes behaviour.

> Trying to work around memblock core deficiencies in the caller is what
> 124049decbb121ec32742c94fb5d9d6bed8f24d8 already did, and it doesn't
> work and it's not even a clean thing to do.

It's not exactly memblock core deficiencies, it's rather the whole path
from memory detection to the point where we have memmap and page
allocator functional.

I agree that the initialization of memmap and page allocator should be
improved. And 73a6e474cb376921a311786652782155eac2fdf0 was part of a
large series that largely reduced the complexity of
free_area_init() and friends. Particularly this commit removed a couple
of cycles from memmap init.

And when you touch such a sensitive and fragile area mistakes can
happen. So 73a6e474cb376921a311786652782155eac2fdf0 missed a corner
case, but I don't understand why you are so annoyed by "memblock
core"...

> In fact here there's really nothing to blame the caller for (i.e. the
> e820 init code), unless you declare that there must always be overlap
> (which I understood would break stuff for regions that must not have a
> direct mapping).
> 
> The caller here is correct and it can be summarized as below:
> 
>     if (e820 map type != system_ram)
>        memblock_reserve(range)
>     else
>        memblock_add(range)

I don't agree that memblock_add(range) should be in the else branch. PFN 0 is
*memory* so why don't we treat it as such?

I have a preliminary patch that actually does this and updates the x86
parts of memory initialization to keep it backward compatible.

> Then:
> 
>    zone_boundaries = [ 16M, 4G, 0 ]
>    free_area_init(zone_boundaries)
> 
> It's not the caller's fault if the e820 map in the BIOS starts with:
> 
> pfn 0-1 reserved
> pfn 1-N system RAM
 
So in your example, is pfn 0 a part of memory or not? Is there
physical memory at address 0?

I don't think that because BIOS uses this memory it stops being memory.

> This is better fixed in a way that must not require any change in the
> caller...

The generic code needs to be more robust, but I'm not sure that the
complexity around holes initialization vs. relatively simple update to
the caller is the best way to solve it.

> > There is RAM at address 0, there is a DIMM there, so why on earth this
> > should not be a part of memblock.memory?
> 
> How can you blame the caller if you explicitly didn't require that
> .reserved ranges must always overlap .memory ranges? In fact it was
> explained as a feature to avoid the direct mapping.
> 
> Declaring that there must always be overlap would be one way to cope
> with this issue, but even then memblock must detect a wrong caller,
> work around it by doing a memblock_add call inside
> memblock_reserve and by printing a warning to improve the caller. It
> shouldn't crash at boot with the console off if the overlap is not there.

If memblock had done this and had an implicit
memblock_add() for the reserved areas, we would hit the same issue with
regions where any access crashes the system.

> In my view the caller here is not to blame, all these issues are
> because memblock needs improvement and must cope with any caller.

The caller, which is low level architecture code, has the best knowledge
of the memory layout. It's the caller who translates that memory
layout to the generic abstraction. If that translation lacks details,
the generic code is limited in its ability to work around those missing
details.

In the case of e820 it seems that the existing abstraction of 'memory'
and 'reserved' is not sufficient and we need more details to describe
the physical memory. For instance 'memory', 'used by BIOS', 'reserved by
kernel'.

Still, this requires consensus between arch and mm about what is what.

> > >    The memblock_start_of_DRAM moves the start of the DMA zone above
> > >    the pfn 0, so then pfn 0 already ends up in the no zone land David
> > >    mentioned where it's not trivial to decide to give it a zoneid if
> > >    it's not even spanned in the zone.
> > > 
> > >    Not just the zoneid, there's not even a nid.
> > > 
> > >    So we have a pfn with no zoneid, and no nid, and not part of the
> > >    zone ranges, not part of the nid ranges not part of the
> > >    memory.memblock.
> > 
> > We have a pfn that should have been in memblock.memory, in ZONE_DMA and
> > in the first node with memory. If it was not trimmed from there 
> 
> My incremental patch already solves how to extend the zones spans and
> the nid spans by taking memblock.reserved into account, not just
> memblock.memory, but still without actually messing with the direct
> mapping, otherwise I would have called memblock_add instead of walking
> memblock.reserved when detecting the nid and zoneid ranges.

I've commented on the patch at other thread.

> > 
> > >    We can't really skip the initialization of the pfn 0, it has to
> > >    get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
> > >    we'd crash again. (although we'd crash with a nice PagePoison(page)
> > >    == true behavior)
> > 
> > Agree. Struct page for pfn 0 should get the same zoneid and nodeid as pfn 1.
> 
> How are you sure there isn't a zone ID 0 that starts at pfn 0 and ends
> at pfn 1 and the zone dma starts at pfn 1? In such case the pfn 0
> would have a zoneid different from pfn 1.
>
> I'm not exactly sure why we should initialize pfn 0 with the zoneid of
> a zone that pfn 0 is not part of; that looks wrong.
> 
> I'm not exactly sure why you don't fix the zone->zone_start_pfn and
> the respective nid node_start_pfn to start from pfn 0 instead, just
> like I did in my patch.

I think the correct fix for this is to have pfn 0 in memblock.memory.
Meanwhile, having this patch on top of "mm: refactor initialization of
struct page for holes in memory layout" is the least of all evils, see
also comments below.

> > From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > Date: Thu, 3 Dec 2020 11:40:17 +0200
> > Subject: [PATCH] fixup for "mm: refactor initialization of struct page for holes"
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  mm/page_alloc.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index ce2bdaabdf96..86fde4424e87 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
> >  			       unsigned long zone,
> >  			       unsigned long range_start_pfn)
> >  						zone, nid);

...
 
> 1) this looks very specific for pfn 0

Nope, this fix takes care of pfns "before start of RAM" in general.
To the best of my knowledge, this only happens to be pfn 0 on x86.

> 2) it's not optimal to call init_unavailable_range(0,) the first time
>    at every zone initialization, that's a pretty inefficient loop for
>    large RAM systems.

This is not called every time because hole_start_pfn is static.

> 3) you give pfn 0 a zoneid despite it's not part of the zone, that
>    looks wrong from a zoneid initialization point of view.
>
> The bug here is that the DMA zone starts in pfn 1 which doesn't make
> any technical sense since the start of the pageblock would then be
> beyond the start_zone_pfn.

Right, the bug is that DMA zone starts in pfn 1 because pfn 0 is not in
.memory and thus bypasses all the zone and node sizing calculations.

As architectures only specify max_zone_pfns when they call
free_area_init() I believe it is safe to presume that every pfn that is
smaller than max_zone_pfns[0] belongs to zone 0.
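
As a sketch of that presumption (illustrative only, the fixup itself
doesn't add such a helper):

static int __init zone_for_hole_pfn(unsigned long pfn,
				    const unsigned long *max_zone_pfns)
{
	int zid;

	/* the first zone whose limit is above the pfn gets the hole */
	for (zid = 0; zid < MAX_NR_ZONES - 1; zid++)
		if (pfn < max_zone_pfns[zid])
			break;

	return zid;
}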

And this fix, along with the interleaving of init_unavailable_mem()
inside memmap_init(), does exactly that. For pages magically appearing
in the range [0, memblock_start_of_DRAM()] I set the zone link in
struct page to zone[0].

> Thanks,
> Andrea
>
Mel Gorman Dec. 6, 2020, 11:47 p.m. UTC | #51
On Sat, Dec 05, 2020 at 09:26:47PM -0500, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Nov 26, 2020 at 10:47:20AM +0000, Mel Gorman wrote:
> > Agreed. This thread has a lot of different directions in it at this
> > point so what I'd hope for is first, a patch that initialises holes with
> > zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> > hole, it would be expected the pages are PageReserved. Second, a fix to
> > fast_isolate that forces PFNs returned to always be within the stated
> > zone boundaries.
> 
> The last two patches in
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/ should
> resolve the struct page initialization, and the VM_BUG_ON_PAGE never
> happened again as expected.
> 
> So I looked back to see whether the "distance" logic is accurate. I
> added those checks and I get negative hits:
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc1a7f600a86..844a90b0acdf 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
>  	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
>  	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
>  
> +	WARN_ON_ONCE((long) low_pfn < 0);
> +	WARN_ON_ONCE((long) min_pfn < 0);
> +	if ((long) low_pfn < 0)
> +		return cc->free_pfn;
> +	if ((long) min_pfn < 0)
> +		return cc->free_pfn;
>  	if (WARN_ON_ONCE(min_pfn > low_pfn))
>  		low_pfn = min_pfn;
> 
> Both WARN_ON_ONCEs trigger, so it goes negative. While it doesn't
> appear to crash the kernel, since pfn_valid won't succeed on negative
> entries and they'll always be higher than any pfn in the freelist, is
> this a sign that there's further room for improvement here?
> 

Possibly, checking the wrong pfns is simply risky. This is not tested
at all, just checking if it's in the right ballpark. The intent is that
when the free/migrate PFNs are too close or already overlapping, no
attempt is made and it returns back so the caller detects the scanners
have met.

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..208cb5857446 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1313,6 +1313,10 @@ fast_isolate_freepages(struct compact_control *cc)
 	if (cc->order <= 0)
 		return cc->free_pfn;
 
+	/* Ensure that migration and free scanner are not about to cross */
+	if (cc->migrate_pfn >= cc->free_pfn)
+		return cc->free_pfn;
+
 	/*
 	 * If starting the scan, use a deeper search and use the highest
 	 * PFN found if a suitable one is not found.
@@ -1324,9 +1328,12 @@ fast_isolate_freepages(struct compact_control *cc)
 
 	/*
 	 * Preferred point is in the top quarter of the scan space but take
-	 * a pfn from the top half if the search is problematic.
+	 * a pfn from the top half if the search is problematic. Ensure
+	 * there is enough distance to make the fast search worthwhile.
 	 */
 	distance = (cc->free_pfn - cc->migrate_pfn);
+	if (distance <= (pageblock_nr_pages << 2))
+		return cc->free_pfn;
 	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
Patch

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..d17e69549d34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,7 +1433,10 @@  fast_isolate_freepages(struct compact_control *cc)
 					page = pageblock_pfn_to_page(min_pfn,
 						pageblock_end_pfn(min_pfn),
 						cc->zone);
-					cc->free_pfn = min_pfn;
+					if (likely(!PageReserved(page)))
+						cc->free_pfn = min_pfn;
+					else
+						page = NULL;
 				}
 			}
 		}