
[v2,0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

Message ID: 20210421065108.1987-1-rppt@kernel.org

Message

Mike Rapoport April 21, 2021, 6:51 a.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1. 

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this, the core mm will be able to cope with the fact that it cannot
use NOMAP pages, and the holes created by NOMAP ranges within MAX_ORDER
blocks will be treated correctly even without the need for pfn_valid_within.

The patches are only boot tested on qemu-system-aarch64 so I'd really
appreciate memory stress tests on real hardware.

If this actually works we'll be one step closer to dropping the custom
pfn_valid() on arm64 altogether.

v2:
* Add check for PFN overflow in pfn_is_map_memory()
* Add Acked-by and Reviewed-by tags, thanks David.

v1: Link: https://lore.kernel.org/lkml/20210420090925.7457-1-rppt@kernel.org
* Add comment about the semantics of pfn_valid() as Anshuman suggested
* Extend comments about MEMBLOCK_NOMAP, per Anshuman
* Use pfn_is_map_memory() name for the exported wrapper for
  memblock_is_map_memory(). It is still local to arch/arm64 in the end
  because of header dependency issues.

rfc: Link: https://lore.kernel.org/lkml/20210407172607.8812-1-rppt@kernel.org
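
For reference, the exported wrapper with the v2 overflow check boils down
to roughly the following (a sketch, not the literal patch):

int pfn_is_map_memory(unsigned long pfn)
{
	phys_addr_t addr = PFN_PHYS(pfn);

	/* avoid false positives for bogus PFNs, see the pfn_valid() comment */
	if (PHYS_PFN(addr) != pfn)
		return 0;

	return memblock_is_map_memory(addr);
}
EXPORT_SYMBOL(pfn_is_map_memory);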

Mike Rapoport (4):
  include/linux/mmzone.h: add documentation for pfn_valid()
  memblock: update initialization of reserved pages
  arm64: decouple check whether pfn is in linear map from pfn_valid()
  arm64: drop pfn_valid_within() and simplify pfn_valid()

 arch/arm64/Kconfig              |  3 ---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  1 +
 arch/arm64/kvm/mmu.c            |  2 +-
 arch/arm64/mm/init.c            | 10 ++++++++--
 arch/arm64/mm/ioremap.c         |  4 ++--
 arch/arm64/mm/mmu.c             |  2 +-
 include/linux/memblock.h        |  4 +++-
 include/linux/mmzone.h          | 11 +++++++++++
 mm/memblock.c                   | 28 ++++++++++++++++++++++++++--
 10 files changed, 54 insertions(+), 13 deletions(-)

base-commit: e49d033bddf5b565044e2abe4241353959bc9120

Comments

Kefeng Wang April 22, 2021, 7 a.m. UTC | #1
On 2021/4/21 14:51, Mike Rapoport wrote:
> ...

Hi Mike, I have a question: without HOLES_IN_ZONE, the pfn_valid_within()
in move_freepages_block()->move_freepages() will be optimized away. If
there are holes in a zone, the 'struct page's (memory map) for the pfn
range of a hole will be freed by free_memmap(), and then the page traversal
in the zone (with holes) from move_freepages() will hit a wrong page; it
could then panic at the PageLRU(page) test, see link [1].

"The idea is to mark NOMAP pages as reserved in the memory map", I see 
the patch2 check memblock_is_nomap() in memory region
of memblock, but it seems that memblock_mark_nomap() is not called(maybe 
I missed), then memmap_init_reserved_pages() won't
work, so should the HOLES_IN_ZONE still be needed for generic mm code?

[1] https://lore.kernel.org/linux-arm-kernel/541193a6-2bce-f042-5bb2-88913d5f1047@arm.com/
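
For context, the move_freepages() loop in question looks roughly like this
as of v5.10 (a condensed sketch, trimmed of debug checks; names like
buddy_order() and move_to_free_list() per that release; pfn_valid_within()
compiles to 1 unless CONFIG_HOLES_IN_ZONE is set):

static int move_freepages(struct zone *zone,
			  struct page *start_page, struct page *end_page,
			  int migratetype, int *num_movable)
{
	struct page *page;
	unsigned int order;
	int pages_moved = 0;

	for (page = start_page; page <= end_page;) {
		if (!pfn_valid_within(page_to_pfn(page))) {
			page++;
			continue;
		}

		if (!PageBuddy(page)) {
			/*
			 * PageLRU() reads page->compound_head; on a struct
			 * page freed by free_memmap() that is poisoned
			 * garbage, hence the PageLRU panic.
			 */
			if (num_movable &&
			    (PageLRU(page) || __PageMovable(page)))
				(*num_movable)++;
			page++;
			continue;
		}

		order = buddy_order(page);
		move_to_free_list(page, zone, order, migratetype);
		page += 1 << order;
		pages_moved += 1 << order;
	}

	return pages_moved;
}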
Mike Rapoport April 22, 2021, 7:29 a.m. UTC | #2
On Thu, Apr 22, 2021 at 03:00:20PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/21 14:51, Mike Rapoport wrote:
> > ...
> 
> Hi Mike, I have a question: without HOLES_IN_ZONE, the pfn_valid_within()
> in move_freepages_block()->move_freepages() will be optimized away. If
> there are holes in a zone, the 'struct page's (memory map) for the pfn
> range of a hole will be freed by free_memmap(), and then the page traversal
> in the zone (with holes) from move_freepages() will hit a wrong page; it
> could then panic at the PageLRU(page) test, see link [1].

First, the HOLES_IN_ZONE name is hugely misleading: this configuration
option has nothing to do with memory holes, but rather it is there to deal
with holes or undefined struct pages in the memory map, when these holes
can be inside a MAX_ORDER_NR_PAGES region.

In general pfn walkers use pfn_valid() and pfn_valid_within() to avoid
accessing *missing* struct pages, like those that are freed at
free_memmap(). But on arm64 these tests also filter out the nomap entries
because their struct pages are not initialized.

The panic you refer to happened because there was an uninitialized struct
page in the middle of a MAX_ORDER_NR_PAGES region, as it corresponded to
nomap memory.

With these changes I make sure that such pages will be properly initialized
as PageReserved and the pfn walkers will be able to rely on the memory map.

Note also that free_memmap() aligns the parts being freed on MAX_ORDER
boundaries, so there will be no missing parts in the memory map within a
MAX_ORDER_NR_PAGES region.
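
To illustrate, these are the definitions from include/linux/mmzone.h,
followed by a made-up walker (for illustration only) showing the usual
pattern:

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn) pfn_valid(pfn)
#else
#define pfn_valid_within(pfn) (1)
#endif

static void walk_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;
	struct page *page;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!pfn_valid(pfn))		/* no memmap for this pfn at all */
			continue;
		if (!pfn_valid_within(pfn))	/* hole inside a MAX_ORDER block */
			continue;
		page = pfn_to_page(pfn);
		/* ... the struct page can be accessed safely here ... */
	}
}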
 
> "The idea is to mark NOMAP pages as reserved in the memory map", I see the
> patch2 check memblock_is_nomap() in memory region
> of memblock, but it seems that memblock_mark_nomap() is not called(maybe I
> missed), then memmap_init_reserved_pages() won't
> work, so should the HOLES_IN_ZONE still be needed for generic mm code?
> 
> [1] https://lore.kernel.org/linux-arm-kernel/541193a6-2bce-f042-5bb2-88913d5f1047@arm.com/
>
Kefeng Wang April 22, 2021, 3:28 p.m. UTC | #3
On 2021/4/22 15:29, Mike Rapoport wrote:
> On Thu, Apr 22, 2021 at 03:00:20PM +0800, Kefeng Wang wrote:
>> On 2021/4/21 14:51, Mike Rapoport wrote:
>>> ...
>> Hi Mike, I have a question: without HOLES_IN_ZONE, the pfn_valid_within()
>> in move_freepages_block()->move_freepages() will be optimized away. If
>> there are holes in a zone, the 'struct page's (memory map) for the pfn
>> range of a hole will be freed by free_memmap(), and then the page
>> traversal in the zone (with holes) from move_freepages() will hit a wrong
>> page; it could then panic at the PageLRU(page) test, see link [1].
> First, the HOLES_IN_ZONE name is hugely misleading: this configuration
> option has nothing to do with memory holes, but rather it is there to deal
> with holes or undefined struct pages in the memory map, when these holes
> can be inside a MAX_ORDER_NR_PAGES region.
>
> In general pfn walkers use pfn_valid() and pfn_valid_within() to avoid
> accessing *missing* struct pages, like those that are freed at
> free_memmap(). But on arm64 these tests also filter out the nomap entries
> because their struct pages are not initialized.
>
> The panic you refer to happened because there was an uninitialized struct
> page in the middle of a MAX_ORDER_NR_PAGES region, as it corresponded to
> nomap memory.
>
> With these changes I make sure that such pages will be properly initialized
> as PageReserved and the pfn walkers will be able to rely on the memory map.
>
> Note also that free_memmap() aligns the parts being freed on MAX_ORDER
> boundaries, so there will be no missing parts in the memory map within a
> MAX_ORDER_NR_PAGES region.

Ok, thanks. We hit the same panic as in the link on arm32 (without
HOLES_IN_ZONE). The scheme for arm64 could suit arm32 too, right? I will
try the patchset with some changes on arm32 and give some feedback.

Again, a stupid question: where does the region of memblock get marked
with the MEMBLOCK_NOMAP flag?


>   
>> "The idea is to mark NOMAP pages as reserved in the memory map", I see the
>> patch2 check memblock_is_nomap() in memory region
>> of memblock, but it seems that memblock_mark_nomap() is not called(maybe I
>> missed), then memmap_init_reserved_pages() won't
>> work, so should the HOLES_IN_ZONE still be needed for generic mm code?
>>
>> [1] https://lore.kernel.org/linux-arm-kernel/541193a6-2bce-f042-5bb2-88913d5f1047@arm.com/
>>
Kefeng Wang April 23, 2021, 8:11 a.m. UTC | #4
On 2021/4/22 23:28, Kefeng Wang wrote:
>
> On 2021/4/22 15:29, Mike Rapoport wrote:
>> On Thu, Apr 22, 2021 at 03:00:20PM +0800, Kefeng Wang wrote:
>>> On 2021/4/21 14:51, Mike Rapoport wrote:
>>>> ...
...
>
> Ok, thanks. We hit the same panic as in the link on arm32 (without
> HOLES_IN_ZONE). The scheme for arm64 could suit arm32 too, right? I will
> try the patchset with some changes on arm32 and give some feedback.

I tested this patchset (plus an arm32 change, like arm64 does) based on
LTS 5.10 and added some debug logging; the useful info is shown below. If
we enable HOLES_IN_ZONE, there is no panic. Any idea? Thanks.

Zone ranges:
   Normal   [mem 0x0000000080a00000-0x00000000b01fffff]
   HighMem  [mem 0x00000000b0200000-0x00000000ffffefff]
Movable zone start for each node
Early memory node ranges
   node   0: [mem 0x0000000080a00000-0x00000000855fffff]
   node   0: [mem 0x0000000086a00000-0x0000000087dfffff]
   node   0: [mem 0x000000008bd00000-0x000000008c4fffff]
   node   0: [mem 0x000000008e300000-0x000000008ecfffff]
   node   0: [mem 0x0000000090d00000-0x00000000bfffffff]
   node   0: [mem 0x00000000cc000000-0x00000000dc9fffff]
   node   0: [mem 0x00000000de700000-0x00000000de9fffff]
   node   0: [mem 0x00000000e0800000-0x00000000e0bfffff]
   node   0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
   node   0: [mem 0x00000000fda00000-0x00000000ffffefff]

----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86a00, 86a00000
----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e300, 8e300000
----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de700, de700000
----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
=== >move_freepages: start_pfn/end_pfn [de601, de7ff], [de600000, de7ff000] :  pfn =de600 pfn2phy = de600000 , page = ef3cc000, page-flags = ffffffff
8<--- cut here ---
Unable to handle kernel paging request at virtual address fffffffe
pgd = 5dd50df5
[fffffffe] *pgd=affff861, *pte=00000000, *ppte=00000000
Internal error: Oops: 37 [#1] SMP ARM
Modules linked in: gmac(O)
CPU: 2 PID: 635 Comm: test-oom Tainted: G           O      5.10.0+ #31
Hardware name: Hisilicon A9
PC is at move_freepages_block+0x150/0x278
LR is at move_freepages_block+0x150/0x278
pc : [<c02383a4>]    lr : [<c02383a4>]    psr: 200e0393
sp : c4179cf8  ip : 00000000  fp : 00000001
r10: c4179d58  r9 : 000de7ff  r8 : 00000000
r7 : c0863280  r6 : 000de600  r5 : 000de600  r4 : ef3cc000
r3 : ffffffff  r2 : 00000000  r1 : ef5d069c  r0 : fffffffe
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 1ac5387d  Table: 83b0c04a  DAC: 55555555
Process test-oom (pid: 635, stack limit = 0x25d667df)
Mike Rapoport April 25, 2021, 6:59 a.m. UTC | #5
On Thu, Apr 22, 2021 at 11:28:24PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/22 15:29, Mike Rapoport wrote:
> > On Thu, Apr 22, 2021 at 03:00:20PM +0800, Kefeng Wang wrote:
> > > On 2021/4/21 14:51, Mike Rapoport wrote:
> > > > ...
> > > Hi Mike, I have a question: without HOLES_IN_ZONE, the
> > > pfn_valid_within() in move_freepages_block()->move_freepages() will be
> > > optimized away. If there are holes in a zone, the 'struct page's
> > > (memory map) for the pfn range of a hole will be freed by
> > > free_memmap(), and then the page traversal in the zone (with holes)
> > > from move_freepages() will hit a wrong page; it could then panic at
> > > the PageLRU(page) test, see link [1].
> > First, the HOLES_IN_ZONE name is hugely misleading: this configuration
> > option has nothing to do with memory holes, but rather it is there to
> > deal with holes or undefined struct pages in the memory map, when these
> > holes can be inside a MAX_ORDER_NR_PAGES region.
> > 
> > In general pfn walkers use pfn_valid() and pfn_valid_within() to avoid
> > accessing *missing* struct pages, like those that are freed at
> > free_memmap(). But on arm64 these tests also filter out the nomap entries
> > because their struct pages are not initialized.
> > 
> > The panic you refer to happened because there was an uninitialized struct
> > page in the middle of a MAX_ORDER_NR_PAGES region, as it corresponded to
> > nomap memory.
> > 
> > With these changes I make sure that such pages will be properly initialized
> > as PageReserved and the pfn walkers will be able to rely on the memory map.
> > 
> > Note also that free_memmap() aligns the parts being freed on MAX_ORDER
> > boundaries, so there will be no missing parts in the memory map within a
> > MAX_ORDER_NR_PAGES region.
> 
> Ok, thanks. We hit the same panic as in the link on arm32 (without
> HOLES_IN_ZONE). The scheme for arm64 could suit arm32 too, right?

In general yes. You just need to make sure that usage of pfn_valid() in
arch/arm does not presume that it tests something beyond availability of
struct page for a pfn.
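
For comparison, arm32's pfn_valid() at the time was roughly the following
(sketched from arch/arm/mm/init.c around v5.10; note that it tests the
linear map, not the availability of the memory map):

int pfn_valid(unsigned long pfn)
{
	phys_addr_t addr = __pfn_to_phys(pfn);

	/* reject pfns that do not round-trip through phys_addr_t */
	if (__phys_to_pfn(addr) != pfn)
		return 0;

	/* false for NOMAP ranges even though they do have struct pages */
	return memblock_is_map_memory(addr);
}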
 
> I will try the patchset with some changes on arm32 and give some
> feedback.
> 
> Again, a stupid question: where does the region of memblock get marked
> with the MEMBLOCK_NOMAP flag?
 
Not sure I understand the question. The memory regions with "nomap"
property in the device tree will be marked MEMBLOCK_NOMAP.
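
Patch 2 then picks those regions up when initializing the memory map;
roughly (a sketch of memmap_init_reserved_pages() from mm/memblock.c as of
this series):

static void __init memmap_init_reserved_pages(void)
{
	struct memblock_region *region;
	phys_addr_t start, end;
	u64 i;

	/* initialize struct pages for the reserved regions */
	for_each_reserved_mem_range(i, &start, &end)
		reserve_bootmem_region(start, end);

	/* and also treat struct pages for the NOMAP regions as PageReserved */
	for_each_mem_region(region) {
		if (memblock_is_nomap(region)) {
			start = region->base;
			end = start + region->size;
			reserve_bootmem_region(start, end);
		}
	}
}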
 
> > > "The idea is to mark NOMAP pages as reserved in the memory map", I see the
> > > patch2 check memblock_is_nomap() in memory region
> > > of memblock, but it seems that memblock_mark_nomap() is not called(maybe I
> > > missed), then memmap_init_reserved_pages() won't
> > > work, so should the HOLES_IN_ZONE still be needed for generic mm code?
> > > 
> > > [1] https://lore.kernel.org/linux-arm-kernel/541193a6-2bce-f042-5bb2-88913d5f1047@arm.com/
> > >
Mike Rapoport April 25, 2021, 7:19 a.m. UTC | #6
On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> 
> I tested this patchset (plus an arm32 change, like arm64 does) based on
> LTS 5.10 and added some debug logging; the useful info is shown below. If
> we enable HOLES_IN_ZONE, there is no panic. Any idea? Thanks.
 
Are there any changes on top of 5.10 except for the pfn_valid() patch?
Do you see this panic on 5.10 without the changes?
Can you see a stack backtrace beyond move_freepages_block?

> ...
Mike Rapoport April 26, 2021, 5:20 a.m. UTC | #7
On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/25 15:19, Mike Rapoport wrote:
> 
>     On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> 
>         I tested this patchset (plus an arm32 change, like arm64 does)
>         based on LTS 5.10 and added some debug logging; the useful info
>         is shown below. If we enable HOLES_IN_ZONE, there is no panic.
>         Any idea? Thanks.
> 
> 
>     Are there any changes on top of 5.10 except for the pfn_valid() patch?
>     Do you see this panic on 5.10 without the changes?
> 
> Yes, there is some BSP support for an arm board based on 5.10; with or
> without your patch we get the same panic. The panic pfn=de600 is in the
> range [dcc00, de700], which is freed by free_memmap (start_pfn = dcc00,
> dcc00000; end_pfn = de700, de700000).
> 
> We see the PC is at PageLRU, same reason as in the arm64 panic log:
> 
>    "PageBuddy in move_freepages returns false
>     Then we call PageLRU, the macro calls PF_HEAD which is compound_page()
>     compound_page reads page->compound_head, it is 0xffffffffffffffff, so it
>     returns 0xfffffffffffffffe - and accessing this address causes crash"
> 
>     Can you see a stack backtrace beyond move_freepages_block?
> 
> I did some OOM tests, so the log is about memory allocation:
> 
> [<c02383c8>] (move_freepages_block) from [<c0238668>] (steal_suitable_fallback+0x174/0x1f4)
> [<c0238668>] (steal_suitable_fallback) from [<c023999c>] (get_page_from_freelist+0x490/0x9a4)

Hmm, this is called with a page from a free list; having a page from a
freed part of the memory map passed to steal_suitable_fallback() means that
there is an issue with the creation of the free lists.

Can you please add "memblock=debug" to the kernel command line and post the
log?
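
The crash mechanics quoted above come from compound_head(); roughly, per
include/linux/page-flags.h of that era:

static inline struct page *compound_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	/* bit 0 set means "tail page": the rest is the head page pointer */
	if (unlikely(head & 1))
		return (struct page *)(head - 1);
	return page;
}

A freed memmap entry is poisoned with 0xff bytes, so head reads back as all
ones; bit 0 is set, and the computed head pointer becomes ~0UL - 1, which on
32-bit is exactly the faulting address fffffffe in the oops.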

> [<c023999c>] (get_page_from_freelist) from [<c023a4dc>] (__alloc_pages_nodemask+0x188/0xc08)
> [<c023a4dc>] (__alloc_pages_nodemask) from [<c0223078>] (alloc_zeroed_user_highpage_movable+0x14/0x3c)
> [<c0223078>] (alloc_zeroed_user_highpage_movable) from [<c0226768>] (handle_mm_fault+0x254/0xac8)
> [<c0226768>] (handle_mm_fault) from [<c04ba09c>] (do_page_fault+0x228/0x2f4)
> [<c04ba09c>] (do_page_fault) from [<c0111d80>] (do_DataAbort+0x48/0xd0)
> [<c0111d80>] (do_DataAbort) from [<c0100e00>] (__dabt_usr+0x40/0x60)
> 
> 
> 
>         ...
> 
>
Kefeng Wang April 26, 2021, 3:26 p.m. UTC | #8
On 2021/4/26 13:20, Mike Rapoport wrote:
> On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
>> On 2021/4/25 15:19, Mike Rapoport wrote:
>>
>>      On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
>>
>>          I tested this patchset (plus an arm32 change, like arm64 does)
>>          based on LTS 5.10 and added some debug logging; the useful info
>>          is shown below. If we enable HOLES_IN_ZONE, there is no panic.
>>          Any idea? Thanks.
>>
>>
>>      Are there any changes on top of 5.10 except for the pfn_valid() patch?
>>      Do you see this panic on 5.10 without the changes?
>>
>> Yes, there is some BSP support for an arm board based on 5.10; with or
>> without your patch we get the same panic. The panic pfn=de600 is in the
>> range [dcc00, de700], which is freed by free_memmap (start_pfn = dcc00,
>> dcc00000; end_pfn = de700, de700000).
>>
>> We see the PC is at PageLRU, same reason as in the arm64 panic log:
>>
>>     "PageBuddy in move_freepages returns false
>>      Then we call PageLRU, the macro calls PF_HEAD which is compound_page()
>>      compound_page reads page->compound_head, it is 0xffffffffffffffff, so it
>>      returns 0xfffffffffffffffe - and accessing this address causes crash"
>>
>>      Can you see a stack backtrace beyond move_freepages_block?
>>
>> I did some OOM tests, so the log is about memory allocation:
>>
>> [<c02383c8>] (move_freepages_block) from [<c0238668>] (steal_suitable_fallback+0x174/0x1f4)
>> [<c0238668>] (steal_suitable_fallback) from [<c023999c>] (get_page_from_freelist+0x490/0x9a4)
> Hmm, this is called with a page from a free list; having a page from a
> freed part of the memory map passed to steal_suitable_fallback() means
> that there is an issue with the creation of the free lists.
>
> Can you please add "memblock=debug" to the kernel command line and post the
> log?

Here is the log,

CPU: ARMv7 Processor [413fc090] revision 0 (ARMv7), cr=1ac5387d

CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
OF: fdt: Machine model: HISI-CA9
memblock_add: [0x80a00000-0x855fffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0x86a00000-0x87dfffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0x8bd00000-0x8c4fffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0x8e300000-0x8ecfffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0x90d00000-0xbfffffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xcc000000-0xdc9fffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xe0800000-0xe0bfffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xf5300000-0xf5bfffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xf5c00000-0xf6ffffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xfe100000-0xfebfffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xfec00000-0xffffffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xde700000-0xde9fffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xf4b00000-0xf52fffff] early_init_dt_scan_memory+0x11c/0x188
memblock_add: [0xfda00000-0xfe0fffff] early_init_dt_scan_memory+0x11c/0x188
memblock_reserve: [0x80a01000-0x80a02d2e] setup_arch+0x68/0x5c4
Malformed early option 'vecpage_wrprotect'
Memory policy: Data cache writealloc
memblock_reserve: [0x80b00000-0x812e8057] arm_memblock_init+0x34/0x14c
memblock_reserve: [0x83000000-0x84ffffff] arm_memblock_init+0x100/0x14c
memblock_reserve: [0x80a04000-0x80a07fff] arm_memblock_init+0xa0/0x14c
memblock_reserve: [0x80a00000-0x80a02fff] hisi_mem_reserve+0x14/0x30
MEMBLOCK configuration:
  memory size = 0x4c0fffff reserved size = 0x027ef058
  memory.cnt  = 0xa
  memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
  memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
  memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
  memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
  memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
  memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
  memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
  memory[0x7]    [0xe0800000-0xe0bfffff], 0x00400000 bytes flags: 0x0
  memory[0x8]    [0xf4b00000-0xf6ffffff], 0x02500000 bytes flags: 0x0
  memory[0x9]    [0xfda00000-0xfffffffe], 0x025fffff bytes flags: 0x0
  reserved.cnt  = 0x4
  reserved[0x0]    [0x80a00000-0x80a02fff], 0x00003000 bytes flags: 0x0
  reserved[0x1]    [0x80a04000-0x80a07fff], 0x00004000 bytes flags: 0x0
  reserved[0x2]    [0x80b00000-0x812e8057], 0x007e8058 bytes flags: 0x0
  reserved[0x3]    [0x83000000-0x84ffffff], 0x02000000 bytes flags: 0x0
memblock_alloc_try_nid: 2097152 bytes align=0x200000 nid=-1 
from=0x00000000 max_addr=0x00000000 early_alloc+0x20/0x4c
memblock_reserve: [0xb0000000-0xb01fffff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x00000000 
max_addr=0x00000000 early_alloc+0x20/0x4c
memblock_reserve: [0xaffff000-0xafffffff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 40 bytes align=0x4 nid=-1 from=0x00000000 
max_addr=0x00000000 iotable_init+0x34/0xf0
memblock_reserve: [0xafffefd8-0xafffefff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x00000000 
max_addr=0x00000000 early_alloc+0x20/0x4c
memblock_reserve: [0xafffd000-0xafffdfff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x00000000 
max_addr=0x00000000 early_alloc+0x20/0x4c
memblock_reserve: [0xafffc000-0xafffcfff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x00000000 
max_addr=0x00000000 early_alloc+0x20/0x4c
memblock_reserve: [0xafffb000-0xafffbfff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x00000000 
max_addr=0x00000000 early_alloc+0x20/0x4c
memblock_reserve: [0xafffa000-0xafffafff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 384 bytes align=0x20 nid=0 from=0x00000000 
max_addr=0x00000000 sparse_init_nid+0x34/0x1d8
memblock_reserve: [0xafffee40-0xafffefbf] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_exact_nid_raw: 12582912 bytes align=0x80000 nid=0 
from=0xc09fffff max_addr=0x00000000 sparse_init_nid+0xec/0x1d8
memblock_reserve: [0xaf380000-0xaff7ffff] 
memblock_alloc_range_nid+0x104/0x13c
Zone ranges:
   Normal   [mem 0x0000000080a00000-0x00000000b01fffff]
   HighMem  [mem 0x00000000b0200000-0x00000000ffffefff]
Movable zone start for each node
Early memory node ranges
   node   0: [mem 0x0000000080a00000-0x00000000855fffff]
   node   0: [mem 0x0000000086a00000-0x0000000087dfffff]
   node   0: [mem 0x000000008bd00000-0x000000008c4fffff]
   node   0: [mem 0x000000008e300000-0x000000008ecfffff]
   node   0: [mem 0x0000000090d00000-0x00000000bfffffff]
   node   0: [mem 0x00000000cc000000-0x00000000dc9fffff]
   node   0: [mem 0x00000000de700000-0x00000000de9fffff]
   node   0: [mem 0x00000000e0800000-0x00000000e0bfffff]
   node   0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
   node   0: [mem 0x00000000fda00000-0x00000000ffffefff]
Zeroed struct page in unavailable ranges: 513 pages
Initmem setup node 0 [mem 0x0000000080a00000-0x00000000ffffefff]
On node 0 totalpages: 311551
   Normal zone: 1230 pages used for memmap
   Normal zone: 0 pages reserved
   Normal zone: 157440 pages, LIFO batch:31
   HighMem zone: 154111 pages, LIFO batch:31
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffee20-0xafffee3f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffee00-0xafffee1f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffede0-0xafffedff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffedc0-0xafffeddf] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffeda0-0xafffedbf] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffed80-0xafffed9f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffed60-0xafffed7f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffed40-0xafffed5f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffed20-0xafffed3f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_arch+0x440/0x5c4
memblock_reserve: [0xafffed00-0xafffed1f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 22396 bytes align=0x4 nid=-1 from=0x00000000 
max_addr=0x00000000 early_init_dt_alloc_memory_arch+0x30/0x64
memblock_reserve: [0xafff4884-0xafff9fff] 
memblock_alloc_range_nid+0x104/0x13c
[dts]:cpu type is 1380
memblock_alloc_try_nid: 404 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc.constprop.8+0x1c/0x24
memblock_reserve: [0xafffeb60-0xafffecf3] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 404 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc.constprop.8+0x1c/0x24
memblock_reserve: [0xafffe9c0-0xafffeb53] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafff3000-0xafff3fff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4096 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafff2000-0xafff2fff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 278528 bytes align=0x1000 nid=-1 from=0xc09fffff 
max_addr=0x00000000 pcpu_dfl_fc_alloc+0x28/0x34
memblock_reserve: [0xaffae000-0xafff1fff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_free: [0xaffbf000-0xaffbefff] pcpu_embed_first_chunk+0x5ec/0x6a8
memblock_free: [0xaffd0000-0xaffcffff] pcpu_embed_first_chunk+0x5ec/0x6a8
memblock_free: [0xaffe1000-0xaffe0fff] pcpu_embed_first_chunk+0x5ec/0x6a8
memblock_free: [0xafff2000-0xafff1fff] pcpu_embed_first_chunk+0x5ec/0x6a8
percpu: Embedded 17 pages/cpu s37044 r8192 d24396 u69632
memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffefc0-0xafffefc3] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe9a0-0xafffe9a3] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 16 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe980-0xafffe98f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 16 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe960-0xafffe96f] 
memblock_alloc_range_nid+0x104/0x13c
pcpu-alloc: s37044 r8192 d24396 u69632 alloc=17*4096
pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
memblock_alloc_try_nid: 128 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe8e0-0xafffe95f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 92 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe880-0xafffe8db] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 384 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe700-0xafffe87f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 388 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe560-0xafffe6e3] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 96 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe500-0xafffe55f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 92 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe4a0-0xafffe4fb] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 768 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe1a0-0xafffe49f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 772 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafff4580-0xafff4883] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 192 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 memblock_alloc+0x18/0x20
memblock_reserve: [0xafffe0e0-0xafffe19f] 
memblock_alloc_range_nid+0x104/0x13c
memblock_free: [0xafff3000-0xafff3fff] pcpu_embed_first_chunk+0x570/0x6a8
memblock_free: [0xafff2000-0xafff2fff] pcpu_embed_first_chunk+0x58c/0x6a8
Built 1 zonelists, mobility grouping on.  Total pages: 310321
Kernel command line: console=ttyAMA0,9600n8N lpj=8000000 
initrd=0x83000000,0x2000000 maxcpus=4 master_cpu=1 quiet highres=off  
oops=panic vecpage_wrprotect ksm=1 ramdisk_size=30720 kmemleak=off 
min_loop=128 lockd.nlm_tcpport=13001 lockd.nlm_udpport=13001 
rdinit=/sbin/init root=/dev/ram0 vmalloc=256M
printk: log_buf_len individual max cpu contribution: 4096 bytes
printk: log_buf_len total cpu_extra contributions: 12288 bytes
printk: log_buf_len min size: 16384 bytes
memblock_alloc_try_nid: 32768 bytes align=0x4 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_log_buf+0xe4/0x404
memblock_reserve: [0xaffa6000-0xaffadfff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 12288 bytes align=0x4 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_log_buf+0x130/0x404
memblock_reserve: [0xaffa3000-0xaffa5fff] 
memblock_alloc_range_nid+0x104/0x13c
memblock_alloc_try_nid: 90112 bytes align=0x4 nid=-1 from=0x00000000 
max_addr=0x00000000 setup_log_buf+0x180/0x404
memblock_reserve: [0xaff8d000-0xaffa2fff] 
memblock_alloc_range_nid+0x104/0x13c
printk: log_buf_len: 32768 bytes
printk: early log buf free: 2492(15%)
memblock_alloc_try_nid: 524288 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 alloc_large_system_hash+0x1b0/0x2e8
memblock_reserve: [0xaf300000-0xaf37ffff] 
memblock_alloc_range_nid+0x104/0x13c
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
memblock_alloc_try_nid: 262144 bytes align=0x20 nid=-1 from=0x00000000 
max_addr=0x00000000 alloc_large_system_hash+0x1b0/0x2e8
memblock_reserve: [0xaf2c0000-0xaf2fffff] 
memblock_alloc_range_nid+0x104/0x13c
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
memblock_free: [0xaf430000-0xaf453fff] mem_init+0x154/0x238
memblock_free: [0xaf510000-0xaf545fff] mem_init+0x154/0x238
memblock_free: [0xaf560000-0xaf57ffff] mem_init+0x154/0x238
memblock_free: [0xafd98000-0xafdcdfff] mem_init+0x154/0x238
memblock_free: [0xafdd8000-0xafdfffff] mem_init+0x154/0x238
memblock_free: [0xafe18000-0xafe7ffff] mem_init+0x154/0x238
memblock_free: [0xafee0000-0xafefffff] mem_init+0x154/0x238
Memory: 1191160K/1246204K available (4096K kernel code, 436K rwdata, 
1120K rodata, 1024K init, 491K bss, 55044K reserved, 0K cma-reserved, 
616444K highmem)

>> [<c023999c>] (get_page_from_freelist) from [<c023a4dc>] (__alloc_pages_nodemask+0x188/0xc08)
>> [<c023a4dc>] (__alloc_pages_nodemask) from [<c0223078>] (alloc_zeroed_user_highpage_movable+0x14/0x3c)
>> [<c0223078>] (alloc_zeroed_user_highpage_movable) from [<c0226768>] (handle_mm_fault+0x254/0xac8)
>> [<c0226768>] (handle_mm_fault) from [<c04ba09c>] (do_page_fault+0x228/0x2f4)
>> [<c04ba09c>] (do_page_fault) from [<c0111d80>] (do_DataAbort+0x48/0xd0)
>> [<c0111d80>] (do_DataAbort) from [<c0100e00>] (__dabt_usr+0x40/0x60)
>>
>>
>>
>>          ...
>>
>>
Mike Rapoport April 27, 2021, 6:23 a.m. UTC | #9
On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/26 13:20, Mike Rapoport wrote:
> > On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> > > On 2021/4/25 15:19, Mike Rapoport wrote:
> > > 
> > >      On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> > > 
> > >          I tested this patchset (plus an arm32 change, like arm64
> > >          does) based on LTS 5.10 and added some debug logging; the
> > >          useful info is shown below. If we enable HOLES_IN_ZONE,
> > >          there is no panic. Any idea? Thanks.
> > > 
> > >      Are there any changes on top of 5.10 except for the pfn_valid() patch?
> > >      Do you see this panic on 5.10 without the changes?
> > > 
> > > Yes, there is some BSP support for an arm board based on 5.10;

Is it possible to test 5.12?

> > > with or without your patch we get the same panic. The panic pfn=de600
> > > is in the range [dcc00, de700], which is freed by free_memmap
> > > (start_pfn = dcc00, dcc00000; end_pfn = de700, de700000).
> > > 
> > > We see the PC is at PageLRU, same reason as in the arm64 panic log:
> > > 
> > >     "PageBuddy in move_freepages returns false
> > >      Then we call PageLRU, the macro calls PF_HEAD which is compound_page()
> > >      compound_page reads page->compound_head, it is 0xffffffffffffffff, so it
> > >      returns 0xfffffffffffffffe - and accessing this address causes crash"
> > > 
> > >      Can you see a stack backtrace beyond move_freepages_block?
> > > 
> > > I did some OOM tests, so the log is about memory allocation:
> > > 
> > > [<c02383c8>] (move_freepages_block) from [<c0238668>] (steal_suitable_fallback+0x174/0x1f4)
> > > [<c0238668>] (steal_suitable_fallback) from [<c023999c>] (get_page_from_freelist+0x490/0x9a4)
> >
> > Hmm, this is called with a page from a free list; having a page from a
> > freed part of the memory map passed to steal_suitable_fallback() means
> > that there is an issue with the creation of the free lists.
> > 
> > Can you please add "memblock=debug" to the kernel command line and post the
> > log?
> 
> Here is the log,
> 
...
> Zone ranges:
>   Normal   [mem 0x0000000080a00000-0x00000000b01fffff]
>   HighMem  [mem 0x00000000b0200000-0x00000000ffffefff]
> Movable zone start for each node
> Early memory node ranges
>   node   0: [mem 0x0000000080a00000-0x00000000855fffff]
>   node   0: [mem 0x0000000086a00000-0x0000000087dfffff]
>   node   0: [mem 0x000000008bd00000-0x000000008c4fffff]
>   node   0: [mem 0x000000008e300000-0x000000008ecfffff]
>   node   0: [mem 0x0000000090d00000-0x00000000bfffffff]
>   node   0: [mem 0x00000000cc000000-0x00000000dc9fffff]
>   node   0: [mem 0x00000000de700000-0x00000000de9fffff]
>   node   0: [mem 0x00000000e0800000-0x00000000e0bfffff]
>   node   0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
>   node   0: [mem 0x00000000fda00000-0x00000000ffffefff]
> Zeroed struct page in unavailable ranges: 513 pages
> Initmem setup node 0 [mem 0x0000000080a00000-0x00000000ffffefff]
> On node 0 totalpages: 311551
>   Normal zone: 1230 pages used for memmap
>   Normal zone: 0 pages reserved
>   Normal zone: 157440 pages, LIFO batch:31
>   HighMem zone: 154111 pages, LIFO batch:31

AFAICT the range [de600000, de7ff000] should not be added to the free
lists.

Can you try with the below patch:

diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..7f3c33d53f87 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1994,6 +1994,8 @@ static unsigned long __init __free_memory_core(phys_addr_t start,
 	unsigned long end_pfn = min_t(unsigned long,
 				      PFN_DOWN(end), max_low_pfn);
 
+	pr_info("%s: range: %pa - %pa, pfn: %lx - %lx\n", __func__, &start, &end, start_pfn, end_pfn);
+
 	if (start_pfn >= end_pfn)
 		return 0;
 
 
> > > ...
Kefeng Wang April 27, 2021, 11:08 a.m. UTC | #10
On 2021/4/27 14:23, Mike Rapoport wrote:
> On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
>> On 2021/4/26 13:20, Mike Rapoport wrote:
>>> On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
>>>> On 2021/4/25 15:19, Mike Rapoport wrote:
>>>>
>>>>       On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
>>>>
>>>>           I tested this patchset (plus an arm32 change, like arm64
>>>>           does) based on LTS 5.10 and added some debug logging; the
>>>>           useful info is shown below. If we enable HOLES_IN_ZONE,
>>>>           there is no panic. Any idea? Thanks.
>>>>
>>>>       Are there any changes on top of 5.10 except for the pfn_valid() patch?
>>>>       Do you see this panic on 5.10 without the changes?
>>>>
>>>> Yes, there is some BSP support for an arm board based on 5.10;
> Is it possible to test 5.12?
>
>>>> with or without your patch we get the same panic. The panic pfn=de600
>>>> is in the range [dcc00, de700], which is freed by free_memmap
>>>> (start_pfn = dcc00, dcc00000; end_pfn = de700, de700000).
>>>>
>>>> We see the PC is at PageLRU, same reason as in the arm64 panic log:
>>>>
>>>>      "PageBuddy in move_freepages returns false
>>>>       Then we call PageLRU, the macro calls PF_HEAD which is compound_page()
>>>>       compound_page reads page->compound_head, it is 0xffffffffffffffff, so it
>>>>       returns 0xfffffffffffffffe - and accessing this address causes crash"
>>>>
>>>>       Can you see a stack backtrace beyond move_freepages_block?
>>>>
>>>> I did some OOM tests, so the log is about memory allocation:
>>>>
>>>> [<c02383c8>] (move_freepages_block) from [<c0238668>] (steal_suitable_fallback+0x174/0x1f4)
>>>> [<c0238668>] (steal_suitable_fallback) from [<c023999c>] (get_page_from_freelist+0x490/0x9a4)
>>> Hmm, this is called with a page from the free list. Having a page from a freed
>>> part of the memory map passed to steal_suitable_fallback() means that there
>>> is an issue with the creation of the free list.
>>>
>>> Can you please add "memblock=debug" to the kernel command line and post the
>>> log?
>> Here is the log,
>>
>> CPU: ARMv7 Processor [413fc090] revision 0 (ARMv7), cr=1ac5387d
>>
>> CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
>> OF: fdt: Machine model: HISI-CA9
>> memblock_add: [0x80a00000-0x855fffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0x86a00000-0x87dfffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0x8bd00000-0x8c4fffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0x8e300000-0x8ecfffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0x90d00000-0xbfffffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xcc000000-0xdc9fffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xe0800000-0xe0bfffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xf5300000-0xf5bfffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xf5c00000-0xf6ffffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xfe100000-0xfebfffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xfec00000-0xffffffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xde700000-0xde9fffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xf4b00000-0xf52fffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_add: [0xfda00000-0xfe0fffff] early_init_dt_scan_memory+0x11c/0x188
>> memblock_reserve: [0x80a01000-0x80a02d2e] setup_arch+0x68/0x5c4
>> Malformed early option 'vecpage_wrprotect'
>> Memory policy: Data cache writealloc
>> memblock_reserve: [0x80b00000-0x812e8057] arm_memblock_init+0x34/0x14c
>> memblock_reserve: [0x83000000-0x84ffffff] arm_memblock_init+0x100/0x14c
>> memblock_reserve: [0x80a04000-0x80a07fff] arm_memblock_init+0xa0/0x14c
>> memblock_reserve: [0x80a00000-0x80a02fff] hisi_mem_reserve+0x14/0x30
>> MEMBLOCK configuration:
>>   memory size = 0x4c0fffff reserved size = 0x027ef058
>>   memory.cnt  = 0xa
>>   memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
>>   memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
>>   memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
>>   memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
>>   memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
>>   memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
>>   memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
>>   memory[0x7]    [0xe0800000-0xe0bfffff], 0x00400000 bytes flags: 0x0
>>   memory[0x8]    [0xf4b00000-0xf6ffffff], 0x02500000 bytes flags: 0x0
>>   memory[0x9]    [0xfda00000-0xfffffffe], 0x025fffff bytes flags: 0x0
>>   reserved.cnt  = 0x4
>>   reserved[0x0]    [0x80a00000-0x80a02fff], 0x00003000 bytes flags: 0x0
>>   reserved[0x1]    [0x80a04000-0x80a07fff], 0x00004000 bytes flags: 0x0
>>   reserved[0x2]    [0x80b00000-0x812e8057], 0x007e8058 bytes flags: 0x0
>>   reserved[0x3]    [0x83000000-0x84ffffff], 0x02000000 bytes flags: 0x0
> ...
>> Zone ranges:
>>    Normal   [mem 0x0000000080a00000-0x00000000b01fffff]
>>    HighMem  [mem 0x00000000b0200000-0x00000000ffffefff]
>> Movable zone start for each node
>> Early memory node ranges
>>    node   0: [mem 0x0000000080a00000-0x00000000855fffff]
>>    node   0: [mem 0x0000000086a00000-0x0000000087dfffff]
>>    node   0: [mem 0x000000008bd00000-0x000000008c4fffff]
>>    node   0: [mem 0x000000008e300000-0x000000008ecfffff]
>>    node   0: [mem 0x0000000090d00000-0x00000000bfffffff]
>>    node   0: [mem 0x00000000cc000000-0x00000000dc9fffff]
>>    node   0: [mem 0x00000000de700000-0x00000000de9fffff]
>>    node   0: [mem 0x00000000e0800000-0x00000000e0bfffff]
>>    node   0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
>>    node   0: [mem 0x00000000fda00000-0x00000000ffffefff]
>> Zeroed struct page in unavailable ranges: 513 pages
>> Initmem setup node 0 [mem 0x0000000080a00000-0x00000000ffffefff]
>> On node 0 totalpages: 311551
>>    Normal zone: 1230 pages used for memmap
>>    Normal zone: 0 pages reserved
>>    Normal zone: 157440 pages, LIFO batch:31
>>    HighMem zone: 154111 pages, LIFO batch:31
> AFAICT the range [de600000, de7ff000] should not be added to the free
> lists.
>
> Can you try with the below patch:
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index afaefa8fc6ab..7f3c33d53f87 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1994,6 +1994,8 @@ static unsigned long __init __free_memory_core(phys_addr_t start,
>   	unsigned long end_pfn = min_t(unsigned long,
>   				      PFN_DOWN(end), max_low_pfn);
>   
> +	pr_info("%s: range: %pa - %pa, pfn: %lx - %lx\n", __func__, &start, &end, start_pfn, end_pfn);
> +
>   	if (start_pfn >= end_pfn)
>   		return 0;
>   
__free_memory_core, range: 0x80a03000 - 0x80a04000, pfn: 80a03 - 80a04
__free_memory_core, range: 0x80a08000 - 0x80b00000, pfn: 80a08 - 80b00
__free_memory_core, range: 0x812e8058 - 0x83000000, pfn: 812e9 - 83000
__free_memory_core, range: 0x85000000 - 0x85600000, pfn: 85000 - 85600
__free_memory_core, range: 0x86a00000 - 0x87e00000, pfn: 86a00 - 87e00
__free_memory_core, range: 0x8bd00000 - 0x8c500000, pfn: 8bd00 - 8c500
__free_memory_core, range: 0x8e300000 - 0x8ed00000, pfn: 8e300 - 8ed00
__free_memory_core, range: 0x90d00000 - 0xaf2c0000, pfn: 90d00 - af2c0
__free_memory_core, range: 0xaf430000 - 0xaf454000, pfn: af430 - af454
__free_memory_core, range: 0xaf510000 - 0xaf546000, pfn: af510 - af546
__free_memory_core, range: 0xaf560000 - 0xaf580000, pfn: af560 - af580
__free_memory_core, range: 0xafd98000 - 0xafdce000, pfn: afd98 - afdce
__free_memory_core, range: 0xafdd8000 - 0xafe00000, pfn: afdd8 - afe00
__free_memory_core, range: 0xafe18000 - 0xafe80000, pfn: afe18 - afe80
__free_memory_core, range: 0xafee0000 - 0xaff00000, pfn: afee0 - aff00
__free_memory_core, range: 0xaff80000 - 0xaff8d000, pfn: aff80 - aff8d
__free_memory_core, range: 0xafff2000 - 0xafff4580, pfn: afff2 - afff4
__free_memory_core, range: 0xafffe000 - 0xafffe0e0, pfn: afffe - afffe
__free_memory_core, range: 0xafffe4fc - 0xafffe500, pfn: affff - afffe
__free_memory_core, range: 0xafffe6e4 - 0xafffe700, pfn: affff - afffe
__free_memory_core, range: 0xafffe8dc - 0xafffe8e0, pfn: affff - afffe
__free_memory_core, range: 0xafffe970 - 0xafffe980, pfn: affff - afffe
__free_memory_core, range: 0xafffe990 - 0xafffe9a0, pfn: affff - afffe
__free_memory_core, range: 0xafffe9a4 - 0xafffe9c0, pfn: affff - afffe
__free_memory_core, range: 0xafffeb54 - 0xafffeb60, pfn: affff - afffe
__free_memory_core, range: 0xafffecf4 - 0xafffed00, pfn: affff - afffe
__free_memory_core, range: 0xafffefc4 - 0xafffefd8, pfn: affff - afffe
__free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
__free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
__free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
__free_memory_core, range: 0xe0800000 - 0xe0c00000, pfn: e0800 - b0200
__free_memory_core, range: 0xf4b00000 - 0xf7000000, pfn: f4b00 - b0200
__free_memory_core, range: 0xfda00000 - 0xffffffff, pfn: fda00 - b0200
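
(It looks like the end pfn column is capped at b0200 for everything above
lowmem; that matches the clamp in the debug patch, since max_low_pfn =
0xb0200 here (the Normal zone ends at 0xb0200000):

	end_pfn = min_t(unsigned long, PFN_DOWN(end), max_low_pfn);

so all the highmem ranges fail the start_pfn >= end_pfn check and are
skipped by __free_memory_core().)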

>   
>>>> [<c023999c>] (get_page_from_freelist) from [<c023a4dc>] (__alloc_pages_nodemask+0x188/0xc08)
>>>> [<c023a4dc>] (__alloc_pages_nodemask) from [<c0223078>] (alloc_zeroed_user_highpage_movable+0x14/0x3c)
>>>> [<c0223078>] (alloc_zeroed_user_highpage_movable) from [<c0226768>] (handle_mm_fault+0x254/0xac8)
>>>> [<c0226768>] (handle_mm_fault) from [<c04ba09c>] (do_page_fault+0x228/0x2f4)
>>>> [<c04ba09c>] (do_page_fault) from [<c0111d80>] (do_DataAbort+0x48/0xd0)
>>>> [<c0111d80>] (do_DataAbort) from [<c0100e00>] (__dabt_usr+0x40/0x60)
>>>>
>>>>           Zone ranges:
>>>>             Normal   [mem 0x0000000080a00000-0x00000000b01fffff]
>>>>             HighMem  [mem 0x00000000b0200000-0x00000000ffffefff]
>>>>           Movable zone start for each node
>>>>           Early memory node ranges
>>>>             node   0: [mem 0x0000000080a00000-0x00000000855fffff]
>>>>             node   0: [mem 0x0000000086a00000-0x0000000087dfffff]
>>>>             node   0: [mem 0x000000008bd00000-0x000000008c4fffff]
>>>>             node   0: [mem 0x000000008e300000-0x000000008ecfffff]
>>>>             node   0: [mem 0x0000000090d00000-0x00000000bfffffff]
>>>>             node   0: [mem 0x00000000cc000000-0x00000000dc9fffff]
>>>>             node   0: [mem 0x00000000de700000-0x00000000de9fffff]
>>>>             node   0: [mem 0x00000000e0800000-0x00000000e0bfffff]
>>>>             node   0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
>>>>             node   0: [mem 0x00000000fda00000-0x00000000ffffefff]
>>>>
>>>>           ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86a00, 86a00000
>>>>           ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e300, 8e300000
>>>>           ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
>>>>           ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de700, de700000
>>>>           ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
>>>>           ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
>>>>           ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
>>>>           === >move_freepages: start_pfn/end_pfn [de601, de7ff], [de600000, de7ff000]
>>>>           :  pfn =de600 pfn2phy = de600000 , page = ef3cc000, page-flags = ffffffff
>>>>           8<--- cut here ---
>>>>           Unable to handle kernel paging request at virtual address fffffffe
>>>>           pgd = 5dd50df5
>>>>           [fffffffe] *pgd=affff861, *pte=00000000, *ppte=00000000
>>>>           Internal error: Oops: 37 [#1] SMP ARM
>>>>           Modules linked in: gmac(O)
>>>>           CPU: 2 PID: 635 Comm: test-oom Tainted: G           O      5.10.0+ #31
>>>>           Hardware name: Hisilicon A9
>>>>           PC is at move_freepages_block+0x150/0x278
>>>>           LR is at move_freepages_block+0x150/0x278
>>>>           pc : [<c02383a4>]    lr : [<c02383a4>]    psr: 200e0393
>>>>           sp : c4179cf8  ip : 00000000  fp : 00000001
>>>>           r10: c4179d58  r9 : 000de7ff  r8 : 00000000
>>>>           r7 : c0863280  r6 : 000de600  r5 : 000de600  r4 : ef3cc000
>>>>           r3 : ffffffff  r2 : 00000000  r1 : ef5d069c  r0 : fffffffe
>>>>           Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
>>>>           Control: 1ac5387d  Table: 83b0c04a  DAC: 55555555
>>>>           Process test-oom (pid: 635, stack limit = 0x25d667df)
>>>>
>>>>
Mike Rapoport April 28, 2021, 5:59 a.m. UTC | #11
On Tue, Apr 27, 2021 at 07:08:59PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/27 14:23, Mike Rapoport wrote:
> > On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
> > > On 2021/4/26 13:20, Mike Rapoport wrote:
> > > > On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> > > > > On 2021/4/25 15:19, Mike Rapoport wrote:
> > > > > 
> > > > >       On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> > > > > 
> > > > >           I tested this patchset (plus an arm32 change, like arm64 does)
> > > > >           based on LTS 5.10 and added some debug logs; the useful info is
> > > > >           below. If we enable HOLES_IN_ZONE, there is no panic. Any idea?
> > > > >           Thanks.
> > > > > 
> > > > >       Are there any changes on top of 5.10 except for pfn_valid() patch?
> > > > >       Do you see this panic on 5.10 without the changes?
> > > > > 
> > > > > Yes, there is some BSP support for our arm board based on 5.10;
> > Is it possible to test 5.12?

Do you use SPARSEMEM? If yes, what is your section size?
What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
Kefeng Wang April 29, 2021, 12:48 a.m. UTC | #12
On 2021/4/28 13:59, Mike Rapoport wrote:
> On Tue, Apr 27, 2021 at 07:08:59PM +0800, Kefeng Wang wrote:
>> On 2021/4/27 14:23, Mike Rapoport wrote:
>>> On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
>>>> On 2021/4/26 13:20, Mike Rapoport wrote:
>>>>> On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
>>>>>> On 2021/4/25 15:19, Mike Rapoport wrote:
>>>>>>
>>>>>>        On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
>>>>>>
>>>>>>            I tested this patchset (plus an arm32 change, like arm64 does)
>>>>>>            based on LTS 5.10 and added some debug logs; the useful info is
>>>>>>            below. If we enable HOLES_IN_ZONE, there is no panic. Any idea?
>>>>>>            Thanks.
>>>>>>
>>>>>>        Are there any changes on top of 5.10 except for pfn_valid() patch?
>>>>>>        Do you see this panic on 5.10 without the changes?
>>>>>>
>>>>>> Yes, there is some BSP support for our arm board based on 5.10;
>>> Is it possible to test 5.12?
> Do you use SPARSEMEM? If yes, what is your section size?
> What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?

Yes,

CONFIG_SPARSEMEM=y

CONFIG_SPARSEMEM_STATIC=y

CONFIG_FORCE_MAX_ZONEORDER = 11

CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HAVE_ARCH_PFN_VALID=y
CONFIG_HIGHMEM=y
#define SECTION_SIZE_BITS    26
#define MAX_PHYSADDR_BITS    32
#define MAX_PHYSMEM_BITS     32


>
Mike Rapoport April 29, 2021, 6:57 a.m. UTC | #13
On Thu, Apr 29, 2021 at 08:48:26AM +0800, Kefeng Wang wrote:
> 
> On 2021/4/28 13:59, Mike Rapoport wrote:
> > On Tue, Apr 27, 2021 at 07:08:59PM +0800, Kefeng Wang wrote:
> > > On 2021/4/27 14:23, Mike Rapoport wrote:
> > > > On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
> > > > > On 2021/4/26 13:20, Mike Rapoport wrote:
> > > > > > On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> > > > > > > On 2021/4/25 15:19, Mike Rapoport wrote:
> > > > > > > 
> > > > > > >        On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> > > > > > > 
> > > > > > >            I tested this patchset (plus an arm32 change, like arm64 does)
> > > > > > >            based on LTS 5.10 and added some debug logs; the useful info is
> > > > > > >            below. If we enable HOLES_IN_ZONE, there is no panic. Any idea?
> > > > > > >            Thanks.
> > > > > > > 
> > > > > > >        Are there any changes on top of 5.10 except for pfn_valid() patch?
> > > > > > >        Do you see this panic on 5.10 without the changes?
> > > > > > > 
> > > > > > > Yes, there is some BSP support for our arm board based on 5.10;
> > > > Is it possible to test 5.12?
> > Do you use SPARSEMEM? If yes, what is your section size?
> > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
> 
> Yes,
> 
> CONFIG_SPARSEMEM=y
> 
> CONFIG_SPARSEMEM_STATIC=y
> 
> CONFIG_FORCE_MAX_ZONEORDER = 11
> 
> CONFIG_PAGE_OFFSET=0xC0000000
> CONFIG_HAVE_ARCH_PFN_VALID=y
> CONFIG_HIGHMEM=y
> #define SECTION_SIZE_BITS    26
> #define MAX_PHYSADDR_BITS    32
> #define MAX_PHYSMEM_BITS     32

It seems that with SPARSEMEM we don't align the freed parts on pageblock
boundaries.

Can you try the patch below:

diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..1926369b52ec 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
 		 * due to SPARSEMEM sections which aren't present.
 		 */
 		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
-#else
+#endif
 		/*
 		 * Align down here since the VM subsystem insists that the
 		 * memmap entries are valid from the bank start aligned to
 		 * MAX_ORDER_NR_PAGES.
 		 */
 		start = round_down(start, MAX_ORDER_NR_PAGES);
-#endif
 
 		/*
 		 * If we had a previous bank, and there is a space
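
For the layout here that should mean (illustrative arithmetic, using
PAGES_PER_SECTION = 1 << (26 - 12) = 0x4000 and MAX_ORDER_NR_PAGES =
1 << (11 - 1) = 0x400 from your config):

	/* for the bank starting at pfn 0xde700 (previous bank ends at 0xdca00): */
	start = min(0xde700, ALIGN(prev_end, 0x4000));	/* stays 0xde700 */
	start = round_down(start, 0x400);		/* now 0xde400 */

so free_memmap() should stop at 0xde400, and the memmap covering
[0xde400, 0xde700), including pfn 0xde600, should be kept.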
Kefeng Wang April 29, 2021, 10:22 a.m. UTC | #14
On 2021/4/29 14:57, Mike Rapoport wrote:

>>> Do you use SPARSEMEM? If yes, what is your section size?
>>> What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
>> Yes,
>>
>> CONFIG_SPARSEMEM=y
>>
>> CONFIG_SPARSEMEM_STATIC=y
>>
>> CONFIG_FORCE_MAX_ZONEORDER = 11
>>
>> CONFIG_PAGE_OFFSET=0xC0000000
>> CONFIG_HAVE_ARCH_PFN_VALID=y
>> CONFIG_HIGHMEM=y
>> #define SECTION_SIZE_BITS    26
>> #define MAX_PHYSADDR_BITS    32
>> #define MAX_PHYSMEM_BITS     32


With the patch,  the addr is aligned, but the panic still occurred,

new free memory log is below,

memblock_free: [0xaf430000-0xaf44ffff] mem_init+0x158/0x23c

memblock_free: [0xaf510000-0xaf53ffff] mem_init+0x158/0x23c
memblock_free: [0xaf560000-0xaf57ffff] mem_init+0x158/0x23c
memblock_free: [0xafd98000-0xafdc7fff] mem_init+0x158/0x23c
memblock_free: [0xafdd8000-0xafdfffff] mem_init+0x158/0x23c
memblock_free: [0xafe18000-0xafe7ffff] mem_init+0x158/0x23c
memblock_free: [0xafee0000-0xafefffff] mem_init+0x158/0x23c
__free_memory_core, range: 0x80a03000 - 0x80a04000, pfn: 80a03 - 80a04
__free_memory_core, range: 0x80a08000 - 0x80b00000, pfn: 80a08 - 80b00
__free_memory_core, range: 0x812e8058 - 0x83000000, pfn: 812e9 - 83000
__free_memory_core, range: 0x85000000 - 0x85600000, pfn: 85000 - 85600
__free_memory_core, range: 0x86a00000 - 0x87e00000, pfn: 86a00 - 87e00
__free_memory_core, range: 0x8bd00000 - 0x8c500000, pfn: 8bd00 - 8c500
__free_memory_core, range: 0x8e300000 - 0x8ed00000, pfn: 8e300 - 8ed00
__free_memory_core, range: 0x90d00000 - 0xaf2c0000, pfn: 90d00 - af2c0
__free_memory_core, range: 0xaf430000 - 0xaf450000, pfn: af430 - af450
__free_memory_core, range: 0xaf510000 - 0xaf540000, pfn: af510 - af540
__free_memory_core, range: 0xaf560000 - 0xaf580000, pfn: af560 - af580
__free_memory_core, range: 0xafd98000 - 0xafdc8000, pfn: afd98 - afdc8
__free_memory_core, range: 0xafdd8000 - 0xafe00000, pfn: afdd8 - afe00
__free_memory_core, range: 0xafe18000 - 0xafe80000, pfn: afe18 - afe80
__free_memory_core, range: 0xafee0000 - 0xaff00000, pfn: afee0 - aff00
__free_memory_core, range: 0xaff80000 - 0xaff8d000, pfn: aff80 - aff8d
__free_memory_core, range: 0xafff2000 - 0xafff4580, pfn: afff2 - afff4
__free_memory_core, range: 0xafffe000 - 0xafffe0e0, pfn: afffe - afffe
__free_memory_core, range: 0xafffe4fc - 0xafffe500, pfn: affff - afffe
__free_memory_core, range: 0xafffe6e4 - 0xafffe700, pfn: affff - afffe
__free_memory_core, range: 0xafffe8dc - 0xafffe8e0, pfn: affff - afffe
__free_memory_core, range: 0xafffe970 - 0xafffe980, pfn: affff - afffe
__free_memory_core, range: 0xafffe990 - 0xafffe9a0, pfn: affff - afffe
__free_memory_core, range: 0xafffe9a4 - 0xafffe9c0, pfn: affff - afffe
__free_memory_core, range: 0xafffeb54 - 0xafffeb60, pfn: affff - afffe
__free_memory_core, range: 0xafffecf4 - 0xafffed00, pfn: affff - afffe
__free_memory_core, range: 0xafffefc4 - 0xafffefd8, pfn: affff - afffe
__free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
__free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
__free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
__free_memory_core, range: 0xe0800000 - 0xe0c00000, pfn: e0800 - b0200
__free_memory_core, range: 0xf4b00000 - 0xf7000000, pfn: f4b00 - b0200
__free_memory_core, range: 0xfda00000 - 0xffffffff, pfn: fda00 - b0200
> It seems that with SPARSEMEM we don't align the freed parts on pageblock
> boundaries.
>
> Can you try the patch below:
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index afaefa8fc6ab..1926369b52ec 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
>   		 * due to SPARSEMEM sections which aren't present.
>   		 */
>   		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
> -#else
> +#endif
>   		/*
>   		 * Align down here since the VM subsystem insists that the
>   		 * memmap entries are valid from the bank start aligned to
>   		 * MAX_ORDER_NR_PAGES.
>   		 */
>   		start = round_down(start, MAX_ORDER_NR_PAGES);
> -#endif
>   
>   		/*
>   		 * If we had a previous bank, and there is a space
>   
>
Mike Rapoport April 30, 2021, 9:51 a.m. UTC | #15
On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/29 14:57, Mike Rapoport wrote:
> 
> > > > Do you use SPARSEMEM? If yes, what is your section size?
> > > > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
> > > Yes,
> > > 
> > > CONFIG_SPARSEMEM=y
> > > 
> > > CONFIG_SPARSEMEM_STATIC=y
> > > 
> > > CONFIG_FORCE_MAX_ZONEORDER = 11
> > > 
> > > CONFIG_PAGE_OFFSET=0xC0000000
> > > CONFIG_HAVE_ARCH_PFN_VALID=y
> > > CONFIG_HIGHMEM=y
> > > #define SECTION_SIZE_BITS    26
> > > #define MAX_PHYSADDR_BITS    32
> > > #define MAX_PHYSMEM_BITS     32
> 
> 
> With the patch,  the addr is aligned, but the panic still occurred,

Is this the same panic at move_freepages() for range [de600, de7ff]?

Do you enable CONFIG_ARM_LPAE?

> new free memory log is below,
> 
> memblock_free: [0xaf430000-0xaf44ffff] mem_init+0x158/0x23c
> 
> memblock_free: [0xaf510000-0xaf53ffff] mem_init+0x158/0x23c
> memblock_free: [0xaf560000-0xaf57ffff] mem_init+0x158/0x23c
> memblock_free: [0xafd98000-0xafdc7fff] mem_init+0x158/0x23c
> memblock_free: [0xafdd8000-0xafdfffff] mem_init+0x158/0x23c
> memblock_free: [0xafe18000-0xafe7ffff] mem_init+0x158/0x23c
> memblock_free: [0xafee0000-0xafefffff] mem_init+0x158/0x23c
> __free_memory_core, range: 0x80a03000 - 0x80a04000, pfn: 80a03 - 80a04
> __free_memory_core, range: 0x80a08000 - 0x80b00000, pfn: 80a08 - 80b00
> __free_memory_core, range: 0x812e8058 - 0x83000000, pfn: 812e9 - 83000
> __free_memory_core, range: 0x85000000 - 0x85600000, pfn: 85000 - 85600
> __free_memory_core, range: 0x86a00000 - 0x87e00000, pfn: 86a00 - 87e00
> __free_memory_core, range: 0x8bd00000 - 0x8c500000, pfn: 8bd00 - 8c500
> __free_memory_core, range: 0x8e300000 - 0x8ed00000, pfn: 8e300 - 8ed00
> __free_memory_core, range: 0x90d00000 - 0xaf2c0000, pfn: 90d00 - af2c0
> __free_memory_core, range: 0xaf430000 - 0xaf450000, pfn: af430 - af450
> __free_memory_core, range: 0xaf510000 - 0xaf540000, pfn: af510 - af540
> __free_memory_core, range: 0xaf560000 - 0xaf580000, pfn: af560 - af580
> __free_memory_core, range: 0xafd98000 - 0xafdc8000, pfn: afd98 - afdc8
> __free_memory_core, range: 0xafdd8000 - 0xafe00000, pfn: afdd8 - afe00
> __free_memory_core, range: 0xafe18000 - 0xafe80000, pfn: afe18 - afe80
> __free_memory_core, range: 0xafee0000 - 0xaff00000, pfn: afee0 - aff00
> __free_memory_core, range: 0xaff80000 - 0xaff8d000, pfn: aff80 - aff8d
> __free_memory_core, range: 0xafff2000 - 0xafff4580, pfn: afff2 - afff4
> __free_memory_core, range: 0xafffe000 - 0xafffe0e0, pfn: afffe - afffe
> __free_memory_core, range: 0xafffe4fc - 0xafffe500, pfn: affff - afffe
> __free_memory_core, range: 0xafffe6e4 - 0xafffe700, pfn: affff - afffe
> __free_memory_core, range: 0xafffe8dc - 0xafffe8e0, pfn: affff - afffe
> __free_memory_core, range: 0xafffe970 - 0xafffe980, pfn: affff - afffe
> __free_memory_core, range: 0xafffe990 - 0xafffe9a0, pfn: affff - afffe
> __free_memory_core, range: 0xafffe9a4 - 0xafffe9c0, pfn: affff - afffe
> __free_memory_core, range: 0xafffeb54 - 0xafffeb60, pfn: affff - afffe
> __free_memory_core, range: 0xafffecf4 - 0xafffed00, pfn: affff - afffe
> __free_memory_core, range: 0xafffefc4 - 0xafffefd8, pfn: affff - afffe
> __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
> __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
> __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200

The range [de600, de7ff] is not freed here.

> __free_memory_core, range: 0xe0800000 - 0xe0c00000, pfn: e0800 - b0200
> __free_memory_core, range: 0xf4b00000 - 0xf7000000, pfn: f4b00 - b0200
> __free_memory_core, range: 0xfda00000 - 0xffffffff, pfn: fda00 - b0200
> > It seems that with SPARSEMEM we don't align the freed parts on pageblock
> > boundaries.
> > 
> > Can you try the patch below:
> > 
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index afaefa8fc6ab..1926369b52ec 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
> >   		 * due to SPARSEMEM sections which aren't present.
> >   		 */
> >   		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
> > -#else
> > +#endif
> >   		/*
> >   		 * Align down here since the VM subsystem insists that the
> >   		 * memmap entries are valid from the bank start aligned to
> >   		 * MAX_ORDER_NR_PAGES.
> >   		 */
> >   		start = round_down(start, MAX_ORDER_NR_PAGES);
> > -#endif
> >   		/*
> >   		 * If we had a previous bank, and there is a space
> >
Kefeng Wang April 30, 2021, 11:24 a.m. UTC | #16
On 2021/4/30 17:51, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
>>
>> On 2021/4/29 14:57, Mike Rapoport wrote:
>>
>>>>> Do you use SPARSEMEM? If yes, what is your section size?
>>>>> What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
>>>> Yes,
>>>>
>>>> CONFIG_SPARSEMEM=y
>>>>
>>>> CONFIG_SPARSEMEM_STATIC=y
>>>>
>>>> CONFIG_FORCE_MAX_ZONEORDER = 11
>>>>
>>>> CONFIG_PAGE_OFFSET=0xC0000000
>>>> CONFIG_HAVE_ARCH_PFN_VALID=y
>>>> CONFIG_HIGHMEM=y
>>>> #define SECTION_SIZE_BITS    26
>>>> #define MAX_PHYSADDR_BITS    32
>>>> #define MAX_PHYSMEM_BITS     32
>>
>>
>> With the patch,  the addr is aligned, but the panic still occurred,
> 
> Is this the same panic at move_freepages() for range [de600, de7ff]?
> 
> Do you enable CONFIG_ARM_LPAE?

no, the CONFIG_ARM_LPAE is not set, and yes with same panic at 
move_freepages at

start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000] :  pfn =de600, 
page =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000



> 
>> new free memory log is below,
>>
>> memblock_free: [0xaf430000-0xaf44ffff] mem_init+0x158/0x23c
>>
>> memblock_free: [0xaf510000-0xaf53ffff] mem_init+0x158/0x23c
>> memblock_free: [0xaf560000-0xaf57ffff] mem_init+0x158/0x23c
>> memblock_free: [0xafd98000-0xafdc7fff] mem_init+0x158/0x23c
>> memblock_free: [0xafdd8000-0xafdfffff] mem_init+0x158/0x23c
>> memblock_free: [0xafe18000-0xafe7ffff] mem_init+0x158/0x23c
>> memblock_free: [0xafee0000-0xafefffff] mem_init+0x158/0x23c
>> __free_memory_core, range: 0x80a03000 - 0x80a04000, pfn: 80a03 - 80a04
>> __free_memory_core, range: 0x80a08000 - 0x80b00000, pfn: 80a08 - 80b00
>> __free_memory_core, range: 0x812e8058 - 0x83000000, pfn: 812e9 - 83000
>> __free_memory_core, range: 0x85000000 - 0x85600000, pfn: 85000 - 85600
>> __free_memory_core, range: 0x86a00000 - 0x87e00000, pfn: 86a00 - 87e00
>> __free_memory_core, range: 0x8bd00000 - 0x8c500000, pfn: 8bd00 - 8c500
>> __free_memory_core, range: 0x8e300000 - 0x8ed00000, pfn: 8e300 - 8ed00
>> __free_memory_core, range: 0x90d00000 - 0xaf2c0000, pfn: 90d00 - af2c0
>> __free_memory_core, range: 0xaf430000 - 0xaf450000, pfn: af430 - af450
>> __free_memory_core, range: 0xaf510000 - 0xaf540000, pfn: af510 - af540
>> __free_memory_core, range: 0xaf560000 - 0xaf580000, pfn: af560 - af580
>> __free_memory_core, range: 0xafd98000 - 0xafdc8000, pfn: afd98 - afdc8
>> __free_memory_core, range: 0xafdd8000 - 0xafe00000, pfn: afdd8 - afe00
>> __free_memory_core, range: 0xafe18000 - 0xafe80000, pfn: afe18 - afe80
>> __free_memory_core, range: 0xafee0000 - 0xaff00000, pfn: afee0 - aff00
>> __free_memory_core, range: 0xaff80000 - 0xaff8d000, pfn: aff80 - aff8d
>> __free_memory_core, range: 0xafff2000 - 0xafff4580, pfn: afff2 - afff4
>> __free_memory_core, range: 0xafffe000 - 0xafffe0e0, pfn: afffe - afffe
>> __free_memory_core, range: 0xafffe4fc - 0xafffe500, pfn: affff - afffe
>> __free_memory_core, range: 0xafffe6e4 - 0xafffe700, pfn: affff - afffe
>> __free_memory_core, range: 0xafffe8dc - 0xafffe8e0, pfn: affff - afffe
>> __free_memory_core, range: 0xafffe970 - 0xafffe980, pfn: affff - afffe
>> __free_memory_core, range: 0xafffe990 - 0xafffe9a0, pfn: affff - afffe
>> __free_memory_core, range: 0xafffe9a4 - 0xafffe9c0, pfn: affff - afffe
>> __free_memory_core, range: 0xafffeb54 - 0xafffeb60, pfn: affff - afffe
>> __free_memory_core, range: 0xafffecf4 - 0xafffed00, pfn: affff - afffe
>> __free_memory_core, range: 0xafffefc4 - 0xafffefd8, pfn: affff - afffe
>> __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
>> __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
>> __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
> 
> The range [de600, de7ff] is not freed here.
the __free_memory_core will check the start pfn and end pfn,

  if (start_pfn >= end_pfn)
          return 0;

  __free_pages_memory(start_pfn, end_pfn);
so the memory will not be freed to buddy, confused...
> 
>> __free_memory_core, range: 0xe0800000 - 0xe0c00000, pfn: e0800 - b0200
>> __free_memory_core, range: 0xf4b00000 - 0xf7000000, pfn: f4b00 - b0200
>> __free_memory_core, range: 0xfda00000 - 0xffffffff, pfn: fda00 - b0200
>>> It seems that with SPARSEMEM we don't align the freed parts on pageblock
>>> boundaries.
>>>
>>> Can you try the patch below:
>>>
>>> diff --git a/mm/memblock.c b/mm/memblock.c
>>> index afaefa8fc6ab..1926369b52ec 100644
>>> --- a/mm/memblock.c
>>> +++ b/mm/memblock.c
>>> @@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
>>>    		 * due to SPARSEMEM sections which aren't present.
>>>    		 */
>>>    		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
>>> -#else
>>> +#endif
>>>    		/*
>>>    		 * Align down here since the VM subsystem insists that the
>>>    		 * memmap entries are valid from the bank start aligned to
>>>    		 * MAX_ORDER_NR_PAGES.
>>>    		 */
>>>    		start = round_down(start, MAX_ORDER_NR_PAGES);
>>> -#endif
>>>    		/*
>>>    		 * If we had a previous bank, and there is a space
>>>
>
Mike Rapoport May 3, 2021, 6:26 a.m. UTC | #17
On Fri, Apr 30, 2021 at 07:24:37PM +0800, Kefeng Wang wrote:
> 
> 
> On 2021/4/30 17:51, Mike Rapoport wrote:
> > On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
> > > 
> > > On 2021/4/29 14:57, Mike Rapoport wrote:
> > > 
> > > > > > Do you use SPARSEMEM? If yes, what is your section size?
> > > > > > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
> > > > > Yes,
> > > > > 
> > > > > CONFIG_SPARSEMEM=y
> > > > > 
> > > > > CONFIG_SPARSEMEM_STATIC=y
> > > > > 
> > > > > CONFIG_FORCE_MAX_ZONEORDER = 11
> > > > > 
> > > > > CONFIG_PAGE_OFFSET=0xC0000000
> > > > > CONFIG_HAVE_ARCH_PFN_VALID=y
> > > > > CONFIG_HIGHMEM=y
> > > > > #define SECTION_SIZE_BITS    26
> > > > > #define MAX_PHYSADDR_BITS    32
> > > > > #define MAX_PHYSMEM_BITS     32
> > > 
> > > 
> > > With the patch,  the addr is aligned, but the panic still occurred,
> > 
> > Is this the same panic at move_freepages() for range [de600, de7ff]?
> > 
> > Do you enable CONFIG_ARM_LPAE?
> 
> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> move_freepages at
> 
> start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000] :  pfn =de600, page
> =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
> 
> > > __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
> > > __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
> > > __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200

Hmm, [de600, de7ff] is not added to the free lists which is correct. But
then it's unclear how the page for de600 gets to move_freepages()...

Can't say I have any bright ideas to try here...

> the __free_memory_core will check the start pfn and end pfn,
> 
>  if (start_pfn >= end_pfn)
>          return 0;
> 
>  __free_pages_memory(start_pfn, end_pfn);
> so the memory will not be freed to buddy, confused...

It's a check for range validity; all valid ranges are added.

> > > __free_memory_core, range: 0xe0800000 - 0xe0c00000, pfn: e0800 - b0200
> > > __free_memory_core, range: 0xf4b00000 - 0xf7000000, pfn: f4b00 - b0200
> > > __free_memory_core, range: 0xfda00000 - 0xffffffff, pfn: fda00 - b0200
> > > > It seems that with SPARSEMEM we don't align the freed parts on pageblock
> > > > boundaries.
> > > > 
> > > > Can you try the patch below:
> > > > 
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index afaefa8fc6ab..1926369b52ec 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
> > > >    		 * due to SPARSEMEM sections which aren't present.
> > > >    		 */
> > > >    		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
> > > > -#else
> > > > +#endif
> > > >    		/*
> > > >    		 * Align down here since the VM subsystem insists that the
> > > >    		 * memmap entries are valid from the bank start aligned to
> > > >    		 * MAX_ORDER_NR_PAGES.
> > > >    		 */
> > > >    		start = round_down(start, MAX_ORDER_NR_PAGES);
> > > > -#endif
> > > >    		/*
> > > >    		 * If we had a previous bank, and there is a space
> > > > 
> >
David Hildenbrand May 3, 2021, 8:07 a.m. UTC | #18
On 03.05.21 08:26, Mike Rapoport wrote:
> On Fri, Apr 30, 2021 at 07:24:37PM +0800, Kefeng Wang wrote:
>>
>>
>> On 2021/4/30 17:51, Mike Rapoport wrote:
>>> On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
>>>>
>>>> On 2021/4/29 14:57, Mike Rapoport wrote:
>>>>
>>>>>>> Do you use SPARSEMEM? If yes, what is your section size?
>>>>>>> What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
>>>>>> Yes,
>>>>>>
>>>>>> CONFIG_SPARSEMEM=y
>>>>>>
>>>>>> CONFIG_SPARSEMEM_STATIC=y
>>>>>>
>>>>>> CONFIG_FORCE_MAX_ZONEORDER = 11
>>>>>>
>>>>>> CONFIG_PAGE_OFFSET=0xC0000000
>>>>>> CONFIG_HAVE_ARCH_PFN_VALID=y
>>>>>> CONFIG_HIGHMEM=y
>>>>>> #define SECTION_SIZE_BITS    26
>>>>>> #define MAX_PHYSADDR_BITS    32
>>>>>> #define MAX_PHYSMEM_BITS     32
>>>>
>>>>
>>>> With the patch,  the addr is aligned, but the panic still occurred,
>>>
>>> Is this the same panic at move_freepages() for range [de600, de7ff]?
>>>
>>> Do you enable CONFIG_ARM_LPAE?
>>
>> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
>> move_freepages at
>>
>> start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000] :  pfn =de600, page
>> =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
>>
>>>> __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
>>>> __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
>>>> __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
> 
> Hmm, [de600, de7ff] is not added to the free lists which is correct. But
> then it's unclear how the page for de600 gets to move_freepages()...
> 
> Can't say I have any bright ideas to try here...

Are we missing some checks (e.g., PageReserved()) that 
pfn_valid_within() would have "caught" before?
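
(For reference, without HOLES_IN_ZONE that check compiles away entirely; a
sketch of the 5.10-era definition in include/linux/mmzone.h:

	#ifdef CONFIG_HOLES_IN_ZONE
	#define pfn_valid_within(pfn)	pfn_valid(pfn)
	#else
	#define pfn_valid_within(pfn)	(1)
	#endif

so with it gone, callers touch the struct page for every pfn in the block
unconditionally.)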
Mike Rapoport May 3, 2021, 8:44 a.m. UTC | #19
On Mon, May 03, 2021 at 10:07:01AM +0200, David Hildenbrand wrote:
> On 03.05.21 08:26, Mike Rapoport wrote:
> > On Fri, Apr 30, 2021 at 07:24:37PM +0800, Kefeng Wang wrote:
> > > 
> > > 
> > > On 2021/4/30 17:51, Mike Rapoport wrote:
> > > > On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
> > > > > 
> > > > > On 2021/4/29 14:57, Mike Rapoport wrote:
> > > > > 
> > > > > > > > Do you use SPARSEMEM? If yes, what is your section size?
> > > > > > > > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
> > > > > > > Yes,
> > > > > > > 
> > > > > > > CONFIG_SPARSEMEM=y
> > > > > > > 
> > > > > > > CONFIG_SPARSEMEM_STATIC=y
> > > > > > > 
> > > > > > > CONFIG_FORCE_MAX_ZONEORDER = 11
> > > > > > > 
> > > > > > > CONFIG_PAGE_OFFSET=0xC0000000
> > > > > > > CONFIG_HAVE_ARCH_PFN_VALID=y
> > > > > > > CONFIG_HIGHMEM=y
> > > > > > > #define SECTION_SIZE_BITS    26
> > > > > > > #define MAX_PHYSADDR_BITS    32
> > > > > > > #define MAX_PHYSMEM_BITS     32
> > > > > 
> > > > > 
> > > > > With the patch,  the addr is aligned, but the panic still occurred,
> > > > 
> > > > Is this the same panic at move_freepages() for range [de600, de7ff]?
> > > > 
> > > > Do you enable CONFIG_ARM_LPAE?
> > > 
> > > no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> > > move_freepages at
> > > 
> > > start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000] :  pfn =de600, page
> > > =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
> > > 
> > > > > __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
> > > > > __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
> > > > > __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
> > 
> > Hmm, [de600, de7ff] is not added to the free lists which is correct. But
> > then it's unclear how the page for de600 gets to move_freepages()...
> > 
> > Can't say I have any bright ideas to try here...
> 
> Are we missing some checks (e.g., PageReserved()) that pfn_valid_within()
> would have "caught" before?

Unless I'm missing something the crash happens in __rmqueue_fallback():

do_steal:
	page = get_page_from_free_area(area, fallback_mt);

	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
								can_steal);
		-> move_freepages() 
			-> BUG()

So a page from the free area should be sane, as the freed range was never
added to the free lists.
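
To make the failure mode concrete, here is a rough paraphrase of the loop
that faults once pfn_valid_within() is hardwired to 1 (a sketch of 5.10's
move_freepages(), not the verbatim code):

	for (page = start_page; page <= end_page;) {
		if (!pfn_valid_within(page_to_pfn(page))) {
			/* never taken without HOLES_IN_ZONE */
			page++;
			continue;
		}
		if (!PageBuddy(page)) {
			/*
			 * PageLRU() goes through compound_head(); on stale
			 * memmap it reads compound_head = 0xffffffff, derives
			 * a "head page" of 0xfffffffe, and faults when that
			 * bogus pointer is dereferenced.
			 */
			if (num_movable &&
			    (PageLRU(page) || __PageMovable(page)))
				(*num_movable)++;
			page++;
			continue;
		}
		/* ... the PageBuddy case actually moves the block ... */
	}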

And honestly, with the memory layout reported elsewhere in the stack I'd
say that the bootloader/fdt beg for fixes...
Kefeng Wang May 6, 2021, 12:47 p.m. UTC | #20
On 2021/5/3 16:44, Mike Rapoport wrote:
> On Mon, May 03, 2021 at 10:07:01AM +0200, David Hildenbrand wrote:
>> On 03.05.21 08:26, Mike Rapoport wrote:
>>> On Fri, Apr 30, 2021 at 07:24:37PM +0800, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2021/4/30 17:51, Mike Rapoport wrote:
>>>>> On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
>>>>>>
>>>>>> On 2021/4/29 14:57, Mike Rapoport wrote:
>>>>>>
>>>>>>>>> Do you use SPARSEMEM? If yes, what is your section size?
>>>>>>>>> What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
>>>>>>>> Yes,
>>>>>>>>
>>>>>>>> CONFIG_SPARSEMEM=y
>>>>>>>>
>>>>>>>> CONFIG_SPARSEMEM_STATIC=y
>>>>>>>>
>>>>>>>> CONFIG_FORCE_MAX_ZONEORDER = 11
>>>>>>>>
>>>>>>>> CONFIG_PAGE_OFFSET=0xC0000000
>>>>>>>> CONFIG_HAVE_ARCH_PFN_VALID=y
>>>>>>>> CONFIG_HIGHMEM=y
>>>>>>>> #define SECTION_SIZE_BITS    26
>>>>>>>> #define MAX_PHYSADDR_BITS    32
>>>>>>>> #define MAX_PHYSMEM_BITS     32
>>>>>>
>>>>>>
>>>>>> With the patch,  the addr is aligned, but the panic still occurred,
>>>>>
>>>>> Is this the same panic at move_freepages() for range [de600, de7ff]?
>>>>>
>>>>> Do you enable CONFIG_ARM_LPAE?
>>>>
>>>> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
>>>> move_freepages at
>>>>
>>>> start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000] :  pfn =de600, page
>>>> =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
>>>>
>>>>>> __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
>>>>>> __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
>>>>>> __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
>>>
>>> Hmm, [de600, de7ff] is not added to the free lists which is correct. But
>>> then it's unclear how the page for de600 gets to move_freepages()...
>>>
>>> Can't say I have any bright ideas to try here...
>>
>> Are we missing some checks (e.g., PageReserved()) that pfn_valid_within()
>> would have "caught" before?
> 
> Unless I'm missing something the crash happens in __rmqueue_fallback():
> 
> do_steal:
> 	page = get_page_from_free_area(area, fallback_mt);
> 
> 	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
> 								can_steal);
> 		-> move_freepages()
> 			-> BUG()
> 
> So a page from the free area should be sane, as the freed range was never
> added to the free lists.

Sorry for the late response due to the vacation.

The pfn in range [de600, de7ff] won't be added into the free lists via 
__free_memory_core(), but the pfn could be added into freelists via 
free_highmem_page()
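
free_highmem_page() hands a reserved page straight to the buddy; roughly
(paraphrasing the 5.10 code, not verbatim):

	void free_highmem_page(struct page *page)
	{
		__free_reserved_page(page);	/* ClearPageReserved + __free_page */
		totalram_pages_inc();
		atomic_long_inc(&page_zone(page)->managed_pages);
		totalhigh_pages_inc();
	}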

I add some debug[1] in add_to_free_list(), we could see the calltrace

free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
add_to_free_list, ===> pfn = de700
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
pfn = de700
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
Hardware name: Hisilicon A9
[<c010a600>] (show_stack) from [<c04b21c4>] (dump_stack+0x9c/0xc0)
[<c04b21c4>] (dump_stack) from [<c011c708>] (__warn+0xc0/0xec)
[<c011c708>] (__warn) from [<c011c7a8>] (warn_slowpath_fmt+0x74/0xa4)
[<c011c7a8>] (warn_slowpath_fmt) from [<c023721c>] 
(add_to_free_list+0x8c/0xec)
[<c023721c>] (add_to_free_list) from [<c0237e00>] 
(free_pcppages_bulk+0x200/0x278)
[<c0237e00>] (free_pcppages_bulk) from [<c0238d14>] 
(free_unref_page+0x58/0x68)
[<c0238d14>] (free_unref_page) from [<c023bb54>] 
(free_highmem_page+0xc/0x50)
[<c023bb54>] (free_highmem_page) from [<c070620c>] (mem_init+0x21c/0x254)
[<c070620c>] (mem_init) from [<c0700b38>] (start_kernel+0x258/0x5c0)
[<c0700b38>] (start_kernel) from [<00000000>] (0x0)

so any idea?

[1] debug
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 1ba9f9f9dbd8..ee3619c04f93 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -286,7 +286,7 @@ static void __init free_highpages(void)
                 /* Truncate partial highmem entries */
                 if (start < max_low)
                         start = max_low;
-
+               pr_info("%s, range_pfn [%lx, %lx], range_addr [%x, %x]\n", __func__, start, end, range_start, range_end);
                 for (; start < end; start++)
                         free_highmem_page(pfn_to_page(start));

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 592479f43c74..920f041f0c6f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -892,7 +892,14 @@ compaction_capture(struct capture_control *capc, struct page *page,
  static inline void add_to_free_list(struct page *page, struct zone *zone,
                                     unsigned int order, int migratetype)
  {
+       unsigned long pfn;
         struct free_area *area = &zone->free_area[order];
+       pfn = page_to_pfn(page);
+       if (pfn >= 0xde600 && pfn < 0xde7ff) {
+               pr_info("%s, ===> pfn = %lx", __func__, pfn);
+               WARN_ONCE(pfn == 0xde700, "pfn = %lx", pfn);
+       }



> 
> And honestly, with the memory layout reported elsewhere in the stack I'd
> say that the bootloader/fdt beg for fixes...
>
Kefeng Wang May 7, 2021, 7:17 a.m. UTC | #21
On 2021/5/6 20:47, Kefeng Wang wrote:
> 
> 
>>>>> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
>>>>> move_freepages at
>>>>>
>>>>> start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000] :  pfn 
>>>>> =de600, page
>>>>> =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
>>>>>
>>>>>>> __free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - 
>>>>>>> b0200
>>>>>>> __free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - 
>>>>>>> b0200
>>>>>>> __free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - 
>>>>>>> b0200
>>>>
>>>> Hmm, [de600, de7ff] is not added to the free lists which is correct. 
>>>> But
>>>> then it's unclear how the page for de600 gets to move_freepages()...
>>>>
>>>> Can't say I have any bright ideas to try here...
>>>
>>> Are we missing some checks (e.g., PageReserved()) that 
>>> pfn_valid_within()
>>> would have "caught" before?
>>
>> Unless I'm missing something the crash happens in __rmqueue_fallback():
>>
>> do_steal:
>>     page = get_page_from_free_area(area, fallback_mt);
>>
>>     steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
>>                                 can_steal);
>>         -> move_freepages()
>>             -> BUG()
>>
>> So a page from the free area should be sane, as the freed range was never
>> added to the free lists.
> 
> Sorry for the late response due to the vacation.
> 
> The pfn in range [de600, de7ff] won't be added into the free lists via 
> __free_memory_core(), but the pfn could be added into freelists via 
> free_highmem_page()
> 
> I add some debug[1] in add_to_free_list(), we could see the calltrace
> 
> free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
> free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
> free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
> add_to_free_list, ===> pfn = de700
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
> pfn = de700
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
> Hardware name: Hisilicon A9
> [<c010a600>] (show_stack) from [<c04b21c4>] (dump_stack+0x9c/0xc0)
> [<c04b21c4>] (dump_stack) from [<c011c708>] (__warn+0xc0/0xec)
> [<c011c708>] (__warn) from [<c011c7a8>] (warn_slowpath_fmt+0x74/0xa4)
> [<c011c7a8>] (warn_slowpath_fmt) from [<c023721c>] 
> (add_to_free_list+0x8c/0xec)
> [<c023721c>] (add_to_free_list) from [<c0237e00>] 
> (free_pcppages_bulk+0x200/0x278)
> [<c0237e00>] (free_pcppages_bulk) from [<c0238d14>] 
> (free_unref_page+0x58/0x68)
> [<c0238d14>] (free_unref_page) from [<c023bb54>] 
> (free_highmem_page+0xc/0x50)
> [<c023bb54>] (free_highmem_page) from [<c070620c>] (mem_init+0x21c/0x254)
> [<c070620c>] (mem_init) from [<c0700b38>] (start_kernel+0x258/0x5c0)
> [<c0700b38>] (start_kernel) from [<00000000>] (0x0)
> 
> so any idea?

If pfn = 0xde700, then with pageblock_nr_pages = 0x200 the start_pfn/end_pfn
passed to move_freepages() will be [de600, de7ff], but the range [de600, de700]
has no 'struct page', which leads to this panic when pfn_valid_within() is
compiled out (no HOLES_IN_ZONE). The same issue can occur in
isolate_freepages_block().
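
To spell out the rounding that produces that range (a sketch of what
move_freepages_block() does, with pageblock_nr_pages = 0x200 as above):

	start_pfn = pfn & ~(pageblock_nr_pages - 1);	/* 0xde700 -> 0xde600 */
	end_pfn = start_pfn + pageblock_nr_pages - 1;	/* 0xde7ff */

and [0xde600, 0xde6ff] lies in the hole below the bank, where the memmap
has been freed.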
Maybe there are other such scenes, so I selected HOLES_IN_ZONE in ARCH_HISI
(ARM) to solve this issue in our 5.10. Should we select HOLES_IN_ZONE in all
of ARM, or only in ARCH_HISI? Any better solution? Thanks.
Mike Rapoport May 7, 2021, 10:30 a.m. UTC | #22
On Fri, May 07, 2021 at 03:17:08PM +0800, Kefeng Wang wrote:
> 
> On 2021/5/6 20:47, Kefeng Wang wrote:
> > 
> > 
> > > > > > no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> > > > > > move_freepages at
> > > > > > 
> > > > > > start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000]
> > > > > > :  pfn =de600, page
> > > > > > =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
> > > > > > 
> > > > > > > > __free_memory_core, range: 0xb0200000 -
> > > > > > > > 0xc0000000, pfn: b0200 - b0200
> > > > > > > > __free_memory_core, range: 0xcc000000 -
> > > > > > > > 0xdca00000, pfn: cc000 - b0200
> > > > > > > > __free_memory_core, range: 0xde700000 -
> > > > > > > > 0xdea00000, pfn: de700 - b0200
> > > > > 
> > > > > Hmm, [de600, de7ff] is not added to the free lists which is
> > > > > correct. But
> > > > > then it's unclear how the page for de600 gets to move_freepages()...
> > > > > 
> > > > > Can't say I have any bright ideas to try here...
> > > > 
> > > > Are we missing some checks (e.g., PageReserved()) that
> > > > pfn_valid_within()
> > > > would have "caught" before?
> > > 
> > > Unless I'm missing something the crash happens in __rmqueue_fallback():
> > > 
> > > do_steal:
> > >     page = get_page_from_free_area(area, fallback_mt);
> > > 
> > >     steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
> > >                                 can_steal);
> > >         -> move_freepages()
> > >             -> BUG()
> > > 
> > > So a page from the free area should be sane, as the freed range was never
> > > added to the free lists.
> > 
> > Sorry for the late response due to the vacation.
> > 
> > The pfn in range [de600, de7ff] won't be added into the free lists via
> > __free_memory_core(), but the pfn could be added into freelists via
> > free_highmem_page()
> > 
> > I add some debug[1] in add_to_free_list(), we could see the calltrace
> > 
> > free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
> > free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
> > free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
> > add_to_free_list, ===> pfn = de700
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
> > pfn = de700
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
> > Hardware name: Hisilicon A9
> > [<c010a600>] (show_stack) from [<c04b21c4>] (dump_stack+0x9c/0xc0)
> > [<c04b21c4>] (dump_stack) from [<c011c708>] (__warn+0xc0/0xec)
> > [<c011c708>] (__warn) from [<c011c7a8>] (warn_slowpath_fmt+0x74/0xa4)
> > [<c011c7a8>] (warn_slowpath_fmt) from [<c023721c>]
> > (add_to_free_list+0x8c/0xec)
> > [<c023721c>] (add_to_free_list) from [<c0237e00>]
> > (free_pcppages_bulk+0x200/0x278)
> > [<c0237e00>] (free_pcppages_bulk) from [<c0238d14>]
> > (free_unref_page+0x58/0x68)
> > [<c0238d14>] (free_unref_page) from [<c023bb54>]
> > (free_highmem_page+0xc/0x50)
> > [<c023bb54>] (free_highmem_page) from [<c070620c>] (mem_init+0x21c/0x254)
> > [<c070620c>] (mem_init) from [<c0700b38>] (start_kernel+0x258/0x5c0)
> > [<c0700b38>] (start_kernel) from [<00000000>] (0x0)
> > 
> > so any idea?
> 
> If pfn = 0xde700, then with pageblock_nr_pages = 0x200 the start_pfn/end_pfn
> passed to move_freepages() will be [de600, de7ff], but the range [de600, de700]
> has no 'struct page', which leads to this panic when pfn_valid_within() is
> compiled out (no HOLES_IN_ZONE). The same issue can occur in
> isolate_freepages_block().

I think your analysis is correct except one minor detail. With the #ifdef
fix I've proposed earlier [1] the memmap for [0xde600, 0xde700] should not
be freed so there should be a struct page. Did you check what parts of the
memmap are actually freed with this patch applied?
Would you get a panic if you add

	dump_page(pfn_to_page(0xde600), "");

say, at the end of memblock_free_all()?

> Maybe there are other such scenes, so I selected HOLES_IN_ZONE in ARCH_HISI
> (ARM) to solve this issue in our 5.10. Should we select HOLES_IN_ZONE in all
> of ARM, or only in ARCH_HISI? Any better solution? Thanks.

I don't think that HOLES_IN_ZONE is the right solution. I believe that we
must keep the memory map aligned on pageblock boundaries. That's surely not the
case for SPARSEMEM as of now, and if my fix is not enough we need to find
where it went wrong.

Besides, I'd say that if it is possible to update your firmware to make the
memory layout reported to the kernel less, hmm, esoteric, you would hit
fewer corner cases.

[1] https://lore.kernel.org/lkml/YIpY8TXCSc7Lfa2Z@kernel.org
Kefeng Wang May 7, 2021, 12:34 p.m. UTC | #23
On 2021/5/7 18:30, Mike Rapoport wrote:
> On Fri, May 07, 2021 at 03:17:08PM +0800, Kefeng Wang wrote:
>>
>> On 2021/5/6 20:47, Kefeng Wang wrote:
>>>
>>>
>>>>>>> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
>>>>>>> move_freepages at
>>>>>>>
>>>>>>> start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000]
>>>>>>> :  pfn =de600, page
>>>>>>> =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
>>>>>>>
>>>>>>>>> __free_memory_core, range: 0xb0200000 -
>>>>>>>>> 0xc0000000, pfn: b0200 - b0200
>>>>>>>>> __free_memory_core, range: 0xcc000000 -
>>>>>>>>> 0xdca00000, pfn: cc000 - b0200
>>>>>>>>> __free_memory_core, range: 0xde700000 -
>>>>>>>>> 0xdea00000, pfn: de700 - b0200
>>>>>>
>>>>>> Hmm, [de600, de7ff] is not added to the free lists which is
>>>>>> correct. But
>>>>>> then it's unclear how the page for de600 gets to move_freepages()...
>>>>>>
>>>>>> Can't say I have any bright ideas to try here...
>>>>>
>>>>> Are we missing some checks (e.g., PageReserved()) that
>>>>> pfn_valid_within()
>>>>> would have "caught" before?
>>>>
>>>> Unless I'm missing something the crash happens in __rmqueue_fallback():
>>>>
>>>> do_steal:
>>>>      page = get_page_from_free_area(area, fallback_mt);
>>>>
>>>>      steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
>>>>                                  can_steal);
>>>>          -> move_freepages()
>>>>              -> BUG()
>>>>
>>>> So a page from the free area should be sane, as the freed range was never
>>>> added to the free lists.
>>>
>>> Sorry for the late response due to the vacation.
>>>
>>> The pfn in range [de600, de7ff] won't be added into the free lists via
>>> __free_memory_core(), but the pfn could be added into freelists via
>>> free_highmem_page()
>>>
>>> I add some debug[1] in add_to_free_list(), we could see the calltrace
>>>
>>> free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
>>> free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
>>> free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
>>> add_to_free_list, ===> pfn = de700
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
>>> pfn = de700
>>> Modules linked in:
>>> CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
>>> Hardware name: Hisilicon A9
>>> [<c010a600>] (show_stack) from [<c04b21c4>] (dump_stack+0x9c/0xc0)
>>> [<c04b21c4>] (dump_stack) from [<c011c708>] (__warn+0xc0/0xec)
>>> [<c011c708>] (__warn) from [<c011c7a8>] (warn_slowpath_fmt+0x74/0xa4)
>>> [<c011c7a8>] (warn_slowpath_fmt) from [<c023721c>]
>>> (add_to_free_list+0x8c/0xec)
>>> [<c023721c>] (add_to_free_list) from [<c0237e00>]
>>> (free_pcppages_bulk+0x200/0x278)
>>> [<c0237e00>] (free_pcppages_bulk) from [<c0238d14>]
>>> (free_unref_page+0x58/0x68)
>>> [<c0238d14>] (free_unref_page) from [<c023bb54>]
>>> (free_highmem_page+0xc/0x50)
>>> [<c023bb54>] (free_highmem_page) from [<c070620c>] (mem_init+0x21c/0x254)
>>> [<c070620c>] (mem_init) from [<c0700b38>] (start_kernel+0x258/0x5c0)
>>> [<c0700b38>] (start_kernel) from [<00000000>] (0x0)
>>>
>>> so any idea?
>>
>> If pfn = 0xde700, then with pageblock_nr_pages = 0x200 the start_pfn/end_pfn
>> passed to move_freepages() will be [de600, de7ff], but the range [de600, de700]
>> has no 'struct page', which leads to this panic when pfn_valid_within() is
>> compiled out (no HOLES_IN_ZONE). The same issue can occur in
>> isolate_freepages_block().
> 
> I think your analysis is correct except one minor detail. With the #ifdef
> fix I've proposed earlier [1] the memmap for [0xde600, 0xde700] should not
> be freed so there should be a struct page. Did you check what parts of the
> memmap are actually freed with this patch applied?
> Would you get a panic if you add
> 
> 	dump_page(pfn_to_page(0xde600), "");
> 
> say, at the end of memblock_free_all()?

The memory is not contiguous; see MEMBLOCK:
  memory size = 0x4c0fffff reserved size = 0x027ef058
  memory.cnt  = 0xa
  memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
  memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
  memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
  memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
  memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
  memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
  memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
...

The pfn range [0xde600, 0xde700] => addr range [0xde600000, 0xde700000]
is not available memory, and we won't create a memmap for it, so with or
without your patch we can't see the range in free_memmap(), right?

> 
>> Maybe there are other such scenes, so I selected HOLES_IN_ZONE in ARCH_HISI
>> (ARM) to solve this issue in our 5.10. Should we select HOLES_IN_ZONE in all
>> of ARM, or only in ARCH_HISI? Any better solution? Thanks.
> 
> I don't think that HOLES_IN_ZONE is the right solution. I believe that we
> must keep the memory map aligned on pageblock boundaries. That's surely not the
> case for SPARSEMEM as of now, and if my fix is not enough we need to find
> where it went wrong.
> 
> Besides, I'd say that if it is possible to update your firmware to make the
> memory layout reported to the kernel less, hmm, esoteric, you would hit
> fewer corner cases.

Sorry, the memory layout is customized and we can't change it; some memory
is reserved for special purposes by our product.
> 
> [1] https://lore.kernel.org/lkml/YIpY8TXCSc7Lfa2Z@kernel.org
>
Mike Rapoport May 9, 2021, 5:59 a.m. UTC | #24
On Fri, May 07, 2021 at 08:34:52PM +0800, Kefeng Wang wrote:
> 
> 
> On 2021/5/7 18:30, Mike Rapoport wrote:
> > On Fri, May 07, 2021 at 03:17:08PM +0800, Kefeng Wang wrote:
> > > 
> > > On 2021/5/6 20:47, Kefeng Wang wrote:
> > > > 
> > > > > > > > no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> > > > > > > > move_freepages at
> > > > > > > > 
> > > > > > > > start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000]
> > > > > > > > :  pfn =de600, page
> > > > > > > > =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
> > > > > > > > 
> > > > > > > > > > __free_memory_core, range: 0xb0200000 -
> > > > > > > > > > 0xc0000000, pfn: b0200 - b0200
> > > > > > > > > > __free_memory_core, range: 0xcc000000 -
> > > > > > > > > > 0xdca00000, pfn: cc000 - b0200
> > > > > > > > > > __free_memory_core, range: 0xde700000 -
> > > > > > > > > > 0xdea00000, pfn: de700 - b0200
> > > > > > > 
> > > > > > > Hmm, [de600, de7ff] is not added to the free lists which is
> > > > > > > correct. But
> > > > > > > then it's unclear how the page for de600 gets to move_freepages()...
> > > > > > > 
> > > > > > > Can't say I have any bright ideas to try here...
> > > > > > 
> > > > > > Are we missing some checks (e.g., PageReserved()) that
> > > > > > pfn_valid_within()
> > > > > > would have "caught" before?
> > > > > 
> > > > > Unless I'm missing something the crash happens in __rmqueue_fallback():
> > > > > 
> > > > > do_steal:
> > > > >      page = get_page_from_free_area(area, fallback_mt);
> > > > > 
> > > > >      steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
> > > > >                                  can_steal);
> > > > >          -> move_freepages()
> > > > >              -> BUG()
> > > > > 
> > > > > So a page from the free area should be sane, as the freed range was
> > > > > never added to the free lists.
> > > > 
> > > > Sorry for the late response due to vacation.
> > > > 
> > > > The pfns in range [de600, de7ff] won't be added to the free lists via
> > > > __free_memory_core(), but they could be added to the free lists via
> > > > free_highmem_page().
> > > > 
> > > > I added some debug [1] in add_to_free_list(), and we can see the calltrace
> > > > 
> > > > free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
> > > > free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
> > > > free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
> > > > add_to_free_list, ===> pfn = de700
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
> > > > pfn = de700
> > > > Modules linked in:
> > > > CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
> > > > Hardware name: Hisilicon A9
> > > > [<c010a600>] (show_stack) from [<c04b21c4>] (dump_stack+0x9c/0xc0)
> > > > [<c04b21c4>] (dump_stack) from [<c011c708>] (__warn+0xc0/0xec)
> > > > [<c011c708>] (__warn) from [<c011c7a8>] (warn_slowpath_fmt+0x74/0xa4)
> > > > [<c011c7a8>] (warn_slowpath_fmt) from [<c023721c>]
> > > > (add_to_free_list+0x8c/0xec)
> > > > [<c023721c>] (add_to_free_list) from [<c0237e00>]
> > > > (free_pcppages_bulk+0x200/0x278)
> > > > [<c0237e00>] (free_pcppages_bulk) from [<c0238d14>]
> > > > (free_unref_page+0x58/0x68)
> > > > [<c0238d14>] (free_unref_page) from [<c023bb54>]
> > > > (free_highmem_page+0xc/0x50)
> > > > [<c023bb54>] (free_highmem_page) from [<c070620c>] (mem_init+0x21c/0x254)
> > > > [<c070620c>] (mem_init) from [<c0700b38>] (start_kernel+0x258/0x5c0)
> > > > [<c0700b38>] (start_kernel) from [<00000000>] (0x0)
> > > > 
> > > > so any idea?
> > > 
> > > If pfn = 0xde700, then due to pageblock_nr_pages = 0x200 the
> > > start_pfn,end_pfn passed to move_freepages() will be [de600, de7ff],
> > > but the range [de600,de700] has no 'struct page' and will lead to
> > > this panic when pfn_valid_within() is not enabled (no HOLES_IN_ZONE);
> > > the same issue can occur in isolate_freepages_block(), maybe
> > 
> > I think your analysis is correct except one minor detail. With the #ifdef
> > fix I've proposed earlier [1], the memmap for [0xde600, 0xde700] should not
> > be freed so there should be a struct page. Did you check what parts of the
> > memmap are actually freed with this patch applied?
> > Would you get a panic if you add
> > 
> > 	dump_page(pfn_to_page(0xde600), "");
> > 
> > say, in the end of memblock_free_all()?
> 
> > The memory is not contiguous, see MEMBLOCK:
>  memory size = 0x4c0fffff reserved size = 0x027ef058
>  memory.cnt  = 0xa
>  memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
>  memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
>  memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
>  memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
>  memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
>  memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
>  memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
> ...
> 
> The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
> > is not available memory, and we won't create a memmap, so with or without
> your patch, we can't see the range in free_memmap(), right?
 

This is not available memory and we won't see the range in free_memmap(),
but we should still create a memmap for it, and that's what my patch tried
to do.

There are a lot of places in core mm that operate on pageblocks and
free_unused_memmap() should make sure that any pageblock has a valid memory
map.

Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
it.
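
To illustrate the assumption, here is a simplified sketch (not the actual
mm/page_alloc.c code) of a move_freepages()-style walk: it touches the
struct page of every pfn in a pageblock, so each memmap entry must exist
and be initialized:

	/* Sketch only: with pfn_valid_within() hardwired to 1 there is no
	 * per-pfn guard, so pfn_to_page() must be safe for every pfn in
	 * [start_pfn, end_pfn]. */
	static int move_freepages_sketch(unsigned long start_pfn,
					 unsigned long end_pfn)
	{
		unsigned long pfn;
		int moved = 0;

		for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
			struct page *page = pfn_to_page(pfn);

			if (!PageBuddy(page))	/* dereferences page->flags */
				continue;
			/* ... move the free chunk to the target list ... */
			moved++;
		}
		return moved;
	}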

Can you please send log with my patch applied and with the printing of
ranges that are freed in free_unused_memmap() you've used in previous
mails?
 
> > > there are some scenes like this, so I selected HOLES_IN_ZONE in ARCH_HISI(ARM)
> > > to solve this issue in our 5.10; should we select HOLES_IN_ZONE in all ARM or
> > > only in ARCH_HISI, or is there any better solution? Thanks.
> > 
> > I don't think that HOLES_IN_ZONE is the right solution. I believe that we
> > must keep the memory map aligned on pageblock boundaries. That's surely not the
> > case for SPARSEMEM as of now, and if my fix is not enough we need to find
> > where it went wrong.
> > 
> > Besides, I'd say that if it is possible to update your firmware to make the
> > memory layout reported to the kernel less, hmm, esoteric, you would hit
> > fewer corner cases.
> 
> Sorry, the memory layout is customized and we can't change it; some memory is
> used for special purposes by our product.
 
I understand that this memory cannot be used by Linux, but the firmware may
supply the kernel with the actual physical memory layout and then mark all
the special-purpose memory that the kernel should not touch as reserved.
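
For instance, something along these lines in the early boot code (a sketch
with made-up addresses; memblock_add() and memblock_reserve() are the
existing memblock API):

	/* Register the whole bank, then carve out the firmware-only part
	 * as reserved: the kernel keeps a memmap for it but never touches
	 * it. The addresses below are hypothetical. */
	memblock_add(0xde600000, 0x00400000);		/* full bank */
	memblock_reserve(0xde600000, 0x00100000);	/* special-purpose part */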

> > [1] https://lore.kernel.org/lkml/YIpY8TXCSc7Lfa2Z@kernel.org
> >
Kefeng Wang May 10, 2021, 3:10 a.m. UTC | #25
On 2021/5/9 13:59, Mike Rapoport wrote:
> On Fri, May 07, 2021 at 08:34:52PM +0800, Kefeng Wang wrote:
>>
>>
>> On 2021/5/7 18:30, Mike Rapoport wrote:
>>> On Fri, May 07, 2021 at 03:17:08PM +0800, Kefeng Wang wrote:
>>>>
>>>> On 2021/5/6 20:47, Kefeng Wang wrote:
>>>>>
>>>>>>>>> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
>>>>>>>>> move_freepages at
>>>>>>>>>
>>>>>>>>> start_pfn/end_pfn [de600, de7ff], [de600000, de7ff000]
>>>>>>>>> :  pfn =de600, page
>>>>>>>>> =ef3cc000, page-flags = ffffffff,  pfn2phy = de600000
>>>>>>>>>
>>>>>>>>>>> __free_memory_core, range: 0xb0200000 -
>>>>>>>>>>> 0xc0000000, pfn: b0200 - b0200
>>>>>>>>>>> __free_memory_core, range: 0xcc000000 -
>>>>>>>>>>> 0xdca00000, pfn: cc000 - b0200
>>>>>>>>>>> __free_memory_core, range: 0xde700000 -
>>>>>>>>>>> 0xdea00000, pfn: de700 - b0200
>>>>>>>>
>>>>>>>> Hmm, [de600, de7ff] is not added to the free lists which is
>>>>>>>> correct. But
>>>>>>>> then it's unclear how the page for de600 gets to move_freepages()...
>>>>>>>>
>>>>>>>> Can't say I have any bright ideas to try here...
>>>>>>>
>>>>>>> Are we missing some checks (e.g., PageReserved()) that
>>>>>>> pfn_valid_within()
>>>>>>> would have "caught" before?
>>>>>>
>>>>>> Unless I'm missing something the crash happens in __rmqueue_fallback():
>>>>>>
>>>>>> do_steal:
>>>>>>       page = get_page_from_free_area(area, fallback_mt);
>>>>>>
>>>>>>       steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
>>>>>>                                   can_steal);
>>>>>>           -> move_freepages()
>>>>>>               -> BUG()
>>>>>>
>>>>>> So a page from the free area should be sane, as the freed range was
>>>>>> never added to the free lists.
>>>>>
>>>>> Sorry for the late response due to vacation.
>>>>>
>>>>> The pfns in range [de600, de7ff] won't be added to the free lists via
>>>>> __free_memory_core(), but they could be added to the free lists via
>>>>> free_highmem_page().
>>>>>
>>>>> I added some debug [1] in add_to_free_list(), and we can see the calltrace
>>>>>
>>>>> free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
>>>>> free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
>>>>> free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
>>>>> add_to_free_list, ===> pfn = de700
>>>>> ------------[ cut here ]------------
>>>>> WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
>>>>> pfn = de700
>>>>> Modules linked in:
>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
>>>>> Hardware name: Hisilicon A9
>>>>> [<c010a600>] (show_stack) from [<c04b21c4>] (dump_stack+0x9c/0xc0)
>>>>> [<c04b21c4>] (dump_stack) from [<c011c708>] (__warn+0xc0/0xec)
>>>>> [<c011c708>] (__warn) from [<c011c7a8>] (warn_slowpath_fmt+0x74/0xa4)
>>>>> [<c011c7a8>] (warn_slowpath_fmt) from [<c023721c>]
>>>>> (add_to_free_list+0x8c/0xec)
>>>>> [<c023721c>] (add_to_free_list) from [<c0237e00>]
>>>>> (free_pcppages_bulk+0x200/0x278)
>>>>> [<c0237e00>] (free_pcppages_bulk) from [<c0238d14>]
>>>>> (free_unref_page+0x58/0x68)
>>>>> [<c0238d14>] (free_unref_page) from [<c023bb54>]
>>>>> (free_highmem_page+0xc/0x50)
>>>>> [<c023bb54>] (free_highmem_page) from [<c070620c>] (mem_init+0x21c/0x254)
>>>>> [<c070620c>] (mem_init) from [<c0700b38>] (start_kernel+0x258/0x5c0)
>>>>> [<c0700b38>] (start_kernel) from [<00000000>] (0x0)
>>>>>
>>>>> so any idea?
>>>>
>>>> If pfn = 0xde700, then due to pageblock_nr_pages = 0x200 the
>>>> start_pfn,end_pfn passed to move_freepages() will be [de600, de7ff],
>>>> but the range [de600,de700] has no 'struct page' and will lead to
>>>> this panic when pfn_valid_within() is not enabled (no HOLES_IN_ZONE);
>>>> the same issue can occur in isolate_freepages_block(), maybe
>>>
>>> I think your analysis is correct except one minor detail. With the #ifdef
>>> fix I've proposed earlier [1], the memmap for [0xde600, 0xde700] should not
>>> be freed so there should be a struct page. Did you check what parts of the
>>> memmap are actually freed with this patch applied?
>>> Would you get a panic if you add
>>>
>>> 	dump_page(pfn_to_page(0xde600), "");
>>>
>>> say, in the end of memblock_free_all()?
>>
>> The memory is not contiguous, see MEMBLOCK:
>>   memory size = 0x4c0fffff reserved size = 0x027ef058
>>   memory.cnt  = 0xa
>>   memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
>>   memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
>>   memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
>>   memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
>>   memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
>>   memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
>>   memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
>> ...
>>
>> The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
>> is not available memory, and we won't create a memmap, so with or without
>> your patch, we can't see the range in free_memmap(), right?
>   
> 
> This is not available memory and we won't see the range in free_memmap(),
> but we should still create a memmap for it, and that's what my patch tried
> to do.
> 
> There are a lot of places in core mm that operate on pageblocks and
> free_unused_memmap() should make sure that any pageblock has a valid memory
> map.
> 
> Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
> it.
> 
> Can you please send log with my patch applied and with the printing of
> ranges that are freed in free_unused_memmap() you've used in previous
> mails?
with your patch [1] and a debug print in free_memmap:
----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
__free_memory_core, range: 0x80a03000 - 0x80a04000, pfn: 80a03 - 80a04
__free_memory_core, range: 0x80a08000 - 0x80b00000, pfn: 80a08 - 80b00
__free_memory_core, range: 0x812e8058 - 0x83000000, pfn: 812e9 - 83000
__free_memory_core, range: 0x85000000 - 0x85600000, pfn: 85000 - 85600
__free_memory_core, range: 0x86a00000 - 0x87e00000, pfn: 86a00 - 87e00
__free_memory_core, range: 0x8bd00000 - 0x8c500000, pfn: 8bd00 - 8c500
__free_memory_core, range: 0x8e300000 - 0x8ed00000, pfn: 8e300 - 8ed00
__free_memory_core, range: 0x90d00000 - 0xaf2c0000, pfn: 90d00 - af2c0
__free_memory_core, range: 0xaf430000 - 0xaf450000, pfn: af430 - af450
__free_memory_core, range: 0xaf510000 - 0xaf540000, pfn: af510 - af540
__free_memory_core, range: 0xaf560000 - 0xaf580000, pfn: af560 - af580
__free_memory_core, range: 0xafd98000 - 0xafdc8000, pfn: afd98 - afdc8
__free_memory_core, range: 0xafdd8000 - 0xafe00000, pfn: afdd8 - afe00
__free_memory_core, range: 0xafe18000 - 0xafe80000, pfn: afe18 - afe80
__free_memory_core, range: 0xafee0000 - 0xaff00000, pfn: afee0 - aff00
__free_memory_core, range: 0xaff80000 - 0xaff8d000, pfn: aff80 - aff8d
__free_memory_core, range: 0xafff2000 - 0xafff4580, pfn: afff2 - afff4
__free_memory_core, range: 0xafffe000 - 0xafffe0e0, pfn: afffe - afffe
__free_memory_core, range: 0xafffe4fc - 0xafffe500, pfn: affff - afffe
__free_memory_core, range: 0xafffe6e4 - 0xafffe700, pfn: affff - afffe
__free_memory_core, range: 0xafffe8dc - 0xafffe8e0, pfn: affff - afffe
__free_memory_core, range: 0xafffe970 - 0xafffe980, pfn: affff - afffe
__free_memory_core, range: 0xafffe990 - 0xafffe9a0, pfn: affff - afffe
__free_memory_core, range: 0xafffe9a4 - 0xafffe9c0, pfn: affff - afffe
__free_memory_core, range: 0xafffeb54 - 0xafffeb60, pfn: affff - afffe
__free_memory_core, range: 0xafffecf4 - 0xafffed00, pfn: affff - afffe
__free_memory_core, range: 0xafffefc4 - 0xafffefd8, pfn: affff - afffe
__free_memory_core, range: 0xb0200000 - 0xc0000000, pfn: b0200 - b0200
__free_memory_core, range: 0xcc000000 - 0xdca00000, pfn: cc000 - b0200
__free_memory_core, range: 0xde700000 - 0xdea00000, pfn: de700 - b0200
__free_memory_core, range: 0xe0800000 - 0xe0c00000, pfn: e0800 - b0200
__free_memory_core, range: 0xf4b00000 - 0xf7000000, pfn: f4b00 - b0200
__free_memory_core, range: 0xfda00000 - 0xffffffff, pfn: fda00 - b0200
free_highpages, range_pfn [b0200, c0000], range_addr [b0200000, c0000000]
free_highpages, range_pfn [cc000, dca00], range_addr [cc000000, dca00000]
free_highpages, range_pfn [de700, dea00], range_addr [de700000, dea00000]
free_highpages, range_pfn [e0800, e0c00], range_addr [e0800000, e0c00000]
free_highpages, range_pfn [f4b00, f7000], range_addr [f4b00000, f7000000]
free_highpages, range_pfn [fda00, fffff], range_addr [fda00000, ffffffff]

>   
>>>> there are some scenes like this, so I selected HOLES_IN_ZONE in ARCH_HISI(ARM)
>>>> to solve this issue in our 5.10; should we select HOLES_IN_ZONE in all ARM or
>>>> only in ARCH_HISI, or is there any better solution? Thanks.
>>>
>>> I don't think that HOLES_IN_ZONE is the right solution. I believe that we
>>> must keep the memory map aligned on pageblock boundaries. That's surely not the
>>> case for SPARSEMEM as of now, and if my fix is not enough we need to find
>>> where it went wrong.
>>>
>>> Besides, I'd say that if it is possible to update your firmware to make the
>>> memory layout reported to the kernel less, hmm, esoteric, you would hit
>>> fewer corner cases.
>>
>> Sorry, the memory layout is customized and we can't change it; some memory is
>> used for special purposes by our product.
>   
> I understand that this memory cannot be used by Linux, but the firmware may
> supply the kernel with the actual physical memory layout and then mark all
> the special-purpose memory that the kernel should not touch as reserved.
We can only modify the kernel, so that is not practical for our product,
and this way looks like a workaround; we need to find a way to solve the
issue from the kernel side.

[1] https://lore.kernel.org/lkml/YIpY8TXCSc7Lfa2Z@kernel.org
Mike Rapoport May 11, 2021, 8:48 a.m. UTC | #26
On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
>
> > > The memory is not contiguous, see MEMBLOCK:
> > >   memory size = 0x4c0fffff reserved size = 0x027ef058
> > >   memory.cnt  = 0xa
> > >   memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
> > >   memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
> > >   memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
> > >   memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
> > >   memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
> > >   memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
> > >   memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
> > > ...
> > > 
> > > The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
> > > is not available memory, and we won't create a memmap, so with or without
> > > your patch, we can't see the range in free_memmap(), right?
> > 
> > This is not available memory and we won't see the range in free_memmap(),
> > but we should still create a memmap for it, and that's what my patch tried
> > to do.
> > 
> > There are a lot of places in core mm that operate on pageblocks and
> > free_unused_memmap() should make sure that any pageblock has a valid memory
> > map.
> > 
> > Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
> > it.
> > 
> > Can you please send log with my patch applied and with the printing of
> > ranges that are freed in free_unused_memmap() you've used in previous
> > mails?

> with your patch [1] and a debug print in free_memmap:
> ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
> ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
> ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
> ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
> ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
> ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
> ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000

It seems that freeing of the memory map is suboptimal still because that
code was not designed for memory layout that has more holes than Swiss
cheese. 

Still, the range [0xde600,0xde700] is not freed and there should be struct
pages for this range.

Can you add 

	dump_page(pfn_to_page(0xde600), "");

say, in the end of memblock_free_all()?
Kefeng Wang May 12, 2021, 3:08 a.m. UTC | #27
On 2021/5/11 16:48, Mike Rapoport wrote:
> On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
>>
>>>> The memory is not contiguous, see MEMBLOCK:
>>>>    memory size = 0x4c0fffff reserved size = 0x027ef058
>>>>    memory.cnt  = 0xa
>>>>    memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
>>>>    memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
>>>>    memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
>>>>    memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
>>>>    memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
>>>>    memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
>>>>    memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
>>>> ...
>>>>
>>>> The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
>>>> is not available memory, and we won't create a memmap, so with or without
>>>> your patch, we can't see the range in free_memmap(), right?
>>>
>>> This is not available memory and we won't see the range in free_memmap(),
>>> but we should still create a memmap for it, and that's what my patch tried
>>> to do.
>>>
>>> There are a lot of places in core mm that operate on pageblocks and
>>> free_unused_memmap() should make sure that any pageblock has a valid memory
>>> map.
>>>
>>> Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
>>> it.
>>>
>>> Can you please send log with my patch applied and with the printing of
>>> ranges that are freed in free_unused_memmap() you've used in previous
>>> mails?
> 
>> with your patch [1] and a debug print in free_memmap:
>> ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
>> ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
>> ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
>> ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
>> ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
>> ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
>> ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
> 
> It seems that freeing of the memory map is suboptimal still because that
> code was not designed for memory layout that has more holes than Swiss
> cheese.
> 
> Still, the range [0xde600,0xde700] is not freed and there should be struct
> pages for this range.
> 
> Can you add
> 
> 	dump_page(pfn_to_page(0xde600), "");
> 
> say, in the end of memblock_free_all()?
>   
> 
The range [0xde600,0xde700] is not memory, so sparse_init() won't create
struct pages for it?

After applying patch [1], the dump_page log:

page:ef3cc000 is uninitialized and poisoned
raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
page dumped because:


[1] 
https://lore.kernel.org/linux-mm/20210512031057.13580-3-wangkefeng.wang@huawei.com/T/#u
Matthew Wilcox May 12, 2021, 3:50 a.m. UTC | #28
On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> we see the PC is at PageLRU, the same reason as in the arm64 panic log,
> 
> "PageBuddy in move_freepages returns false Then we call PageLRU, the macro
> calls PF_HEAD which is compound_page() compound_page reads
> page->compound_head, it is 0xffffffffffffffff, so it resturns
> 0xfffffffffffffffe - and accessing this address causes crash"

Oh.  I posted patches to fix this back in 2018.

https://lore.kernel.org/linux-mm/20180414043145.3953-6-willy@infradead.org/

and 2019.

https://lore.kernel.org/linux-mm/20190501202433.GC28500@bombadil.infradead.org/

and 2020.

https://lore.kernel.org/linux-mm/20200408150148.25290-6-willy@infradead.org/

Looks like it's about that time of year for me to try to fix this again.
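
The failure mode quoted above can be modelled with a simplified version of
compound_head() (a sketch, not the exact kernel macros):

	/* A poisoned struct page has compound_head == 0xffffffffffffffff.
	 * The low bit means "tail page", so the computed head pointer is
	 * 0xfffffffffffffffe, and dereferencing it faults. */
	static inline struct page *compound_head_sketch(const struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		if (head & 1)
			return (struct page *)(head - 1);
		return (struct page *)page;
	}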
Mike Rapoport May 12, 2021, 8:26 a.m. UTC | #29
On Wed, May 12, 2021 at 11:08:14AM +0800, Kefeng Wang wrote:
> 
> On 2021/5/11 16:48, Mike Rapoport wrote:
> > On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
> > > 
> > > > > The memory is not contiguous, see MEMBLOCK:
> > > > >    memory size = 0x4c0fffff reserved size = 0x027ef058
> > > > >    memory.cnt  = 0xa
> > > > >    memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
> > > > >    memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
> > > > >    memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
> > > > >    memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
> > > > >    memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
> > > > >    memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
> > > > >    memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
> > > > > ...
> > > > > 
> > > > > The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
> > > > > is not available memory, and we won't create a memmap, so with or without
> > > > > your patch, we can't see the range in free_memmap(), right?
> > > > 
> > > > This is not available memory and we won't see the range in free_memmap(),
> > > > but we should still create a memmap for it, and that's what my patch tried
> > > > to do.
> > > > 
> > > > There are a lot of places in core mm that operate on pageblocks and
> > > > free_unused_memmap() should make sure that any pageblock has a valid memory
> > > > map.
> > > > 
> > > > Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
> > > > it.
> > > > 
> > > > Can you please send log with my patch applied and with the printing of
> > > > ranges that are freed in free_unused_memmap() you've used in previous
> > > > mails?
> > 
> > > with your patch [1] and a debug print in free_memmap:
> > > ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
> > > ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
> > > ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
> > > ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
> > > ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
> > > ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
> > > ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
> > 
> > It seems that freeing of the memory map is suboptimal still because that
> > code was not designed for memory layout that has more holes than Swiss
> > cheese.
> > 
> > Still, the range [0xde600,0xde700] is not freed and there should be struct
> > pages for this range.
> > 
> > Can you add
> > 
> > 	dump_page(pfn_to_page(0xde600), "");
> > 
> > say, in the end of memblock_free_all()?
> > 
> The range [0xde600,0xde700] is not memory, so sparse_init() won't create
> struct pages for it?

sparse_init() indeed does not create memory map for unpopulated memory, but
it has pretty coarse granularity, i.e. 64M in your configuration. A hole
should be at least 64M in order to skip allocation of the memory map for
it.

For example, your memory layout has a hole of 192M at pfn 0xc0000 and this
hole won't have the memory map.

However, the hole 0xdca00 - 0xde700 will still have a memory map in the
section that covers 0xdc000 - 0xe0000.

I've tried to outline this in a sketch below; hope it helps.

Memory:
                          c0000      cc000                      dca00
--------------------------+          +--------------------------+ +----+
 memory bank              |<- hole ->| memory bank              | | mb |
--------------------------+          +--------------------------+ +----+
                                                                de700  dea00

Memory map:

b0000    b4000            c0000      cc000   d0000    d8000    dc000
+--------+--------+- ... -+          +--------+- ... -+--------+---------+
| memmap | memmap | ...   |<- hole ->| memmap |  ...  | memmap | memmap  |
+--------+--------+- ... -+          +--------+- ... -+--------+---------+
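
The same point in code (a hypothetical helper, assuming this
configuration's 64M sections with 4K pages, i.e. PAGES_PER_SECTION =
0x4000):

	/* Sketch: generic SPARSEMEM validity is per-section, not per-pfn.
	 * 0xde600 >> 14 == 0xde700 >> 14 == 0x37, and section 0x37 (pfns
	 * 0xdc000-0xdffff) is present because it contains real memory, so
	 * both pfns have memmap entries. section_is_present() stands in
	 * for the real helpers. */
	#define PFN_SECTION_SHIFT	14

	static bool sparse_pfn_valid_sketch(unsigned long pfn)
	{
		return section_is_present(pfn >> PFN_SECTION_SHIFT);
	}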


> After applying patch [1], the dump_page log:
> 
> page:ef3cc000 is uninitialized and poisoned
> raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> page dumped because:

This means that there is a memory map entry, and it got poisoned during the
initialization and never got reinitialized to sensible values, which would
be PageReserved() in this case.
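
That is, for a hole the initialization should leave the memmap entry in
roughly this state (a simplified sketch modelled on
init_unavailable_range(), not the exact upstream code):

	/* A memmap entry for a hole should be initialized and marked
	 * reserved instead of staying poisoned. */
	static void init_hole_page_sketch(unsigned long pfn, int zone, int node)
	{
		struct page *page = pfn_to_page(pfn);

		__init_single_page(page, pfn, zone, node);
		__SetPageReserved(page);
	}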

I believe this was fixed by commit 0740a50b9baa ("mm/page_alloc.c: refactor
initialization of struct page for holes in memory layout") in the mainline
tree.

Can you backport it to your 5.10 tree and check if it helps?
Kefeng Wang May 13, 2021, 3:44 a.m. UTC | #30
On 2021/5/12 16:26, Mike Rapoport wrote:
> On Wed, May 12, 2021 at 11:08:14AM +0800, Kefeng Wang wrote:
>>
>> On 2021/5/11 16:48, Mike Rapoport wrote:
>>> On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
>>>>
>>>>>> The memory is not contiguous, see MEMBLOCK:
>>>>>>     memory size = 0x4c0fffff reserved size = 0x027ef058
>>>>>>     memory.cnt  = 0xa
>>>>>>     memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
>>>>>>     memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
>>>>>>     memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
>>>>>>     memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
>>>>>>     memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
>>>>>>     memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
>>>>>>     memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
>>>>>> ...
>>>>>>
>>>>>> The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
>>>>>> is not available memory, and we won't create a memmap, so with or without
>>>>>> your patch, we can't see the range in free_memmap(), right?
>>>>>
>>>>> This is not available memory and we won't see the range in free_memmap(),
>>>>> but we should still create a memmap for it, and that's what my patch tried
>>>>> to do.
>>>>>
>>>>> There are a lot of places in core mm that operate on pageblocks and
>>>>> free_unused_memmap() should make sure that any pageblock has a valid memory
>>>>> map.
>>>>>
>>>>> Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
>>>>> it.
>>>>>
>>>>> Can you please send log with my patch applied and with the printing of
>>>>> ranges that are freed in free_unused_memmap() you've used in previous
>>>>> mails?
>>>
>>>> with your patch [1] and a debug print in free_memmap:
>>>> ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
>>>> ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
>>>> ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
>>>> ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
>>>> ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
>>>> ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
>>>> ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
>>>
>>> It seems that freeing of the memory map is suboptimal still because that
>>> code was not designed for memory layout that has more holes than Swiss
>>> cheese.
>>>
>>> Still, the range [0xde600,0xde700] is not freed and there should be struct
>>> pages for this range.
>>>
>>> Can you add
>>>
>>> 	dump_page(pfn_to_page(0xde600), "");
>>>
>>> say, in the end of memblock_free_all()?
>>>
>>>> The range [0xde600,0xde700] is not memory, so sparse_init() won't create
>>>> struct pages for it?
> 
> sparse_init() indeed does not create memory map for unpopulated memory, but
> it has pretty coarse granularity, i.e. 64M in your configuration. A hole
> should be at least 64M in order to skip allocation of the memory map for
> it.
> 
> For example, your memory layout has a hole of 192M at pfn 0xc0000 and this
> hole won't have the memory map.
> 
>>> However, the hole 0xdca00 - 0xde700 will still have a memory map in the
>>> section that covers 0xdc000 - 0xe0000.
> 
>>> I've tried to outline this in a sketch below; hope it helps.
> 
> Memory:
>                            c0000      cc000                      dca00
> --------------------------+          +--------------------------+ +----+
>   memory bank              |<- hole ->| memory bank              | | mb |
> --------------------------+          +--------------------------+ +----+
>                                                                  de700  dea00
> 
> Memory map:
> 
> b0000    b4000            c0000      cc000   d0000    d8000    dc000
> +--------+--------+- ... -+          +--------+- ... -+--------+---------+
> | memmap | memmap | ...   |<- hole ->| memmap |  ...  | memmap | memmap  |
> +--------+--------+- ... -+          +--------+- ... -+--------+---------+
> 
> 
Thanks for the sketch, it is much clearer.

>> After applying patch [1], the dump_page log:
>>
>> page:ef3cc000 is uninitialized and poisoned
>> raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
>> page dumped because:
> 
> This means that there is a memory map entry, and it got poisoned during the
> initialization and never got reinitialized to sensible values, which would
> be PageReserved() in this case.
> 
> I believe this was fixed by commit 0740a50b9baa ("mm/page_alloc.c: refactor
> initialization of struct page for holes in memory layout") in the mainline
> tree.
> 
> Can you backport it to your 5.10 tree and check if it helps?
>   
Hi Mike, the 0740a50b9baa is already in 5.10, tags/v5.10.24~5

commit 4c84191cbc3eff49568d3c5cccb628fa382cf7fb
Author: Mike Rapoport <rppt@kernel.org>
Date:   Fri Mar 12 21:07:12 2021 -0800

     mm/page_alloc.c: refactor initialization of struct page for holes 
in memory layout

     commit 0740a50b9baa4472cfb12442df4b39e2712a64a4 upstream.

but looking at init_unavailable_range(), we need to deal with a hole within
the range of one pageblock.

In our case the pageblock range is [0xde600,0xde7ff], but the available
pfns begin at 0xde700.

If a pfn (e.g. 0xde600) is not valid, init_unavailable_range() steps by
pageblock_nr_pages, and ALIGN_DOWN(pfn, pageblock_nr_pages) is the same
for every pfn from 0xde600 to 0xde700, so the page range [0xde600,0xde700]
won't be initialized.

After applying the following patch, the OOM test passes:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aaa1655cf682..0c7e04f86f9f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6484,13 +6484,14 @@ static u64 __meminit 
init_unavailable_range(unsigned long spfn,
                                             unsigned long epfn,
                                             int zone, int node)
  {
-       unsigned long pfn;
+       unsigned long pfn, pfn_down;
+       unsigned long epfn_down = ALIGN_DOWN(epfn, pageblock_nr_pages);
         u64 pgcnt = 0;

         for (pfn = spfn; pfn < epfn; pfn++) {
-               if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-                       pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-                               + pageblock_nr_pages - 1;
+               pfn_down = ALIGN_DOWN(pfn, pageblock_nr_pages);
+               if (!pfn_valid(pfn_down) && pfn_down != epfn_down) {
+                       pfn = pfn_down + pageblock_nr_pages - 1;
                         continue;
                 }
                 __init_single_page(pfn_to_page(pfn), pfn, zone, node);


Before:
On node 0 totalpages: 311551
   Normal zone: 1230 pages used for memmap
   Normal zone: 0 pages reserved
   Normal zone: 157440 pages, LIFO batch:31
   Normal zone: 16384 pages in unavailable ranges
   HighMem zone: 154111 pages, LIFO batch:31
   HighMem zone: 1 pages in unavailable ranges

page:ef3cc000 is uninitialized and poisoned
raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff

After:
On node 0 totalpages: 311551
   Normal zone: 1230 pages used for memmap
   Normal zone: 0 pages reserved
   Normal zone: 157440 pages, LIFO batch:31
   Normal zone: 17152 pages in unavailable ranges
   HighMem zone: 154111 pages, LIFO batch:31
   HighMem zone: 513 pages in unavailable ranges
...
page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xde600
flags: 0xdd001000(reserved)
raw: dd001000 ef3cc004 ef3cc004 00000000 00000000 00000000 ffffffff 00000001
Mike Rapoport May 13, 2021, 10:55 a.m. UTC | #31
On Thu, May 13, 2021 at 11:44:00AM +0800, Kefeng Wang wrote:
> On 2021/5/12 16:26, Mike Rapoport wrote:
> > On Wed, May 12, 2021 at 11:08:14AM +0800, Kefeng Wang wrote:
> > > 
> > > On 2021/5/11 16:48, Mike Rapoport wrote:
> > > > On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
> > > > > 
> > > > > > > The memory is not contiguous, see MEMBLOCK:
> > > > > > >     memory size = 0x4c0fffff reserved size = 0x027ef058
> > > > > > >     memory.cnt  = 0xa
> > > > > > >     memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
> > > > > > >     memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
> > > > > > >     memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
> > > > > > >     memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
> > > > > > >     memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
> > > > > > >     memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
> > > > > > >     memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
> > > > > > > ...
> > > > > > > 
> > > > > > > The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
> > > > > > > is not available memory, and we won't create a memmap, so with or without
> > > > > > > your patch, we can't see the range in free_memmap(), right?
> > > > > > 
> > > > > > This is not available memory and we won't see the range in free_memmap(),
> > > > > > but we should still create a memmap for it, and that's what my patch tried
> > > > > > to do.
> > > > > > 
> > > > > > There are a lot of places in core mm that operate on pageblocks and
> > > > > > free_unused_memmap() should make sure that any pageblock has a valid memory
> > > > > > map.
> > > > > > 
> > > > > > Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
> > > > > > it.
> > > > > > 
> > > > > > Can you please send log with my patch applied and with the printing of
> > > > > > ranges that are freed in free_unused_memmap() you've used in previous
> > > > > > mails?
> > > > 
> > > > > with your patch [1] and a debug print in free_memmap:
> > > > > ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
> > > > > ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
> > > > > ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
> > > > > ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
> > > > > ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
> > > > > ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
> > > > > ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
> > > > 
> > > > It seems that freeing of the memory map is suboptimal still because that
> > > > code was not designed for memory layout that has more holes than Swiss
> > > > cheese.
> > > > 
> > > > Still, the range [0xde600,0xde700] is not freed and there should be struct
> > > > pages for this range.
> > > > 
> > > > Can you add
> > > > 
> > > > 	dump_page(pfn_to_page(0xde600), "");
> > > > 
> > > > say, in the end of memblock_free_all()?
> > > > 
> > > The range [0xde600,0xde700] is not memory, so sparse_init() won't create
> > > struct pages for it?
> > 
> > sparse_init() indeed does not create memory map for unpopulated memory, but
> > it has pretty coarse granularity, i.e. 64M in your configuration. A hole
> > should be at least 64M in order to skip allocation of the memory map for
> > it.
> > 
> > For example, your memory layout has a hole of 192M at pfn 0xc0000 and this
> > hole won't have the memory map.
> > 
> > However, the hole 0xdca00 - 0xde700 will still have a memory map in the
> > section that covers 0xdc000 - 0xe0000.
> > 
> > I've tried to outline this in a sketch below; hope it helps.
> > 
> > Memory:
> >                            c0000      cc000                      dca00
> > --------------------------+          +--------------------------+ +----+
> >   memory bank              |<- hole ->| memory bank              | | mb |
> > --------------------------+          +--------------------------+ +----+
> >                                                                  de700  dea00
> > 
> > Memory map:
> > 
> > b0000    b4000            c0000      cc000   d0000    d8000    dc000
> > +--------+--------+- ... -+          +--------+- ... -+--------+---------+
> > | memmap | memmap | ...   |<- hole ->| memmap |  ...  | memmap | memmap  |
> > +--------+--------+- ... -+          +--------+- ... -+--------+---------+
> > 
> > 
> Thanks for the sketch, it is much clearer.
> 
> > > After applying patch [1], the dump_page log:
> > > 
> > > page:ef3cc000 is uninitialized and poisoned
> > > raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> > > page dumped because:
> > 
> > This means that there is a memory map entry, and it got poisoned during the
> > initialization and never got reinitialized to sensible values, which would
> > be PageReserved() in this case.
> > 
> > I believe this was fixed by commit 0740a50b9baa ("mm/page_alloc.c: refactor
> > initialization of struct page for holes in memory layout") in the mainline
> > tree.
> > 
> > Can you backport it to your 5.10 tree and check if it helps?
> Hi Mike, the 0740a50b9baa is already in 5.10, tags/v5.10.24~5

Ah, you are using stable 5.10.y.
 
> commit 4c84191cbc3eff49568d3c5cccb628fa382cf7fb
> Author: Mike Rapoport <rppt@kernel.org>
> Date:   Fri Mar 12 21:07:12 2021 -0800
> 
>     mm/page_alloc.c: refactor initialization of struct page for holes in
> memory layout
> 
>     commit 0740a50b9baa4472cfb12442df4b39e2712a64a4 upstream.
> 
> but looking at init_unavailable_range(), we need to deal with a hole within
> the range of one pageblock.
> 
> In our case the pageblock range is [0xde600,0xde7ff], but the available
> pfns begin at 0xde700.
> 
> If a pfn (e.g. 0xde600) is not valid, init_unavailable_range() steps by
> pageblock_nr_pages, and ALIGN_DOWN(pfn, pageblock_nr_pages) is the same
> for every pfn from 0xde600 to 0xde700, so the page range [0xde600,0xde700]
> won't be initialized.

The pfn 0xde600 is valid in the sense that there is a memory map for that
pfn. Yet ARM's custom pfn_valid() will treat it as invalid because there
is a hole.
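
Roughly speaking, the two notions of "valid" differ like this (a sketch,
not the exact implementations in arch/arm/mm/init.c and
include/linux/mmzone.h):

	/* ARM's custom check asks "is this populated memory?"... */
	bool arm_pfn_valid_sketch(unsigned long pfn)
	{
		return memblock_is_map_memory(__pfn_to_phys(pfn)); /* 0xde600: false */
	}

	/* ...while the generic SPARSEMEM check asks "is there a memmap
	 * entry?", which is per-section. */
	bool generic_pfn_valid_sketch(unsigned long pfn)
	{
		return valid_section(__pfn_to_section(pfn));       /* 0xde600: true */
	}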
 
> After applying the following patch, the OOM test passes:
 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aaa1655cf682..0c7e04f86f9f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6484,13 +6484,14 @@ static u64 __meminit init_unavailable_range(unsigned
> long spfn,
>                                             unsigned long epfn,
>                                             int zone, int node)
>  {
> -       unsigned long pfn;
> +       unsigned long pfn, pfn_down;
> +       unsigned long epfn_down = ALIGN_DOWN(epfn, pageblock_nr_pages);
>         u64 pgcnt = 0;
> 
>         for (pfn = spfn; pfn < epfn; pfn++) {
> -               if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> -                       pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> -                               + pageblock_nr_pages - 1;
> +               pfn_down = ALIGN_DOWN(pfn, pageblock_nr_pages);
> +               if (!pfn_valid(pfn_down) && pfn_down != epfn_down) {
> +                       pfn = pfn_down + pageblock_nr_pages - 1;
>                         continue;
>                 }
>                 __init_single_page(pfn_to_page(pfn), pfn, zone, node);

I'd prefer to keep init_unavailable_range() and the assumption that the
memory map always covers an entire pageblock.

Can you please try the hack below? Essentially, it makes arm with SPARSEMEM
use the generic pfn_valid() and updates the freeing of the memory map to
keep entire pageblocks covered.

If this works I'll send formal patches for those changes.


diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 24804f11302d..86ee711a3fdb 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -73,7 +73,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KASAN if MMU && !XIP_KERNEL
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
-	select HAVE_ARCH_PFN_VALID
+#	select HAVE_ARCH_PFN_VALID
 	select HAVE_ARCH_SECCOMP
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
diff --git a/mm/memblock.c b/mm/memblock.c
index 504435753259..0d7bef1b49c3 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1928,9 +1928,11 @@ static void __init free_unused_memmap(void)
 	unsigned long start, end, prev_end = 0;
 	int i;
 
+#ifndef CONFIG_ARM
 	if (!IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) ||
 	    IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
 		return;
+#endif
 
 	/*
 	 * This relies on each bank being in address order.
@@ -1943,14 +1945,13 @@ static void __init free_unused_memmap(void)
 		 * due to SPARSEMEM sections which aren't present.
 		 */
 		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
-#else
+#endif
 		/*
 		 * Align down here since the VM subsystem insists that the
 		 * memmap entries are valid from the bank start aligned to
 		 * MAX_ORDER_NR_PAGES.
 		 */
 		start = round_down(start, MAX_ORDER_NR_PAGES);
-#endif
 
 		/*
 		 * If we had a previous bank, and there is a space
 

> Before:
> On node 0 totalpages: 311551
>   Normal zone: 1230 pages used for memmap
>   Normal zone: 0 pages reserved
>   Normal zone: 157440 pages, LIFO batch:31
>   Normal zone: 16384 pages in unavailable ranges
>   HighMem zone: 154111 pages, LIFO batch:31
>   HighMem zone: 1 pages in unavailable ranges
> 
> page:ef3cc000 is uninitialized and poisoned
> raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> 
> After:
> On node 0 totalpages: 311551
>   Normal zone: 1230 pages used for memmap
>   Normal zone: 0 pages reserved
>   Normal zone: 157440 pages, LIFO batch:31
>   Normal zone: 17152 pages in unavailable ranges
>   HighMem zone: 154111 pages, LIFO batch:31
>   HighMem zone: 513 pages in unavailable ranges
> ...
> page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xde600
> flags: 0xdd001000(reserved)
> raw: dd001000 ef3cc004 ef3cc004 00000000 00000000 00000000 ffffffff 00000001
>
Kefeng Wang May 14, 2021, 2:18 a.m. UTC | #32
On 2021/5/13 18:55, Mike Rapoport wrote:
> On Thu, May 13, 2021 at 11:44:00AM +0800, Kefeng Wang wrote:
>> On 2021/5/12 16:26, Mike Rapoport wrote:
>>> On Wed, May 12, 2021 at 11:08:14AM +0800, Kefeng Wang wrote:
>>>>
>>>> On 2021/5/11 16:48, Mike Rapoport wrote:
>>>>> On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
>>>>>>
>>>>>>>> The memory is not contiguous, see MEMBLOCK:
>>>>>>>>      memory size = 0x4c0fffff reserved size = 0x027ef058
>>>>>>>>      memory.cnt  = 0xa
>>>>>>>>      memory[0x0]    [0x80a00000-0x855fffff], 0x04c00000 bytes flags: 0x0
>>>>>>>>      memory[0x1]    [0x86a00000-0x87dfffff], 0x01400000 bytes flags: 0x0
>>>>>>>>      memory[0x2]    [0x8bd00000-0x8c4fffff], 0x00800000 bytes flags: 0x0
>>>>>>>>      memory[0x3]    [0x8e300000-0x8ecfffff], 0x00a00000 bytes flags: 0x0
>>>>>>>>      memory[0x4]    [0x90d00000-0xbfffffff], 0x2f300000 bytes flags: 0x0
>>>>>>>>      memory[0x5]    [0xcc000000-0xdc9fffff], 0x10a00000 bytes flags: 0x0
>>>>>>>>      memory[0x6]    [0xde700000-0xde9fffff], 0x00300000 bytes flags: 0x0
>>>>>>>> ...
>>>>>>>>
>>>>>>>> The pfn_range [0xde600,0xde700] => addr_range [0xde600000,0xde700000]
>>>>>>>> is not available memory, and we won't create a memmap, so with or without
>>>>>>>> your patch, we can't see the range in free_memmap(), right?
>>>>>>>
>>>>>>> This is not available memory and we won't see the range in free_memmap(),
>>>>>>> but we should still create a memmap for it, and that's what my patch tried
>>>>>>> to do.
>>>>>>>
>>>>>>> There are a lot of places in core mm that operate on pageblocks and
>>>>>>> free_unused_memmap() should make sure that any pageblock has a valid memory
>>>>>>> map.
>>>>>>>
>>>>>>> Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
>>>>>>> it.
>>>>>>>
>>>>>>> Can you please send log with my patch applied and with the printing of
>>>>>>> ranges that are freed in free_unused_memmap() you've used in previous
>>>>>>> mails?
>>>>>
>>>>>> with your patch [1] and a debug print in free_memmap:
>>>>>> ----> free_memmap, start_pfn = 85800,  85800000 end_pfn = 86800, 86800000
>>>>>> ----> free_memmap, start_pfn = 8c800,  8c800000 end_pfn = 8e000, 8e000000
>>>>>> ----> free_memmap, start_pfn = 8f000,  8f000000 end_pfn = 90000, 90000000
>>>>>> ----> free_memmap, start_pfn = dcc00,  dcc00000 end_pfn = de400, de400000
>>>>>> ----> free_memmap, start_pfn = dec00,  dec00000 end_pfn = e0000, e0000000
>>>>>> ----> free_memmap, start_pfn = e0c00,  e0c00000 end_pfn = e4000, e4000000
>>>>>> ----> free_memmap, start_pfn = f7000,  f7000000 end_pfn = f8000, f8000000
>>>>>
>>>>> It seems that freeing of the memory map is suboptimal still because that
>>>>> code was not designed for memory layout that has more holes than Swiss
>>>>> cheese.
>>>>>
>>>>> Still, the range [0xde600,0xde700] is not freed and there should be struct
>>>>> pages for this range.
>>>>>
>>>>> Can you add
>>>>>
>>>>> 	dump_page(pfn_to_page(0xde600), "");
>>>>>
>>>>> say, in the end of memblock_free_all()?
>>>>>
>>>> The range [0xde600,0xde700] is not memory, so sparse_init() won't create
>>>> struct pages for it?
>>>
>>> sparse_init() indeed does not create memory map for unpopulated memory, but
>>> it has pretty coarse granularity, i.e. 64M in your configuration. A hole
>>> should be at least 64M in order to skip allocation of the memory map for
>>> it.
>>>
>>> For example, your memory layout has a hole of 192M at pfn 0xc0000 and this
>>> hole won't have the memory map.
>>>
>>> However, the hole 0xdca00 - 0xde700 will still have a memory map in the
>>> section that covers 0xdc000 - 0xe0000.
>>>
>>> I've tried to outline this in a sketch below; hope it helps.
>>>
>>> Memory:
>>>                             c0000      cc000                      dca00
>>> --------------------------+          +--------------------------+ +----+
>>>    memory bank              |<- hole ->| memory bank              | | mb |
>>> --------------------------+          +--------------------------+ +----+
>>>                                                                   de700  dea00
>>>
>>> Memory map:
>>>
>>> b0000    b4000            c0000      cc000   d0000    d8000    dc000
>>> +--------+--------+- ... -+          +--------+- ... -+--------+---------+
>>> | memmap | memmap | ...   |<- hole ->| memmap |  ...  | memmap | memmap  |
>>> +--------+--------+- ... -+          +--------+- ... -+--------+---------+
>>>
>>>
>> Thanks for the sketch, it is much clearer.
>>
>>>> After applying patch [1], the dump_page log:
>>>>
>>>> page:ef3cc000 is uninitialized and poisoned
>>>> raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
>>>> page dumped because:
>>>
>>> This means that there is a memory map entry, and it got poisoned during the
>>> initialization and never got reinitialized to sensible values, which would
>>> be PageReserved() in this case.
>>>
>>> I believe this was fixed by commit 0740a50b9baa ("mm/page_alloc.c: refactor
>>> initialization of struct page for holes in memory layout") in the mainline
>>> tree.
>>>
>>> Can you backport it to your 5.10 tree and check if it helps?
>> Hi Mike, the 0740a50b9baa is already in 5.10, tags/v5.10.24~5
> 
> Ah, you are using stable 5.10.y.
>   
>> commit 4c84191cbc3eff49568d3c5cccb628fa382cf7fb
>> Author: Mike Rapoport <rppt@kernel.org>
>> Date:   Fri Mar 12 21:07:12 2021 -0800
>>
>>      mm/page_alloc.c: refactor initialization of struct page for holes in
>> memory layout
>>
>>      commit 0740a50b9baa4472cfb12442df4b39e2712a64a4 upstream.
>>
>> but looking at init_unavailable_range(), we need to deal with a hole within
>> the range of one pageblock.
>>
>> In our case the pageblock range is [0xde600,0xde7ff], but the available
>> pfns begin at 0xde700.
>>
>> If a pfn (e.g. 0xde600) is not valid, init_unavailable_range() steps by
>> pageblock_nr_pages, and ALIGN_DOWN(pfn, pageblock_nr_pages) is the same
>> for every pfn from 0xde600 to 0xde700, so the page range [0xde600,0xde700]
>> won't be initialized.
> 
> The pfn 0xde600 is valid in the sense that there is a memory map for that
> pfn. Yet ARM's custom pfn_valid() will treat it as invalid because there
> is a hole.
>   
>> After applying the following patch, the OOM test passes:
>   
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index aaa1655cf682..0c7e04f86f9f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6484,13 +6484,14 @@ static u64 __meminit init_unavailable_range(unsigned
>> long spfn,
>>                                              unsigned long epfn,
>>                                              int zone, int node)
>>   {
>> -       unsigned long pfn;
>> +       unsigned long pfn, pfn_down;
>> +       unsigned long epfn_down = ALIGN_DOWN(epfn, pageblock_nr_pages);
>>          u64 pgcnt = 0;
>>
>>          for (pfn = spfn; pfn < epfn; pfn++) {
>> -               if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
>> -                       pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
>> -                               + pageblock_nr_pages - 1;
>> +               pfn_down = ALIGN_DOWN(pfn, pageblock_nr_pages);
>> +               if (!pfn_valid(pfn_down) && pfn_down != epfn_down) {
>> +                       pfn = pfn_down + pageblock_nr_pages - 1;
>>                          continue;
>>                  }
>>                  __init_single_page(pfn_to_page(pfn), pfn, zone, node);
> 
> I'd prefer to keep init_unavailable_range() and the assumption that the
> memory map always covers an entire pageblock.
> 
> Can you please try the hack below? Essentially, it makes arm with SPARSEMEM
> use the generic pfn_valid() and updates the freeing of the memory map to
> keep entire pageblocks covered.
> 
> If this works I'll send formal patches for those changes.
> 
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 24804f11302d..86ee711a3fdb 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -73,7 +73,7 @@ config ARM
>   	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
>   	select HAVE_ARCH_KASAN if MMU && !XIP_KERNEL
>   	select HAVE_ARCH_MMAP_RND_BITS if MMU
> -	select HAVE_ARCH_PFN_VALID
> +#	select HAVE_ARCH_PFN_VALID
>   	select HAVE_ARCH_SECCOMP
>   	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
>   	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 504435753259..0d7bef1b49c3 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1928,9 +1928,11 @@ static void __init free_unused_memmap(void)
>   	unsigned long start, end, prev_end = 0;
>   	int i;
>   
> +#ifndef CONFIG_ARM
>   	if (!IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) ||
>   	    IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
>   		return;
> +#endif
>   
>   	/*
>   	 * This relies on each bank being in address order.
> @@ -1943,14 +1945,13 @@ static void __init free_unused_memmap(void)
>   		 * due to SPARSEMEM sections which aren't present.
>   		 */
>   		start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
> -#else
> +#endif
>   		/*
>   		 * Align down here since the VM subsystem insists that the
>   		 * memmap entries are valid from the bank start aligned to
>   		 * MAX_ORDER_NR_PAGES.
>   		 */
>   		start = round_down(start, MAX_ORDER_NR_PAGES);
> -#endif
>   
>   		/*
>   		 * If we had a previous bank, and there is a space
>   
> 

Without HAVE_ARCH_PFN_VALID, init_unavailable_range() will set those pages
with the Reserved flag, and yes, it works for the OOM test.

On node 0 totalpages: 311551
   Normal zone: 1230 pages used for memmap
   Normal zone: 0 pages reserved
   Normal zone: 157440 pages, LIFO batch:31
   Normal zone: 55552 pages in unavailable ranges
   HighMem zone: 154111 pages, LIFO batch:31
   HighMem zone: 41985 pages in unavailable ranges

Thanks for your kind guidance.

>> Before:
>> On node 0 totalpages: 311551
>>    Normal zone: 1230 pages used for memmap
>>    Normal zone: 0 pages reserved
>>    Normal zone: 157440 pages, LIFO batch:31
>>    Normal zone: 16384 pages in unavailable ranges
>>    HighMem zone: 154111 pages, LIFO batch:31
>>    HighMem zone: 1 pages in unavailable ranges
>>
>> page:ef3cc000 is uninitialized and poisoned
>> raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
>>
>> After:
>> On node 0 totalpages: 311551
>>    Normal zone: 1230 pages used for memmap
>>    Normal zone: 0 pages reserved
>>    Normal zone: 157440 pages, LIFO batch:31
>>    Normal zone: 17152 pages in unavailable ranges
>>    HighMem zone: 154111 pages, LIFO batch:31
>>    HighMem zone: 513 pages in unavailable ranges
>> ...
>> page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0xde600
>> flags: 0xdd001000(reserved)
>> raw: dd001000 ef3cc004 ef3cc004 00000000 00000000 00000000 ffffffff 00000001
>>
>