Message ID: 20180606080408.GA31794@techadventures.net (mailing list archive)
State: New, archived
On Wed, Jun 06, 2018 at 10:04:08AM +0200, Oscar Salvador wrote:
> On Wed, Jun 06, 2018 at 05:16:24AM +0000, Naoya Horiguchi wrote:
> > On Tue, Jun 05, 2018 at 07:35:01AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> > > On Mon, Jun 04, 2018 at 06:18:36PM -0700, Matthew Wilcox wrote:
> > > > On Tue, Jun 05, 2018 at 12:54:03AM +0000, Naoya Horiguchi wrote:
> > > > > Reproduction procedure is like this:
> > > > > - enable RAM-based PMEM (with a kernel boot parameter like memmap=1G!4G)
> > > > > - read /proc/kpageflags (or call tools/vm/page-types with no arguments)
> > > > > (- my kernel config is attached)
> > > > >
> > > > > I spent a few days on this, but didn't reach any solution.
> > > > > So let me report this with some details below ...
> > > > >
> > > > > In the critical page request, stable_page_flags() is called with an argument
> > > > > page whose ->compound_head was somehow filled with '0xffffffffffffffff'.
> > > > > And compound_head() returns (struct page *)(head - 1), which explains the
> > > > > address 0xfffffffffffffffe in the above message.
> > > >
> > > > Hm. compound_head shares with:
> > > >
> > > > 	struct list_head lru;
> > > > 	struct list_head slab_list;	/* uses lru */
> > > > 	struct {	/* Partial pages */
> > > > 		struct page *next;
> > > > 		unsigned long _compound_pad_1;	/* compound_head */
> > > > 	unsigned long _pt_pad_1;	/* compound_head */
> > > > 	struct dev_pagemap *pgmap;
> > > > 	struct rcu_head rcu_head;
> > > >
> > > > None of them should be -1.
> > > >
> > > > > It seems that this kernel panic happens when reading kpageflags of the pfn
> > > > > range [0xbffd7, 0xc0000), which corresponds to a 'reserved' range.
> > > > >
> > > > > [ 0.000000] user-defined physical RAM map:
> > > > > [ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
> > > > > [ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > > > > [ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > > > > [ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
> > > > > [ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
> > > > > [ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > > > > [ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > > > > [ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)
> > > > >
> > > > > So I guess the 'memmap=' parameter might badly affect the memory
> > > > > initialization process.
> > > > >
> > > > > This problem doesn't reproduce on v4.17, so some pre-release patch introduced it.
> > > > > I hope this info helps you find the solution/workaround.
> > > >
> > > > Can you try bisecting this? It could be one of my patches to reorder struct
> > > > page, or it could be one of Pavel's deferred page initialisation patches.
> > > > Or something else ;-)
> > >
> > > Thank you for the comment. I'm bisecting now; I'll let you know the result later.
> > >
> > > And I found that my statement "not reproduce on v4.17" was wrong (I used
> > > different kvm guests, which created a different test condition and misled me);
> > > this seems to be an older (at least < 4.15) bug.
> >
> > (Cc: Pavel)
> >
> > Bisection showed that the following commit introduced this issue:
> >
> > commit f7f99100d8d95dbcf09e0216a143211e79418b9f
> > Author: Pavel Tatashin <pasha.tatashin@oracle.com>
> > Date:   Wed Nov 15 17:36:44 2017 -0800
> >
> >     mm: stop zeroing memory during allocation in vmemmap
> >
> > This patch postpones struct page zeroing to a later stage of memory
> > initialization. My kernel config disabled CONFIG_DEFERRED_STRUCT_PAGE_INIT,
> > so two callsites of __init_single_page() were never reached. So in such a
> > case, struct pages populated by vmemmap_pte_populate() could be left
> > uninitialized? And I'm not sure yet how this issue becomes visible with
> > the memmap= setting.
>
> I think that this becomes visible because memmap=x!y creates a persistent memory region:
>
> parse_memmap_one
> {
> ...
> 	} else if (*p == '!') {
> 		start_at = memparse(p+1, &p);
> 		e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
> ...
> }
>
> and this region is added neither to memblock.memory nor to memblock.reserved.
> Ranges in memblock.memory get zeroed in memmap_init_zone(), while ranges in
> memblock.reserved get zeroed in free_low_memory_core_early():
>
> static unsigned long __init free_low_memory_core_early(void)
> {
> ...
> 	for_each_reserved_mem_region(i, &start, &end)
> 		reserve_bootmem_region(start, end);
> ...
> }
>
> Maybe I am mistaken, but I think that persistent memory regions should be
> marked as reserved. A comment in do_mark_busy() suggests this:
>
> static bool __init do_mark_busy(enum e820_type type, struct resource *res)
> {
> ...
> 	/*
> 	 * Treat persistent memory like device memory, i.e. reserve it
> 	 * for exclusive use of a driver
> 	 */
> ...
> }
>
> I wonder if something like this could work and, if so, whether it is right
> (I haven't tested it yet):
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 71c11ad5643e..3c9686ef74e5 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1247,6 +1247,11 @@ void __init e820__memblock_setup(void)
>  		if (end != (resource_size_t)end)
>  			continue;
>
> +		if (entry->type == E820_TYPE_PRAM || entry->type == E820_TYPE_PMEM) {
> +			memblock_reserve(entry->addr, entry->size);
> +			continue;
> +		}
> +
>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>  			continue;

It does not seem to work, so the reasoning might be incorrect.

Best Regards
Oscar Salvador
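[Editorial note: for context on the 0xfffffffffffffffe address discussed above,
compound_head() around this kernel version reads roughly as follows (a simplified
sketch of include/linux/page-flags.h, not an exact quote). An uninitialized
->compound_head of all ones has bit 0 set, so it is misinterpreted as a tail-page
marker and decodes to the bogus head pointer seen in the crash:

	static inline struct page *compound_head(struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		/* Bit 0 set means "tail page; the head page is at head - 1". */
		if (unlikely(head & 1))
			return (struct page *)(head - 1);
		return page;
	}

So 0xffffffffffffffff & 1 is true, and head - 1 = 0xfffffffffffffffe.]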
On Wed, Jun 06, 2018 at 10:53:19AM +0200, Oscar Salvador wrote:
[... full quote of the thread above trimmed ...]
> It does not seem to work, so the reasoning might be incorrect.

Thank you for the comment.

One note is that the memory region with "broken struct page" is a typical
reserved region, not a pmem region. Strangely, reading offset 0xbffd7 of
/proc/kpageflags is OK if the pmem region does not exist, but crashes if the
pmem region exists. Reading an offset like 0x100000 (on the pmem region)
does not cause the crash, so the pmem region seems to be properly set up.
[ 0.000000] user-defined physical RAM map:
[ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
[ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved   ===> "broken struct page" region
[ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)   => pmem region
[ 0.000000] user: [mem 0x0000000140000000-0x000000023fffffff] usable

Thanks,
Naoya Horiguchi
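[Editorial note: as a side note on the reproducer, /proc/kpageflags exposes one
64-bit KPF_* flags word per pfn, so "offset 0xbffd7" means the u64 at byte offset
0xbffd7 * 8. A minimal userspace reader, roughly what tools/vm/page-types does
internally, might look like this (untested sketch):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint64_t flags;
		unsigned long pfn = 0xbffd7;	/* first pfn of the "broken" range */
		int fd = open("/proc/kpageflags", O_RDONLY);

		if (fd < 0)
			return 1;
		/* one u64 of flag bits per pfn, indexed by pfn */
		if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) != sizeof(flags))
			return 1;
		printf("pfn 0x%lx flags 0x%llx\n", pfn, (unsigned long long)flags);
		close(fd);
		return 0;
	}
]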
On Wed, Jun 06, 2018 at 09:06:30AM +0000, Horiguchi Naoya(堀口 直也) wrote:
[... full quote of the thread above trimmed ...]
> One note is that the memory region with "broken struct page" is a typical
> reserved region, not a pmem region.

I have another note:

> My kernel config disabled CONFIG_DEFERRED_STRUCT_PAGE_INIT, so two callsites of
> __init_single_page() were never reached. So in such a case, struct pages populated
> by vmemmap_pte_populate() could be left uninitialized?

I quickly checked whether enabling CONFIG_DEFERRED_STRUCT_PAGE_INIT affects
the issue, and found that the kernel panic happens even with this config
enabled. So I'm still confused...

- Naoya
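[Editorial note: for reference, after commit f7f99100d8d9 the zeroing that used
to happen at vmemmap allocation time is done in __init_single_page(), roughly as
below (a simplified sketch of the mm/page_alloc.c code of that era, not an exact
quote). Any struct page that never passes through this function keeps whatever
garbage the vmemmap allocation left in it:

	static void __meminit __init_single_page(struct page *page, unsigned long pfn,
						 unsigned long zone, int nid)
	{
		mm_zero_struct_page(page);	/* zeroing moved here by f7f99100d8d9 */
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		page_mapcount_reset(page);

		INIT_LIST_HEAD(&page->lru);
		...
	}
]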
On Wed, Jun 06, 2018 at 09:24:05AM +0000, Horiguchi Naoya(堀口 直也) wrote:
[... full quote of the thread above trimmed ...]
> I quickly checked whether enabling CONFIG_DEFERRED_STRUCT_PAGE_INIT affects
> the issue, and found that the kernel panic happens even with this config
> enabled. So I'm still confused...

Let me share some new facts:

I accidentally gave an inconvenient memmap layout like 'memmap=1G!4G' on a
2-NUMA-node guest with 8 GB of memory. I didn't intend this, but 4 GB is the
address at which a memory block starts when no "memmap=" option is provided.

(messages from free_area_init_nodes() for the no-"memmap=" case)
[ 0.000000] Early memory node ranges
[ 0.000000]   node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000]   node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
[ 0.000000]   node 0: [mem 0x0000000100000000-0x000000013fffffff] // <---
[ 0.000000]   node 1: [mem 0x0000000140000000-0x000000023fffffff]

When "memmap=1G!4G" is given, the range [0x0000000100000000-0x000000013fffffff]
disappears and the kernel messages look like this:

(messages from free_area_init_nodes() for the "memmap=1G!4G" case)
[ 0.000000] Early memory node ranges
[ 0.000000]   node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000]   node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
[ 0.000000]   node 1: [mem 0x0000000140000000-0x000000023fffffff]

This makes the kernel think that the end pfn of node 0 is 0xbffd7 instead of
0x140000, which affects the memory initialization process.
memmap_init_zone() calls __init_single_page() for each page within a zone,
so if zone->spanned_pages is underestimated, some pages are left uninitialized.

If I provide 'memmap=1G!7G', the kernel panic does not reproduce and the
kernel messages look like this:

(messages from free_area_init_nodes() for the "memmap=1G!7G" case)
[ 0.000000] Early memory node ranges
[ 0.000000]   node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000]   node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
[ 0.000000]   node 0: [mem 0x0000000100000000-0x000000013fffffff]
[ 0.000000]   node 1: [mem 0x0000000140000000-0x00000001bfffffff]
[ 0.000000]   node 1: [mem 0x0000000200000000-0x000000023fffffff]

I think that in order to fix this, we need some conditions and/or prechecks
for the memblock layout. Does that make sense? Or is there a better approach?

Thanks,
Naoya Horiguchi
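[Editorial note: to illustrate the underestimation, memmap_init_zone() only
walks the pfns spanned by the zone, so pfns outside the (too small) node 0
span are never passed to __init_single_page(). A simplified sketch of the
loop, assuming the 4.17-era code in mm/page_alloc.c:

	/* end_pfn is derived from zone_start_pfn + zone->spanned_pages */
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn))
			continue;
		...
		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
		...
	}
]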
On Thu, Jun 07, 2018 at 06:22:19AM +0000, Naoya Horiguchi wrote:
[... full quote of the thread above trimmed ...]
> I think that in order to fix this, we need some conditions and/or prechecks
> for the memblock layout. Does that make sense? Or is there a better approach?

This is what I am seeing too: some memory just vanishes and is left
uninitialized. All this is handled in parse_memmap_one(), so I wonder if
the right thing to do would be to back off and not insert a user-specified
map when we detect that it falls within a usable map. Maybe a subroutine
that checks for that kind of overlapping map before calling e820__range_add()?

Oscar Salvador
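[Editorial note: a rough, untested sketch of the kind of precheck proposed
here: before parse_memmap_one() calls e820__range_add(start_at, mem_size,
E820_TYPE_PRAM), bail out if the user-defined range intersects an existing
usable (E820_TYPE_RAM) entry. The helper name overlaps_usable_ram() is
invented for illustration and is not an existing kernel API:

	static bool __init overlaps_usable_ram(u64 start, u64 size)
	{
		int i;

		for (i = 0; i < e820_table->nr_entries; i++) {
			struct e820_entry *entry = &e820_table->entries[i];

			if (entry->type != E820_TYPE_RAM)
				continue;
			/* standard interval-intersection test */
			if (start < entry->addr + entry->size &&
			    entry->addr < start + size)
				return true;
		}
		return false;
	}
]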
On Thu, Jun 07, 2018 at 08:59:40AM +0200, Oscar Salvador wrote:
[... full quote of the thread above trimmed ...]
> > I think that in order to fix this, we need some conditions and/or prechecks
> > for the memblock layout. Does that make sense? Or is there a better approach?

Could you share the "e820: BIOS-provided physical RAM map" and
"e820: user-defined physical RAM map" output with both memmap= arguments
(1G!4G and 1G!7G)?

thanks
Oscar Salvador
On Thu, Jun 07, 2018 at 11:49:21AM +0200, Oscar Salvador wrote: > On Thu, Jun 07, 2018 at 08:59:40AM +0200, Oscar Salvador wrote: > > On Thu, Jun 07, 2018 at 06:22:19AM +0000, Naoya Horiguchi wrote: > > > On Wed, Jun 06, 2018 at 09:24:05AM +0000, Horiguchi Naoya(堀口 直也) wrote: > > > > On Wed, Jun 06, 2018 at 09:06:30AM +0000, Horiguchi Naoya(堀口 直也) wrote: > > > > > On Wed, Jun 06, 2018 at 10:53:19AM +0200, Oscar Salvador wrote: > > > > > > On Wed, Jun 06, 2018 at 10:04:08AM +0200, Oscar Salvador wrote: > > > > > > > On Wed, Jun 06, 2018 at 05:16:24AM +0000, Naoya Horiguchi wrote: > > > > > > > > On Tue, Jun 05, 2018 at 07:35:01AM +0000, Horiguchi Naoya(堀口 直也) wrote: > > > > > > > > > On Mon, Jun 04, 2018 at 06:18:36PM -0700, Matthew Wilcox wrote: > > > > > > > > > > On Tue, Jun 05, 2018 at 12:54:03AM +0000, Naoya Horiguchi wrote: > > > > > > > > > > > Reproduction precedure is like this: > > > > > > > > > > > - enable RAM based PMEM (with a kernel boot parameter like memmap=1G!4G) > > > > > > > > > > > - read /proc/kpageflags (or call tools/vm/page-types with no arguments) > > > > > > > > > > > (- my kernel config is attached) > > > > > > > > > > > > > > > > > > > > > > I spent a few days on this, but didn't reach any solutions. > > > > > > > > > > > So let me report this with some details below ... > > > > > > > > > > > > > > > > > > > > > > In the critial page request, stable_page_flags() is called with an argument > > > > > > > > > > > page whose ->compound_head was somehow filled with '0xffffffffffffffff'. > > > > > > > > > > > And compound_head() returns (struct page *)(head - 1), which explains the > > > > > > > > > > > address 0xfffffffffffffffe in the above message. > > > > > > > > > > > > > > > > > > > > Hm. compound_head shares with: > > > > > > > > > > > > > > > > > > > > struct list_head lru; > > > > > > > > > > struct list_head slab_list; /* uses lru */ > > > > > > > > > > struct { /* Partial pages */ > > > > > > > > > > struct page *next; > > > > > > > > > > unsigned long _compound_pad_1; /* compound_head */ > > > > > > > > > > unsigned long _pt_pad_1; /* compound_head */ > > > > > > > > > > struct dev_pagemap *pgmap; > > > > > > > > > > struct rcu_head rcu_head; > > > > > > > > > > > > > > > > > > > > None of them should be -1. > > > > > > > > > > > > > > > > > > > > > It seems that this kernel panic happens when reading kpageflags of pfn range > > > > > > > > > > > [0xbffd7, 0xc0000), which coresponds to a 'reserved' range. > > > > > > > > > > > > > > > > > > > > > > [ 0.000000] user-defined physical RAM map: > > > > > > > > > > > [ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable > > > > > > > > > > > [ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved > > > > > > > > > > > [ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved > > > > > > > > > > > [ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable > > > > > > > > > > > [ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved > > > > > > > > > > > [ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved > > > > > > > > > > > [ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved > > > > > > > > > > > [ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12) > > > > > > > > > > > > > > > > > > > > > > So I guess 'memmap=' parameter might badly affect the memory initialization process. 
> > > > > > > > > > > > > > > > > > > > > > This problem doesn't reproduce on v4.17, so some pre-released patch introduces it. > > > > > > > > > > > I hope this info helps you find the solution/workaround. > > > > > > > > > > > > > > > > > > > > Can you try bisecting this? It could be one of my patches to reorder struct > > > > > > > > > > page, or it could be one of Pavel's deferred page initialisation patches. > > > > > > > > > > Or something else ;-) > > > > > > > > > > > > > > > > > > Thank you for the comment. I'm trying bisecting now, let you know the result later. > > > > > > > > > > > > > > > > > > And I found that my statement "not reproduce on v4.17" was wrong (I used > > > > > > > > > different kvm guests, which made some different test condition and misguided me), > > > > > > > > > this seems an older (at least < 4.15) bug. > > > > > > > > > > > > > > > > (Cc: Pavel) > > > > > > > > > > > > > > > > Bisection showed that the following commit introduced this issue: > > > > > > > > > > > > > > > > commit f7f99100d8d95dbcf09e0216a143211e79418b9f > > > > > > > > Author: Pavel Tatashin <pasha.tatashin@oracle.com> > > > > > > > > Date: Wed Nov 15 17:36:44 2017 -0800 > > > > > > > > > > > > > > > > mm: stop zeroing memory during allocation in vmemmap > > > > > > > > > > > > > > > > This patch postpones struct page zeroing to later stage of memory initialization. > > > > > > > > My kernel config disabled CONFIG_DEFERRED_STRUCT_PAGE_INIT so two callsites of > > > > > > > > __init_single_page() were never reached. So in such case, struct pages populated > > > > > > > > by vmemmap_pte_populate() could be left uninitialized? > > > > > > > > And I'm not sure yet how this issue becomes visible with memmap= setting. > > > > > > > > > > > > > > I think that this becomes visible because memmap=x!y creates a persistent memory region: > > > > > > > > > > > > > > parse_memmap_one > > > > > > > { > > > > > > > ... > > > > > > > } else if (*p == '!') { > > > > > > > start_at = memparse(p+1, &p); > > > > > > > e820__range_add(start_at, mem_size, E820_TYPE_PRAM); > > > > > > > ... > > > > > > > } > > > > > > > > > > > > > > and this region it is not added neither in memblock.memory nor in memblock.reserved. > > > > > > > Ranges in memblock.memory get zeroed in memmap_init_zone(), while memblock.reserved get zeroed > > > > > > > in free_low_memory_core_early(): > > > > > > > > > > > > > > static unsigned long __init free_low_memory_core_early(void) > > > > > > > { > > > > > > > ... > > > > > > > for_each_reserved_mem_region(i, &start, &end) > > > > > > > reserve_bootmem_region(start, end); > > > > > > > ... > > > > > > > } > > > > > > > > > > > > > > > > > > > > > Maybe I am mistaken, but I think that persistent memory regions should be marked as reserved. > > > > > > > A comment in do_mark_busy() suggests this: > > > > > > > > > > > > > > static bool __init do_mark_busy(enum e820_type type, struct resource *res) > > > > > > > { > > > > > > > > > > > > > > ... > > > > > > > /* > > > > > > > * Treat persistent memory like device memory, i.e. reserve it > > > > > > > * for exclusive use of a driver > > > > > > > */ > > > > > > > ... 
> > > > > Thank you for the comment.
> > > > >
> > > > > One note is that the memory region with the "broken struct page" is a typical
> > > > > reserved region, not a pmem region. Strangely, reading offset 0xbffd7 of
> > > > > /proc/kpageflags is fine if no pmem region exists, but crashes if one does.
> > > > > Reading an offset like 0x100000 (on the pmem region) does not cause the crash,
> > > > > so the pmem region itself seems properly set up.
> > > > >
> > > > > [ 0.000000] user-defined physical RAM map:
> > > > > [ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
> > > > > [ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > > > > [ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > > > > [ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
> > > > > [ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved   ===> "broken struct page" region
> > > > > [ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > > > > [ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > > > > [ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)   => pmem region
> > > > > [ 0.000000] user: [mem 0x0000000140000000-0x000000023fffffff] usable
> > > >
> > > > I have another note:
> > > >
> > > > > My kernel config disabled CONFIG_DEFERRED_STRUCT_PAGE_INIT, so two callsites of
> > > > > __init_single_page() were never reached. So in such a case, struct pages populated
> > > > > by vmemmap_pte_populate() could be left uninitialized?
> > > >
> > > > I quickly checked whether enabling CONFIG_DEFERRED_STRUCT_PAGE_INIT affects
> > > > the issue, and found that the kernel panic happens even with this config enabled.
> > > > So I'm still confused...
> > >
> > > Let me share some new facts:
> > >
> > > I accidentally gave an inconvenient memmap layout like 'memmap=1G!4G' on a
> > > 2-node NUMA guest with 8 GB of memory. I didn't intend this, but 4GB happens
> > > to be the start address of a memory block when no "memmap=" option is given.
> > >
> > > (messages from free_area_init_nodes() for the no "memmap=" case)
> > > [ 0.000000] Early memory node ranges
> > > [ 0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > > [ 0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffd6fff]
> > > [ 0.000000]   node   0: [mem 0x0000000100000000-0x000000013fffffff]   // <---
> > > [ 0.000000]   node   1: [mem 0x0000000140000000-0x000000023fffffff]
> > >
> > > When "memmap=1G!4G" is given, the range [0x0000000100000000-0x000000013fffffff]
> > > disappears, and the kernel messages look like this:
> > >
> > > (messages from free_area_init_nodes() for the "memmap=1G!4G" case)
> > > [ 0.000000] Early memory node ranges
> > > [ 0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > > [ 0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffd6fff]
> > > [ 0.000000]   node   1: [mem 0x0000000140000000-0x000000023fffffff]
> > >
> > > This makes the kernel think that the end pfn of node 0 is 0xbffd7
> > > instead of 0x140000, which affects the memory initialization process.
> > > memmap_init_zone() calls __init_single_page() for each page within a zone,
> > > so if zone->spanned_pages is underestimated, some pages are left uninitialized.
> > >
> > > If I provide 'memmap=1G!7G', the kernel panic does not reproduce, and the
> > > kernel messages look like this:
> > >
> > > (messages from free_area_init_nodes() for the "memmap=1G!7G" case)
> > > [ 0.000000] Early memory node ranges
> > > [ 0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > > [ 0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffd6fff]
> > > [ 0.000000]   node   0: [mem 0x0000000100000000-0x000000013fffffff]
> > > [ 0.000000]   node   1: [mem 0x0000000140000000-0x00000001bfffffff]
> > > [ 0.000000]   node   1: [mem 0x0000000200000000-0x000000023fffffff]
> > >
> > > I think that in order to fix this, we need some conditions and/or prechecks
> > > for the memblock layout. Does that make sense? Or is there any better approach?

Could you share the "e820: BIOS-provided physical RAM map" and "e820: user-defined
physical RAM map" output with both memmap= args (1G!4G and 1G!7G)?
Sure, here it is:

# memmap=1G!4G

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bffd6fff] usable
BIOS-e820: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
NX (Execute Disable) protection: active
user-defined physical RAM map:
user: [mem 0x0000000000000000-0x000000000009fbff] usable
user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)
user: [mem 0x0000000140000000-0x000000023fffffff] usable

# memmap=1G!7G

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bffd6fff] usable
BIOS-e820: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
NX (Execute Disable) protection: active
user-defined physical RAM map:
user: [mem 0x0000000000000000-0x000000000009fbff] usable
user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
user: [mem 0x0000000100000000-0x00000001bfffffff] usable
user: [mem 0x00000001c0000000-0x00000001ffffffff] persistent (type 12)
user: [mem 0x0000000200000000-0x000000023fffffff] usable

I attached full dmesgs (including some printk output for debugging) just in case.

Thanks,
Naoya Horiguchi
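[ A side note on the read that triggers the crash: /proc/kpageflags exports one
64-bit flags word per pfn, so a single pfn can be probed directly instead of
scanning with tools/vm/page-types. A minimal userspace sketch follows; it is
not the actual page-types source, and the pfn value 0xbffd7 is simply taken
from the report above. ]

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        uint64_t flags;
        off_t pfn = 0xbffd7;    /* start of the "broken struct page" range above */
        int fd = open("/proc/kpageflags", O_RDONLY);    /* needs root */

        /* one u64 of flags per pfn, so pfn N lives at byte offset N * 8 */
        if (fd < 0 || pread(fd, &flags, sizeof(flags), pfn * 8) != sizeof(flags)) {
                perror("kpageflags");
                return 1;
        }
        printf("pfn 0x%llx flags 0x%016llx\n",
               (unsigned long long)pfn, (unsigned long long)flags);
        return 0;
}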
On Thu, Jun 07, 2018 at 10:02:56AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> [ full quote of the earlier messages and the e820 dumps trimmed; see above ]
> > > > I think that in order to fix this, we need some conditions and/or prechecks
> > > > for the memblock layout. Does that make sense? Or is there any better approach?
>
> All this is handled in parse_memmap_one(), so I wonder if the right thing to do
> would be to back off and not insert a user-specified map when we detect that it
> falls within a usable map.
>
> Maybe a subroutine that checks for that kind of overlapping map before calling
> e820__range_add()?

This problem seems to happen when the end address of the user-defined memmap is
equal to the end address of a NUMA node (0x140000000, i.e. 5GB, in this case),
so I think that is the condition to be checked. However, we don't initialize
NUMA at parse_memmap_one() time, so we need to identify the right place for
this check; I'll work on that next.

Thanks,
Naoya Horiguchi
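[ An untested sketch of the condition Naoya describes. NUMA node ends are not
known when parse_memmap_one() runs, so this illustrative helper approximates
"end of a NUMA node" with "end of a BIOS-provided E820_TYPE_RAM range", which
is where the node boundaries in this report come from; the function name
memmap_ends_on_ram_boundary() is hypothetical, not existing kernel code. ]

/* untested sketch in arch/x86/kernel/e820.c style: does a user-defined
 * region end exactly where a BIOS E820_TYPE_RAM range ends? */
static bool __init memmap_ends_on_ram_boundary(u64 start, u64 size)
{
        int i;

        for (i = 0; i < e820_table->nr_entries; i++) {
                struct e820_entry *entry = &e820_table->entries[i];

                if (entry->type == E820_TYPE_RAM &&
                    entry->addr + entry->size == start + size)
                        return true;
        }
        return false;
}

/* possible use in parse_memmap_one(), before e820__range_add():
 *
 *      if (*p == '!' && memmap_ends_on_ram_boundary(start_at, mem_size))
 *              pr_warn("memmap: region ends on a RAM boundary, node span may shrink\n");
 */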
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 71c11ad5643e..3c9686ef74e5 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1247,6 +1247,11 @@ void __init e820__memblock_setup(void)
 		if (end != (resource_size_t)end)
 			continue;
 
+		if (entry->type == E820_TYPE_PRAM || entry->type == E820_TYPE_PMEM) {
+			memblock_reserve(entry->addr, entry->size);
+			continue;
+		}
+
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
 			continue;
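[ A back-of-envelope check of why reserving the PRAM region alone cannot cover
the crashing reads, consistent with Naoya's note that the "broken struct page"
range is an ordinary reserved region: with 4KB pages, pfn = addr >> 12, and the
pfns that crash kpageflags fall in the reserved hole below 4G, outside the
region this patch reserves. Plain userspace C, for illustration only. ]

#include <stdio.h>

int main(void)
{
        /* pmem region created by memmap=1G!4G */
        unsigned long long pram_start = 0x100000000ULL, pram_end = 0x140000000ULL;
        /* reserved hole that holds the "broken struct page" range */
        unsigned long long hole_start = 0xbffd7000ULL, hole_end = 0xc0000000ULL;

        printf("PRAM pfns:  [0x%llx, 0x%llx)\n", pram_start >> 12, pram_end >> 12);
        printf("crash pfns: [0x%llx, 0x%llx)\n", hole_start >> 12, hole_end >> 12);
        /* prints [0x100000, 0x140000) and [0xbffd7, 0xc0000) respectively */
        return 0;
}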