diff mbox

[PATCHv2] mm: Don't offset memmap for flatmem

Message ID 1421888500-24364-1-git-send-email-lauraa@codeaurora.org (mailing list archive)
State New, archived
Headers show

Commit Message

Laura Abbott Jan. 22, 2015, 1:01 a.m. UTC
Srinivas Kandagatla reported bad page messages when trying to
remove the bottom 2MB on an ARM based IFC6410 board

BUG: Bad page state in process swapper  pfn:fffa8
page:ef7fb500 count:0 mapcount:0 mapping:  (null) index:0x0
flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x200041(locked|active|mlocked)
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
Hardware name: Qualcomm (Flattened Device Tree)
[<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24)
[<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c)
[<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128)
[<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0)
[<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174)
[<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58)
[<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88)
[<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430)
[<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8)
[<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074)
Disabling lock debugging due to kernel taint

Removing the lower 2MB made the start of the lowmem zone to no longer
be page block aligned. IFC6410 uses CONFIG_FLATMEM where
alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map
will offset for unaligned nodes with the assumption the pfn/page
translation functions will account for the offset. The functions for
CONFIG_FLATMEM do not offset however, resulting in overrunning
the memmap array. Just use the allocated memmap without any offset
when running with CONFIG_FLATMEM to avoid the overrun.

Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Reported-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
---
 mm/page_alloc.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Andrew Morton Jan. 23, 2015, 12:20 a.m. UTC | #1
On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote:

> Srinivas Kandagatla reported bad page messages when trying to
> remove the bottom 2MB on an ARM based IFC6410 board
> 
> BUG: Bad page state in process swapper  pfn:fffa8
> page:ef7fb500 count:0 mapcount:0 mapping:  (null) index:0x0
> flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> bad because of flags:
> flags: 0x200041(locked|active|mlocked)
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
> Hardware name: Qualcomm (Flattened Device Tree)
> [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24)
> [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c)
> [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128)
> [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0)
> [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174)
> [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58)
> [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88)
> [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430)
> [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8)
> [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074)
> Disabling lock debugging due to kernel taint
> 
> Removing the lower 2MB made the start of the lowmem zone to no longer
> be page block aligned. IFC6410 uses CONFIG_FLATMEM where
> alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map
> will offset for unaligned nodes with the assumption the pfn/page
> translation functions will account for the offset. The functions for
> CONFIG_FLATMEM do not offset however, resulting in overrunning
> the memmap array. Just use the allocated memmap without any offset
> when running with CONFIG_FLATMEM to avoid the overrun.
> 

I don't think v2 addressed Vlastimil's review comment?
Laura Abbott Jan. 23, 2015, 12:33 a.m. UTC | #2
On 1/22/2015 4:20 PM, Andrew Morton wrote:
> On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote:
>
>> Srinivas Kandagatla reported bad page messages when trying to
>> remove the bottom 2MB on an ARM based IFC6410 board
>>
>> BUG: Bad page state in process swapper  pfn:fffa8
>> page:ef7fb500 count:0 mapcount:0 mapping:  (null) index:0x0
>> flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
>> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>> bad because of flags:
>> flags: 0x200041(locked|active|mlocked)
>> Modules linked in:
>> CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
>> Hardware name: Qualcomm (Flattened Device Tree)
>> [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24)
>> [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c)
>> [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128)
>> [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0)
>> [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174)
>> [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58)
>> [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88)
>> [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430)
>> [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8)
>> [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074)
>> Disabling lock debugging due to kernel taint
>>
>> Removing the lower 2MB made the start of the lowmem zone to no longer
>> be page block aligned. IFC6410 uses CONFIG_FLATMEM where
>> alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map
>> will offset for unaligned nodes with the assumption the pfn/page
>> translation functions will account for the offset. The functions for
>> CONFIG_FLATMEM do not offset however, resulting in overrunning
>> the memmap array. Just use the allocated memmap without any offset
>> when running with CONFIG_FLATMEM to avoid the overrun.
>>
>
> I don't think v2 addressed Vlastimil's review comment?
>

We're still adding the offset to node_mem_map and then subtracting it from
just mem_map. Did I miss another comment somewhere?
Vlastimil Babka Jan. 23, 2015, 9:05 a.m. UTC | #3
On 01/23/2015 01:33 AM, Laura Abbott wrote:
> On 1/22/2015 4:20 PM, Andrew Morton wrote:
>> On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote:
>>
>>> Srinivas Kandagatla reported bad page messages when trying to
>>> remove the bottom 2MB on an ARM based IFC6410 board
>>>
>>> BUG: Bad page state in process swapper  pfn:fffa8
>>> page:ef7fb500 count:0 mapcount:0 mapping:  (null) index:0x0
>>> flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
>>> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>>> bad because of flags:
>>> flags: 0x200041(locked|active|mlocked)
>>> Modules linked in:
>>> CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
>>> Hardware name: Qualcomm (Flattened Device Tree)
>>> [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24)
>>> [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c)
>>> [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128)
>>> [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0)
>>> [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174)
>>> [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58)
>>> [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88)
>>> [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430)
>>> [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8)
>>> [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074)
>>> Disabling lock debugging due to kernel taint
>>>
>>> Removing the lower 2MB made the start of the lowmem zone to no longer
>>> be page block aligned. IFC6410 uses CONFIG_FLATMEM where
>>> alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map
>>> will offset for unaligned nodes with the assumption the pfn/page
>>> translation functions will account for the offset. The functions for
>>> CONFIG_FLATMEM do not offset however, resulting in overrunning
>>> the memmap array. Just use the allocated memmap without any offset
>>> when running with CONFIG_FLATMEM to avoid the overrun.
>>>
>>
>> I don't think v2 addressed Vlastimil's review comment?
>>
>
> We're still adding the offset to node_mem_map and then subtracting it from
> just mem_map. Did I miss another comment somewhere?

Yes that was addressed, thanks. But I don't feel comfortable acking it 
yet, as I have no idea if we are doing the right thing for 
CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here.

Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP under 
the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will probably 
do the right thing, but looks like a weird test for this case here.

I have no good suggestion though, so let's CC Mel who apparently wrote 
the ARCH_PFN_OFFSET correction?
Mel Gorman Jan. 26, 2015, 3:56 p.m. UTC | #4
On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote:
> On 01/23/2015 01:33 AM, Laura Abbott wrote:
> >On 1/22/2015 4:20 PM, Andrew Morton wrote:
> >>On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote:
> >>
> >>>Srinivas Kandagatla reported bad page messages when trying to
> >>>remove the bottom 2MB on an ARM based IFC6410 board
> >>>
> >>>BUG: Bad page state in process swapper  pfn:fffa8
> >>>page:ef7fb500 count:0 mapcount:0 mapping:  (null) index:0x0
> >>>flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
> >>>page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> >>>bad because of flags:
> >>>flags: 0x200041(locked|active|mlocked)
> >>>Modules linked in:
> >>>CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
> >>>Hardware name: Qualcomm (Flattened Device Tree)
> >>>[<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24)
> >>>[<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c)
> >>>[<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128)
> >>>[<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0)
> >>>[<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174)
> >>>[<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58)
> >>>[<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88)
> >>>[<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430)
> >>>[<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8)
> >>>[<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074)
> >>>Disabling lock debugging due to kernel taint
> >>>
> >>>Removing the lower 2MB made the start of the lowmem zone to no longer
> >>>be page block aligned. IFC6410 uses CONFIG_FLATMEM where
> >>>alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map
> >>>will offset for unaligned nodes with the assumption the pfn/page
> >>>translation functions will account for the offset. The functions for
> >>>CONFIG_FLATMEM do not offset however, resulting in overrunning
> >>>the memmap array. Just use the allocated memmap without any offset
> >>>when running with CONFIG_FLATMEM to avoid the overrun.
> >>>
> >>
> >>I don't think v2 addressed Vlastimil's review comment?
> >>
> >
> >We're still adding the offset to node_mem_map and then subtracting it from
> >just mem_map. Did I miss another comment somewhere?
> 
> Yes that was addressed, thanks. But I don't feel comfortable acking
> it yet, as I have no idea if we are doing the right thing for
> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here.
> 
> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP
> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will
> probably do the right thing, but looks like a weird test for this
> case here.
> 
> I have no good suggestion though, so let's CC Mel who apparently
> wrote the ARCH_PFN_OFFSET correction?
> 

I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me?  I'm just
back today after been offline a week so didn't review the patch but IIRC,
ARCH_PFN_OFFSET deals with the case where physical memory does not start
at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0.
I don't recall it being related to the alignment of node 0 so if there
are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET
related then I'm surprised.
Vlastimil Babka Jan. 29, 2015, 1:13 p.m. UTC | #5
On 01/26/2015 04:56 PM, Mel Gorman wrote:
> On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote:
>> On 01/23/2015 01:33 AM, Laura Abbott wrote:
>>> On 1/22/2015 4:20 PM, Andrew Morton wrote:
>>>>
>>>> I don't think v2 addressed Vlastimil's review comment?
>>>>
>>>
>>> We're still adding the offset to node_mem_map and then subtracting it from
>>> just mem_map. Did I miss another comment somewhere?
>>
>> Yes that was addressed, thanks. But I don't feel comfortable acking
>> it yet, as I have no idea if we are doing the right thing for
>> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here.
>>
>> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP
>> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will
>> probably do the right thing, but looks like a weird test for this
>> case here.
>>
>> I have no good suggestion though, so let's CC Mel who apparently
>> wrote the ARCH_PFN_OFFSET correction?
>>
>
> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me?  I'm just
> back today after been offline a week so didn't review the patch but IIRC,
> ARCH_PFN_OFFSET deals with the case where physical memory does not start
> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0.
> I don't recall it being related to the alignment of node 0 so if there
> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET
> related then I'm surprised.

You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit 
467bc461d2 which was a bugfix to your commit c713216dee, which did 
introduce the mem_map correction code, and after which the code looked like:

mem_map = NODE_DATA(0)->node_mem_map;
#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
                if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
                        mem_map -= pgdat->node_start_pfn;
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */


It's from 2006 so I can't expect you remember the details, but I had 
some trouble finding out what this does. I assume it makes sure that 
mem_map points to struct page corresponding to pfn 0, because that's 
what translations using mem_map expect.
But pgdat->node_mem_map points to struct page corresponding to 
pgdat->node_start_pfn, which might not be 0. So it subtracts 
node_start_pfn to fix that. This is OK, as the node_mem_map is allocated 
(in this very function) with padding so that it covers a 
MAX_ORDER_NR_PAGES aligned area where node_mem_map may point to the 
middle of it.

Commit 467bc461d2 fixed this in case the first pfn is not 0, but 
ARCH_PFN_OFFSET. So mem_map points to struct page corresponding to 
pfn=ARCH_PFN_OFFSET, which is OK. But I still have few doubts:

1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of 
silently assumes that mem_map is allocated at the beginning of the node, 
i.e. at pgdat->node_start_pfn. And the only reason for this if-condition 
to be true, is that we haven't corrected the page_to_pfn translation, 
which uses mem_map. Is this assumption always OK to do? Shouldn't the 
if-condition be instead about pgdat->node_start_pfn not being aligned?

2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is 
nowadays called CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be 
#ifdef FLATMEM instead? After all, we are correcting value of mem_map 
based on page_to_pfn code variant used on FLATMEM. arm doesn't define
CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction.

3) The node_mem_map allocation code aligns the allocation to 
MAX_ORDER_NR_PAGES, so the offset between the start of the allocated map 
and where node_mem_map points to will be up to MAX_ORDER_NR_PAGES.
However, here we subtract (in current kernel) (pgdat->node_start_pfn - 
ARCH_PFN_OFFSET). That looks like another silent assumption, that 
pgdat->node_start_pfn is always between ARCH_PFN_OFFSET and 
ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were larger, the mem_map 
correction would subtract too much and end up below what was allocated 
for node_mem_map, no? The bug report behind this patch said that first 
2MB of memory was reserved using "no-map flag using DT". Unless this 
somehow translates to ARCH_PFN_OFFSET at build time, we would underflow 
mem_map, right? Maybe I'm just overly paranoid here and of course 
ARCH_PFN_OFFSET is determined properly on arm...

If anyone can confirm my doubts or point me to what I'm missing, thanks.
Laura Abbott Feb. 4, 2015, 2:25 a.m. UTC | #6
On 1/29/2015 5:13 AM, Vlastimil Babka wrote:
> On 01/26/2015 04:56 PM, Mel Gorman wrote:
>> On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote:
>>> On 01/23/2015 01:33 AM, Laura Abbott wrote:
>>>> On 1/22/2015 4:20 PM, Andrew Morton wrote:
>>>>>
>>>>> I don't think v2 addressed Vlastimil's review comment?
>>>>>
>>>>
>>>> We're still adding the offset to node_mem_map and then subtracting it from
>>>> just mem_map. Did I miss another comment somewhere?
>>>
>>> Yes that was addressed, thanks. But I don't feel comfortable acking
>>> it yet, as I have no idea if we are doing the right thing for
>>> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here.
>>>
>>> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will
>>> probably do the right thing, but looks like a weird test for this
>>> case here.
>>>
>>> I have no good suggestion though, so let's CC Mel who apparently
>>> wrote the ARCH_PFN_OFFSET correction?
>>>
>>
>> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me?  I'm just
>> back today after been offline a week so didn't review the patch but IIRC,
>> ARCH_PFN_OFFSET deals with the case where physical memory does not start
>> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0.
>> I don't recall it being related to the alignment of node 0 so if there
>> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET
>> related then I'm surprised.
>
> You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit
> 467bc461d2 which was a bugfix to your commit c713216dee, which did
>  introduce the mem_map correction code, and after which the code looked like:
>
> mem_map = NODE_DATA(0)->node_mem_map;
> #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
>                 if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
>                         mem_map -= pgdat->node_start_pfn;
> #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
>
>
> It's from 2006 so I can't expect you remember the details, but I had some
>  trouble finding out what this does. I assume it makes sure that mem_map points
>  to struct page corresponding to pfn 0, because that's what translations using
>  mem_map expect.
> But pgdat->node_mem_map points to struct page corresponding to
>  pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn
>  to fix that. This is OK, as the node_mem_map is allocated (in this very
>  function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area
>  where node_mem_map may point to the middle of it.
>
> Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET.
>  So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which
>  is OK. But I still have few doubts:
>
> 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently
>  assumes that mem_map is allocated at the beginning of the node, i.e. at
>  pgdat->node_start_pfn. And the only reason for this if-condition to be true,
>  is that we haven't corrected the page_to_pfn translation, which uses mem_map.
>  Is this assumption always OK to do? Shouldn't the if-condition be instead about
>  pgdat->node_start_pfn not being aligned?
>
> 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays  called  > CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead?
>  After all, we are correcting value of mem_map based on page_to_pfn code
>variant used on FLATMEM. arm doesn't define
> CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction.
>

Just doing #ifdef FLATMEM doesn't work because ARCH_PFN_OFFSET doesn't
seem to be picked up properly for NOMMU arches properly. Probably just
missing a header somewhere.

> 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES,
>  so the offset between the start of the allocated map and where node_mem_map
>  points to will be up to MAX_ORDER_NR_PAGES.
> However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET).
>  That looks like another silent assumption, that pgdat->node_start_pfn is always
>  between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were
>  larger, the mem_map correction would subtract too much and end up below what
>  was allocated for node_mem_map, no? The bug report behind this patch said that
>  first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow
>  translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right?
>  Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined
>  properly on arm...
>
> If anyone can confirm my doubts or point me to what I'm missing, thanks.

ARCH_PFN_OFFSET should always be the lowest PFN in the system, otherwise
I think plenty of other things are broken given how many architectures
make this assumption. That said, I don't think subtracting ARCH_PFN_OFFSET
makes it obvious why the adjustment is being made.

Thanks,
Laura
Laura Abbott Feb. 24, 2015, 7:54 p.m. UTC | #7
Reviving this thread because I don't think it ever got resolved.

On 2/3/2015 6:25 PM, Laura Abbott wrote:
> On 1/29/2015 5:13 AM, Vlastimil Babka wrote:
>> On 01/26/2015 04:56 PM, Mel Gorman wrote:
>>> On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote:
>>>> On 01/23/2015 01:33 AM, Laura Abbott wrote:
>>>>> On 1/22/2015 4:20 PM, Andrew Morton wrote:
>>>>>>
>>>>>> I don't think v2 addressed Vlastimil's review comment?
>>>>>>
>>>>>
>>>>> We're still adding the offset to node_mem_map and then subtracting it from
>>>>> just mem_map. Did I miss another comment somewhere?
>>>>
>>>> Yes that was addressed, thanks. But I don't feel comfortable acking
>>>> it yet, as I have no idea if we are doing the right thing for
>>>> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here.
>>>>
>>>> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>>> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will
>>>> probably do the right thing, but looks like a weird test for this
>>>> case here.
>>>>
>>>> I have no good suggestion though, so let's CC Mel who apparently
>>>> wrote the ARCH_PFN_OFFSET correction?
>>>>
>>>
>>> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me?  I'm just
>>> back today after been offline a week so didn't review the patch but IIRC,
>>> ARCH_PFN_OFFSET deals with the case where physical memory does not start
>>> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0.
>>> I don't recall it being related to the alignment of node 0 so if there
>>> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET
>>> related then I'm surprised.
>>
>> You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit
>> 467bc461d2 which was a bugfix to your commit c713216dee, which did
>>  introduce the mem_map correction code, and after which the code looked like:
>>
>> mem_map = NODE_DATA(0)->node_mem_map;
>> #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
>>                 if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
>>                         mem_map -= pgdat->node_start_pfn;
>> #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
>>
>>
>> It's from 2006 so I can't expect you remember the details, but I had some
>>  trouble finding out what this does. I assume it makes sure that mem_map points
>>  to struct page corresponding to pfn 0, because that's what translations using
>>  mem_map expect.
>> But pgdat->node_mem_map points to struct page corresponding to
>>  pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn
>>  to fix that. This is OK, as the node_mem_map is allocated (in this very
>>  function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area
>>  where node_mem_map may point to the middle of it.
>>
>> Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET.
>>  So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which
>>  is OK. But I still have few doubts:
>>
>> 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently
>>  assumes that mem_map is allocated at the beginning of the node, i.e. at
>>  pgdat->node_start_pfn. And the only reason for this if-condition to be true,
>>  is that we haven't corrected the page_to_pfn translation, which uses mem_map.
>>  Is this assumption always OK to do? Shouldn't the if-condition be instead about
>>  pgdat->node_start_pfn not being aligned?
>>
>> 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays  called  > CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead?
>>  After all, we are correcting value of mem_map based on page_to_pfn code
>> variant used on FLATMEM. arm doesn't define
>> CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction.
>>
>
> Just doing #ifdef FLATMEM doesn't work because ARCH_PFN_OFFSET doesn't
> seem to be picked up properly for NOMMU arches properly. Probably just
> missing a header somewhere.
>
>> 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES,
>>  so the offset between the start of the allocated map and where node_mem_map
>>  points to will be up to MAX_ORDER_NR_PAGES.
>> However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET).
>>  That looks like another silent assumption, that pgdat->node_start_pfn is always
>>  between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were
>>  larger, the mem_map correction would subtract too much and end up below what
>>  was allocated for node_mem_map, no? The bug report behind this patch said that
>>  first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow
>>  translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right?
>>  Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined
>>  properly on arm...
>>
>> If anyone can confirm my doubts or point me to what I'm missing, thanks.
>
> ARCH_PFN_OFFSET should always be the lowest PFN in the system, otherwise
> I think plenty of other things are broken given how many architectures
> make this assumption. That said, I don't think subtracting ARCH_PFN_OFFSET
> makes it obvious why the adjustment is being made.
>
> Thanks,
> Laura
>

I was incorrect before: it isn't just NOMMU but architectures that don't use
asm-generic/memory_model.h which failed to compile. I could respin with
more ifdefery around the ARCH_PFN_OFFSET if that sounds reasonable.

Thanks,
Laura
Vlastimil Babka Feb. 27, 2015, 3:24 p.m. UTC | #8
On 02/24/2015 08:54 PM, Laura Abbott wrote:
> Reviving this thread because I don't think it ever got resolved.
> 
> On 2/3/2015 6:25 PM, Laura Abbott wrote:
>> On 1/29/2015 5:13 AM, Vlastimil Babka wrote:
>>>>
>>>> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me?  I'm just
>>>> back today after been offline a week so didn't review the patch but IIRC,
>>>> ARCH_PFN_OFFSET deals with the case where physical memory does not start
>>>> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0.
>>>> I don't recall it being related to the alignment of node 0 so if there
>>>> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET
>>>> related then I'm surprised.
>>>
>>> You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit
>>> 467bc461d2 which was a bugfix to your commit c713216dee, which did
>>>  introduce the mem_map correction code, and after which the code looked like:
>>>
>>> mem_map = NODE_DATA(0)->node_mem_map;
>>> #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
>>>                 if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
>>>                         mem_map -= pgdat->node_start_pfn;
>>> #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
>>>
>>>
>>> It's from 2006 so I can't expect you remember the details, but I had some
>>>  trouble finding out what this does. I assume it makes sure that mem_map points
>>>  to struct page corresponding to pfn 0, because that's what translations using
>>>  mem_map expect.
>>> But pgdat->node_mem_map points to struct page corresponding to
>>>  pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn
>>>  to fix that. This is OK, as the node_mem_map is allocated (in this very
>>>  function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area
>>>  where node_mem_map may point to the middle of it.
>>>
>>> Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET.
>>>  So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which
>>>  is OK. But I still have few doubts:
>>>
>>> 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently
>>>  assumes that mem_map is allocated at the beginning of the node, i.e. at
>>>  pgdat->node_start_pfn. And the only reason for this if-condition to be true,
>>>  is that we haven't corrected the page_to_pfn translation, which uses mem_map.
>>>  Is this assumption always OK to do? Shouldn't the if-condition be instead about
>>>  pgdat->node_start_pfn not being aligned?
>>>
>>> 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays  called  > CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead?
>>>  After all, we are correcting value of mem_map based on page_to_pfn code
>>> variant used on FLATMEM. arm doesn't define
>>> CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction.
>>>
>>
>> Just doing #ifdef FLATMEM doesn't work because ARCH_PFN_OFFSET doesn't
>> seem to be picked up properly for NOMMU arches properly. Probably just
>> missing a header somewhere.
>>
>>> 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES,
>>>  so the offset between the start of the allocated map and where node_mem_map
>>>  points to will be up to MAX_ORDER_NR_PAGES.
>>> However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET).
>>>  That looks like another silent assumption, that pgdat->node_start_pfn is always
>>>  between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were
>>>  larger, the mem_map correction would subtract too much and end up below what
>>>  was allocated for node_mem_map, no? The bug report behind this patch said that
>>>  first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow
>>>  translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right?
>>>  Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined
>>>  properly on arm...
>>>
>>> If anyone can confirm my doubts or point me to what I'm missing, thanks.
>>
>> ARCH_PFN_OFFSET should always be the lowest PFN in the system, otherwise
>> I think plenty of other things are broken given how many architectures
>> make this assumption. That said, I don't think subtracting ARCH_PFN_OFFSET
>> makes it obvious why the adjustment is being made.
>>
>> Thanks,
>> Laura
>>
> 
> I was incorrect before: it isn't just NOMMU but architectures that don't use
> asm-generic/memory_model.h which failed to compile. I could respin with

Hm I see, some architectures use own variant of page_to_pfn, that's why it's
being used in the if () check.

> more ifdefery around the ARCH_PFN_OFFSET if that sounds reasonable.

So I think your v2 might be correct already. Unless there's an architecture that
defines CONFIG_FLATMEM and not CONFIG_HAVE_MEMBLOCK_NODE_MAP and places memmap
somewhere else than pgdat->node_start_pfn, which would trigger the check for a
wrong reason after the patch.

Looks like arm is an arch that doesn't define CONFIG_HAVE_MEMBLOCK_NODE_MAP, yet
it defines ARCH_PFN_OFFSET. With your patch it would correct memmap by the
calculated offset, not the ARCH_PFN_OFFSET constant. Are these two the same
then? Should there be something like a VM_BUG_ON that ARCH_PFN_OFFSET (if it
exists) is indeed equal to the calculated offset? Or maybe a more general
VM_BUG_ON checking that after any correction we make, the (page_to_pfn(mem_map)
== pgdat->node_start_pfn) condition holds?

> Thanks,
> Laura
>
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7633c50..269fc93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5005,6 +5005,8 @@  static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
 {
+	unsigned long __maybe_unused offset = 0;
+
 	/* Skip empty nodes */
 	if (!pgdat->node_spanned_pages)
 		return;
@@ -5021,6 +5023,7 @@  static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
 		 * for the buddy allocator to function correctly.
 		 */
 		start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
+		offset = pgdat->node_start_pfn - start;
 		end = pgdat_end_pfn(pgdat);
 		end = ALIGN(end, MAX_ORDER_NR_PAGES);
 		size =  (end - start) * sizeof(struct page);
@@ -5028,7 +5031,7 @@  static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
 		if (!map)
 			map = memblock_virt_alloc_node_nopanic(size,
 							       pgdat->node_id);
-		pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
+		pgdat->node_mem_map = map + offset;
 	}
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 	/*
@@ -5036,10 +5039,13 @@  static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
 	 */
 	if (pgdat == NODE_DATA(0)) {
 		mem_map = NODE_DATA(0)->node_mem_map;
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-		if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
-			mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET);
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+#if defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) || defined(CONFIG_FLATMEM)
+		if (page_to_pfn(mem_map) != pgdat->node_start_pfn) {
+			if (IS_ENABLED(CONFIG_HAVE_MEMBLOCK_NODE_MAP))
+				offset = pgdat->node_start_pfn - ARCH_PFN_OFFSET;
+			mem_map -= offset;
+		}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP || CONFIG_FLATMEM */
 	}
 #endif
 #endif /* CONFIG_FLAT_NODE_MEM_MAP */