Message ID | 1421888500-24364-1-git-send-email-lauraa@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote: > Srinivas Kandagatla reported bad page messages when trying to > remove the bottom 2MB on an ARM based IFC6410 board > > BUG: Bad page state in process swapper pfn:fffa8 > page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0 > flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked) > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > bad because of flags: > flags: 0x200041(locked|active|mlocked) > Modules linked in: > CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816 > Hardware name: Qualcomm (Flattened Device Tree) > [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24) > [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c) > [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128) > [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0) > [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174) > [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58) > [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88) > [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430) > [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8) > [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074) > Disabling lock debugging due to kernel taint > > Removing the lower 2MB made the start of the lowmem zone to no longer > be page block aligned. IFC6410 uses CONFIG_FLATMEM where > alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map > will offset for unaligned nodes with the assumption the pfn/page > translation functions will account for the offset. The functions for > CONFIG_FLATMEM do not offset however, resulting in overrunning > the memmap array. Just use the allocated memmap without any offset > when running with CONFIG_FLATMEM to avoid the overrun. > I don't think v2 addressed Vlastimil's review comment?
On 1/22/2015 4:20 PM, Andrew Morton wrote: > On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote: > >> Srinivas Kandagatla reported bad page messages when trying to >> remove the bottom 2MB on an ARM based IFC6410 board >> >> BUG: Bad page state in process swapper pfn:fffa8 >> page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0 >> flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked) >> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set >> bad because of flags: >> flags: 0x200041(locked|active|mlocked) >> Modules linked in: >> CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816 >> Hardware name: Qualcomm (Flattened Device Tree) >> [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24) >> [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c) >> [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128) >> [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0) >> [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174) >> [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58) >> [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88) >> [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430) >> [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8) >> [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074) >> Disabling lock debugging due to kernel taint >> >> Removing the lower 2MB made the start of the lowmem zone to no longer >> be page block aligned. IFC6410 uses CONFIG_FLATMEM where >> alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map >> will offset for unaligned nodes with the assumption the pfn/page >> translation functions will account for the offset. The functions for >> CONFIG_FLATMEM do not offset however, resulting in overrunning >> the memmap array. Just use the allocated memmap without any offset >> when running with CONFIG_FLATMEM to avoid the overrun. >> > > I don't think v2 addressed Vlastimil's review comment? > We're still adding the offset to node_mem_map and then subtracting it from just mem_map. Did I miss another comment somewhere?
On 01/23/2015 01:33 AM, Laura Abbott wrote: > On 1/22/2015 4:20 PM, Andrew Morton wrote: >> On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote: >> >>> Srinivas Kandagatla reported bad page messages when trying to >>> remove the bottom 2MB on an ARM based IFC6410 board >>> >>> BUG: Bad page state in process swapper pfn:fffa8 >>> page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0 >>> flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked) >>> page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set >>> bad because of flags: >>> flags: 0x200041(locked|active|mlocked) >>> Modules linked in: >>> CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816 >>> Hardware name: Qualcomm (Flattened Device Tree) >>> [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24) >>> [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c) >>> [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128) >>> [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0) >>> [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174) >>> [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58) >>> [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88) >>> [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430) >>> [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8) >>> [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074) >>> Disabling lock debugging due to kernel taint >>> >>> Removing the lower 2MB made the start of the lowmem zone to no longer >>> be page block aligned. IFC6410 uses CONFIG_FLATMEM where >>> alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map >>> will offset for unaligned nodes with the assumption the pfn/page >>> translation functions will account for the offset. The functions for >>> CONFIG_FLATMEM do not offset however, resulting in overrunning >>> the memmap array. Just use the allocated memmap without any offset >>> when running with CONFIG_FLATMEM to avoid the overrun. >>> >> >> I don't think v2 addressed Vlastimil's review comment? >> > > We're still adding the offset to node_mem_map and then subtracting it from > just mem_map. Did I miss another comment somewhere? Yes that was addressed, thanks. But I don't feel comfortable acking it yet, as I have no idea if we are doing the right thing for CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here. Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will probably do the right thing, but looks like a weird test for this case here. I have no good suggestion though, so let's CC Mel who apparently wrote the ARCH_PFN_OFFSET correction?
On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote: > On 01/23/2015 01:33 AM, Laura Abbott wrote: > >On 1/22/2015 4:20 PM, Andrew Morton wrote: > >>On Wed, 21 Jan 2015 17:01:40 -0800 Laura Abbott <lauraa@codeaurora.org> wrote: > >> > >>>Srinivas Kandagatla reported bad page messages when trying to > >>>remove the bottom 2MB on an ARM based IFC6410 board > >>> > >>>BUG: Bad page state in process swapper pfn:fffa8 > >>>page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0 > >>>flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked) > >>>page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > >>>bad because of flags: > >>>flags: 0x200041(locked|active|mlocked) > >>>Modules linked in: > >>>CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816 > >>>Hardware name: Qualcomm (Flattened Device Tree) > >>>[<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24) > >>>[<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c) > >>>[<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128) > >>>[<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0) > >>>[<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174) > >>>[<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58) > >>>[<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88) > >>>[<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430) > >>>[<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8) > >>>[<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074) > >>>Disabling lock debugging due to kernel taint > >>> > >>>Removing the lower 2MB made the start of the lowmem zone to no longer > >>>be page block aligned. IFC6410 uses CONFIG_FLATMEM where > >>>alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map > >>>will offset for unaligned nodes with the assumption the pfn/page > >>>translation functions will account for the offset. The functions for > >>>CONFIG_FLATMEM do not offset however, resulting in overrunning > >>>the memmap array. Just use the allocated memmap without any offset > >>>when running with CONFIG_FLATMEM to avoid the overrun. > >>> > >> > >>I don't think v2 addressed Vlastimil's review comment? > >> > > > >We're still adding the offset to node_mem_map and then subtracting it from > >just mem_map. Did I miss another comment somewhere? > > Yes that was addressed, thanks. But I don't feel comfortable acking > it yet, as I have no idea if we are doing the right thing for > CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here. > > Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP > under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will > probably do the right thing, but looks like a weird test for this > case here. > > I have no good suggestion though, so let's CC Mel who apparently > wrote the ARCH_PFN_OFFSET correction? > I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me? I'm just back today after been offline a week so didn't review the patch but IIRC, ARCH_PFN_OFFSET deals with the case where physical memory does not start at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0. I don't recall it being related to the alignment of node 0 so if there are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET related then I'm surprised.
On 01/26/2015 04:56 PM, Mel Gorman wrote: > On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote: >> On 01/23/2015 01:33 AM, Laura Abbott wrote: >>> On 1/22/2015 4:20 PM, Andrew Morton wrote: >>>> >>>> I don't think v2 addressed Vlastimil's review comment? >>>> >>> >>> We're still adding the offset to node_mem_map and then subtracting it from >>> just mem_map. Did I miss another comment somewhere? >> >> Yes that was addressed, thanks. But I don't feel comfortable acking >> it yet, as I have no idea if we are doing the right thing for >> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here. >> >> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP >> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will >> probably do the right thing, but looks like a weird test for this >> case here. >> >> I have no good suggestion though, so let's CC Mel who apparently >> wrote the ARCH_PFN_OFFSET correction? >> > > I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me? I'm just > back today after been offline a week so didn't review the patch but IIRC, > ARCH_PFN_OFFSET deals with the case where physical memory does not start > at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0. > I don't recall it being related to the alignment of node 0 so if there > are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET > related then I'm surprised. You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit 467bc461d2 which was a bugfix to your commit c713216dee, which did introduce the mem_map correction code, and after which the code looked like: mem_map = NODE_DATA(0)->node_mem_map; #ifdef CONFIG_ARCH_POPULATES_NODE_MAP if (page_to_pfn(mem_map) != pgdat->node_start_pfn) mem_map -= pgdat->node_start_pfn; #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ It's from 2006 so I can't expect you remember the details, but I had some trouble finding out what this does. I assume it makes sure that mem_map points to struct page corresponding to pfn 0, because that's what translations using mem_map expect. But pgdat->node_mem_map points to struct page corresponding to pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn to fix that. This is OK, as the node_mem_map is allocated (in this very function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area where node_mem_map may point to the middle of it. Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET. So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which is OK. But I still have few doubts: 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently assumes that mem_map is allocated at the beginning of the node, i.e. at pgdat->node_start_pfn. And the only reason for this if-condition to be true, is that we haven't corrected the page_to_pfn translation, which uses mem_map. Is this assumption always OK to do? Shouldn't the if-condition be instead about pgdat->node_start_pfn not being aligned? 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays called CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead? After all, we are correcting value of mem_map based on page_to_pfn code variant used on FLATMEM. arm doesn't define CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction. 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES, so the offset between the start of the allocated map and where node_mem_map points to will be up to MAX_ORDER_NR_PAGES. However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET). That looks like another silent assumption, that pgdat->node_start_pfn is always between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were larger, the mem_map correction would subtract too much and end up below what was allocated for node_mem_map, no? The bug report behind this patch said that first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right? Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined properly on arm... If anyone can confirm my doubts or point me to what I'm missing, thanks.
On 1/29/2015 5:13 AM, Vlastimil Babka wrote: > On 01/26/2015 04:56 PM, Mel Gorman wrote: >> On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote: >>> On 01/23/2015 01:33 AM, Laura Abbott wrote: >>>> On 1/22/2015 4:20 PM, Andrew Morton wrote: >>>>> >>>>> I don't think v2 addressed Vlastimil's review comment? >>>>> >>>> >>>> We're still adding the offset to node_mem_map and then subtracting it from >>>> just mem_map. Did I miss another comment somewhere? >>> >>> Yes that was addressed, thanks. But I don't feel comfortable acking >>> it yet, as I have no idea if we are doing the right thing for >>> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here. >>> >>> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP >>> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will >>> probably do the right thing, but looks like a weird test for this >>> case here. >>> >>> I have no good suggestion though, so let's CC Mel who apparently >>> wrote the ARCH_PFN_OFFSET correction? >>> >> >> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me? I'm just >> back today after been offline a week so didn't review the patch but IIRC, >> ARCH_PFN_OFFSET deals with the case where physical memory does not start >> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0. >> I don't recall it being related to the alignment of node 0 so if there >> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET >> related then I'm surprised. > > You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit > 467bc461d2 which was a bugfix to your commit c713216dee, which did > introduce the mem_map correction code, and after which the code looked like: > > mem_map = NODE_DATA(0)->node_mem_map; > #ifdef CONFIG_ARCH_POPULATES_NODE_MAP > if (page_to_pfn(mem_map) != pgdat->node_start_pfn) > mem_map -= pgdat->node_start_pfn; > #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ > > > It's from 2006 so I can't expect you remember the details, but I had some > trouble finding out what this does. I assume it makes sure that mem_map points > to struct page corresponding to pfn 0, because that's what translations using > mem_map expect. > But pgdat->node_mem_map points to struct page corresponding to > pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn > to fix that. This is OK, as the node_mem_map is allocated (in this very > function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area > where node_mem_map may point to the middle of it. > > Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET. > So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which > is OK. But I still have few doubts: > > 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently > assumes that mem_map is allocated at the beginning of the node, i.e. at > pgdat->node_start_pfn. And the only reason for this if-condition to be true, > is that we haven't corrected the page_to_pfn translation, which uses mem_map. > Is this assumption always OK to do? Shouldn't the if-condition be instead about > pgdat->node_start_pfn not being aligned? > > 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays called > CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead? > After all, we are correcting value of mem_map based on page_to_pfn code >variant used on FLATMEM. arm doesn't define > CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction. > Just doing #ifdef FLATMEM doesn't work because ARCH_PFN_OFFSET doesn't seem to be picked up properly for NOMMU arches properly. Probably just missing a header somewhere. > 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES, > so the offset between the start of the allocated map and where node_mem_map > points to will be up to MAX_ORDER_NR_PAGES. > However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET). > That looks like another silent assumption, that pgdat->node_start_pfn is always > between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were > larger, the mem_map correction would subtract too much and end up below what > was allocated for node_mem_map, no? The bug report behind this patch said that > first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow > translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right? > Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined > properly on arm... > > If anyone can confirm my doubts or point me to what I'm missing, thanks. ARCH_PFN_OFFSET should always be the lowest PFN in the system, otherwise I think plenty of other things are broken given how many architectures make this assumption. That said, I don't think subtracting ARCH_PFN_OFFSET makes it obvious why the adjustment is being made. Thanks, Laura
Reviving this thread because I don't think it ever got resolved. On 2/3/2015 6:25 PM, Laura Abbott wrote: > On 1/29/2015 5:13 AM, Vlastimil Babka wrote: >> On 01/26/2015 04:56 PM, Mel Gorman wrote: >>> On Fri, Jan 23, 2015 at 10:05:48AM +0100, Vlastimil Babka wrote: >>>> On 01/23/2015 01:33 AM, Laura Abbott wrote: >>>>> On 1/22/2015 4:20 PM, Andrew Morton wrote: >>>>>> >>>>>> I don't think v2 addressed Vlastimil's review comment? >>>>>> >>>>> >>>>> We're still adding the offset to node_mem_map and then subtracting it from >>>>> just mem_map. Did I miss another comment somewhere? >>>> >>>> Yes that was addressed, thanks. But I don't feel comfortable acking >>>> it yet, as I have no idea if we are doing the right thing for >>>> CONFIG_HAVE_MEMBLOCK_NODE_MAP && CONFIG_FLATMEM case here. >>>> >>>> Also putting the CONFIG_FLATMEM && !CONFIG_HAVE_MEMBLOCK_NODE_MAP >>>> under the "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" will >>>> probably do the right thing, but looks like a weird test for this >>>> case here. >>>> >>>> I have no good suggestion though, so let's CC Mel who apparently >>>> wrote the ARCH_PFN_OFFSET correction? >>>> >>> >>> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me? I'm just >>> back today after been offline a week so didn't review the patch but IIRC, >>> ARCH_PFN_OFFSET deals with the case where physical memory does not start >>> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0. >>> I don't recall it being related to the alignment of node 0 so if there >>> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET >>> related then I'm surprised. >> >> You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit >> 467bc461d2 which was a bugfix to your commit c713216dee, which did >> introduce the mem_map correction code, and after which the code looked like: >> >> mem_map = NODE_DATA(0)->node_mem_map; >> #ifdef CONFIG_ARCH_POPULATES_NODE_MAP >> if (page_to_pfn(mem_map) != pgdat->node_start_pfn) >> mem_map -= pgdat->node_start_pfn; >> #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ >> >> >> It's from 2006 so I can't expect you remember the details, but I had some >> trouble finding out what this does. I assume it makes sure that mem_map points >> to struct page corresponding to pfn 0, because that's what translations using >> mem_map expect. >> But pgdat->node_mem_map points to struct page corresponding to >> pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn >> to fix that. This is OK, as the node_mem_map is allocated (in this very >> function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area >> where node_mem_map may point to the middle of it. >> >> Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET. >> So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which >> is OK. But I still have few doubts: >> >> 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently >> assumes that mem_map is allocated at the beginning of the node, i.e. at >> pgdat->node_start_pfn. And the only reason for this if-condition to be true, >> is that we haven't corrected the page_to_pfn translation, which uses mem_map. >> Is this assumption always OK to do? Shouldn't the if-condition be instead about >> pgdat->node_start_pfn not being aligned? >> >> 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays called > CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead? >> After all, we are correcting value of mem_map based on page_to_pfn code >> variant used on FLATMEM. arm doesn't define >> CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction. >> > > Just doing #ifdef FLATMEM doesn't work because ARCH_PFN_OFFSET doesn't > seem to be picked up properly for NOMMU arches properly. Probably just > missing a header somewhere. > >> 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES, >> so the offset between the start of the allocated map and where node_mem_map >> points to will be up to MAX_ORDER_NR_PAGES. >> However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET). >> That looks like another silent assumption, that pgdat->node_start_pfn is always >> between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were >> larger, the mem_map correction would subtract too much and end up below what >> was allocated for node_mem_map, no? The bug report behind this patch said that >> first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow >> translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right? >> Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined >> properly on arm... >> >> If anyone can confirm my doubts or point me to what I'm missing, thanks. > > ARCH_PFN_OFFSET should always be the lowest PFN in the system, otherwise > I think plenty of other things are broken given how many architectures > make this assumption. That said, I don't think subtracting ARCH_PFN_OFFSET > makes it obvious why the adjustment is being made. > > Thanks, > Laura > I was incorrect before: it isn't just NOMMU but architectures that don't use asm-generic/memory_model.h which failed to compile. I could respin with more ifdefery around the ARCH_PFN_OFFSET if that sounds reasonable. Thanks, Laura
On 02/24/2015 08:54 PM, Laura Abbott wrote: > Reviving this thread because I don't think it ever got resolved. > > On 2/3/2015 6:25 PM, Laura Abbott wrote: >> On 1/29/2015 5:13 AM, Vlastimil Babka wrote: >>>> >>>> I don't recall introducing ARCH_PFN_OFFSET, are you sure it was me? I'm just >>>> back today after been offline a week so didn't review the patch but IIRC, >>>> ARCH_PFN_OFFSET deals with the case where physical memory does not start >>>> at 0. Without the offset, virtual _PAGE_OFFSET would not physical page 0. >>>> I don't recall it being related to the alignment of node 0 so if there >>>> are crashes due to misalignment of node 0 and the fix is ARCH_PFN_OFFSET >>>> related then I'm surprised. >>> >>> You're right that ARCH_PFN_OFFSET wasn't added by you, but by commit >>> 467bc461d2 which was a bugfix to your commit c713216dee, which did >>> introduce the mem_map correction code, and after which the code looked like: >>> >>> mem_map = NODE_DATA(0)->node_mem_map; >>> #ifdef CONFIG_ARCH_POPULATES_NODE_MAP >>> if (page_to_pfn(mem_map) != pgdat->node_start_pfn) >>> mem_map -= pgdat->node_start_pfn; >>> #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ >>> >>> >>> It's from 2006 so I can't expect you remember the details, but I had some >>> trouble finding out what this does. I assume it makes sure that mem_map points >>> to struct page corresponding to pfn 0, because that's what translations using >>> mem_map expect. >>> But pgdat->node_mem_map points to struct page corresponding to >>> pgdat->node_start_pfn, which might not be 0. So it subtracts node_start_pfn >>> to fix that. This is OK, as the node_mem_map is allocated (in this very >>> function) with padding so that it covers a MAX_ORDER_NR_PAGES aligned area >>> where node_mem_map may point to the middle of it. >>> >>> Commit 467bc461d2 fixed this in case the first pfn is not 0, but ARCH_PFN_OFFSET. >>> So mem_map points to struct page corresponding to pfn=ARCH_PFN_OFFSET, which >>> is OK. But I still have few doubts: >>> >>> 1) The "if (page_to_pfn(mem_map) != pgdat->node_start_pfn)" sort of silently >>> assumes that mem_map is allocated at the beginning of the node, i.e. at >>> pgdat->node_start_pfn. And the only reason for this if-condition to be true, >>> is that we haven't corrected the page_to_pfn translation, which uses mem_map. >>> Is this assumption always OK to do? Shouldn't the if-condition be instead about >>> pgdat->node_start_pfn not being aligned? >>> >>> 2) The #ifdef guard is about CONFIG_ARCH_POPULATES_NODE_MAP, which is nowadays called > CONFIG_HAVE_MEMBLOCK_NODE_MAP. But shouldn't it be #ifdef FLATMEM instead? >>> After all, we are correcting value of mem_map based on page_to_pfn code >>> variant used on FLATMEM. arm doesn't define >>> CONFIG_ARCH_POPULATES_NODE_MAP but apparently needs this correction. >>> >> >> Just doing #ifdef FLATMEM doesn't work because ARCH_PFN_OFFSET doesn't >> seem to be picked up properly for NOMMU arches properly. Probably just >> missing a header somewhere. >> >>> 3) The node_mem_map allocation code aligns the allocation to MAX_ORDER_NR_PAGES, >>> so the offset between the start of the allocated map and where node_mem_map >>> points to will be up to MAX_ORDER_NR_PAGES. >>> However, here we subtract (in current kernel) (pgdat->node_start_pfn - ARCH_PFN_OFFSET). >>> That looks like another silent assumption, that pgdat->node_start_pfn is always >>> between ARCH_PFN_OFFSET and ARCH_PFN_OFFSET + MAX_ORDER_NR_PAGES. If it were >>> larger, the mem_map correction would subtract too much and end up below what >>> was allocated for node_mem_map, no? The bug report behind this patch said that >>> first 2MB of memory was reserved using "no-map flag using DT". Unless this somehow >>> translates to ARCH_PFN_OFFSET at build time, we would underflow mem_map, right? >>> Maybe I'm just overly paranoid here and of course ARCH_PFN_OFFSET is determined >>> properly on arm... >>> >>> If anyone can confirm my doubts or point me to what I'm missing, thanks. >> >> ARCH_PFN_OFFSET should always be the lowest PFN in the system, otherwise >> I think plenty of other things are broken given how many architectures >> make this assumption. That said, I don't think subtracting ARCH_PFN_OFFSET >> makes it obvious why the adjustment is being made. >> >> Thanks, >> Laura >> > > I was incorrect before: it isn't just NOMMU but architectures that don't use > asm-generic/memory_model.h which failed to compile. I could respin with Hm I see, some architectures use own variant of page_to_pfn, that's why it's being used in the if () check. > more ifdefery around the ARCH_PFN_OFFSET if that sounds reasonable. So I think your v2 might be correct already. Unless there's an architecture that defines CONFIG_FLATMEM and not CONFIG_HAVE_MEMBLOCK_NODE_MAP and places memmap somewhere else than pgdat->node_start_pfn, which would trigger the check for a wrong reason after the patch. Looks like arm is an arch that doesn't define CONFIG_HAVE_MEMBLOCK_NODE_MAP, yet it defines ARCH_PFN_OFFSET. With your patch it would correct memmap by the calculated offset, not the ARCH_PFN_OFFSET constant. Are these two the same then? Should there be something like a VM_BUG_ON that ARCH_PFN_OFFSET (if it exists) is indeed equal to the calculated offset? Or maybe a more general VM_BUG_ON checking that after any correction we make, the (page_to_pfn(mem_map) == pgdat->node_start_pfn) condition holds? > Thanks, > Laura >
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7633c50..269fc93 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5005,6 +5005,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) { + unsigned long __maybe_unused offset = 0; + /* Skip empty nodes */ if (!pgdat->node_spanned_pages) return; @@ -5021,6 +5023,7 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) * for the buddy allocator to function correctly. */ start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1); + offset = pgdat->node_start_pfn - start; end = pgdat_end_pfn(pgdat); end = ALIGN(end, MAX_ORDER_NR_PAGES); size = (end - start) * sizeof(struct page); @@ -5028,7 +5031,7 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) if (!map) map = memblock_virt_alloc_node_nopanic(size, pgdat->node_id); - pgdat->node_mem_map = map + (pgdat->node_start_pfn - start); + pgdat->node_mem_map = map + offset; } #ifndef CONFIG_NEED_MULTIPLE_NODES /* @@ -5036,10 +5039,13 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) */ if (pgdat == NODE_DATA(0)) { mem_map = NODE_DATA(0)->node_mem_map; -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP - if (page_to_pfn(mem_map) != pgdat->node_start_pfn) - mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET); -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */ +#if defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) || defined(CONFIG_FLATMEM) + if (page_to_pfn(mem_map) != pgdat->node_start_pfn) { + if (IS_ENABLED(CONFIG_HAVE_MEMBLOCK_NODE_MAP)) + offset = pgdat->node_start_pfn - ARCH_PFN_OFFSET; + mem_map -= offset; + } +#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP || CONFIG_FLATMEM */ } #endif #endif /* CONFIG_FLAT_NODE_MEM_MAP */
Srinivas Kandagatla reported bad page messages when trying to remove the bottom 2MB on an ARM based IFC6410 board BUG: Bad page state in process swapper pfn:fffa8 page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x200041(locked|active|mlocked) Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816 Hardware name: Qualcomm (Flattened Device Tree) [<c0218280>] (unwind_backtrace) from [<c0212be8>] (show_stack+0x20/0x24) [<c0212be8>] (show_stack) from [<c0af7124>] (dump_stack+0x80/0x9c) [<c0af7124>] (dump_stack) from [<c0301570>] (bad_page+0xc8/0x128) [<c0301570>] (bad_page) from [<c03018a8>] (free_pages_prepare+0x168/0x1e0) [<c03018a8>] (free_pages_prepare) from [<c030369c>] (free_hot_cold_page+0x3c/0x174) [<c030369c>] (free_hot_cold_page) from [<c0303828>] (__free_pages+0x54/0x58) [<c0303828>] (__free_pages) from [<c030395c>] (free_highmem_page+0x38/0x88) [<c030395c>] (free_highmem_page) from [<c0f62d5c>] (mem_init+0x240/0x430) [<c0f62d5c>] (mem_init) from [<c0f5db3c>] (start_kernel+0x1e4/0x3c8) [<c0f5db3c>] (start_kernel) from [<80208074>] (0x80208074) Disabling lock debugging due to kernel taint Removing the lower 2MB made the start of the lowmem zone to no longer be page block aligned. IFC6410 uses CONFIG_FLATMEM where alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map will offset for unaligned nodes with the assumption the pfn/page translation functions will account for the offset. The functions for CONFIG_FLATMEM do not offset however, resulting in overrunning the memmap array. Just use the allocated memmap without any offset when running with CONFIG_FLATMEM to avoid the overrun. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Reported-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> --- mm/page_alloc.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)