| Message ID | CAFPAmTQByL0YJT8Lvar1Oe+3Q1EREvqPA_GP=hHApJDz5dSOzQ@mail.gmail.com (mailing list archive) |
| --- | --- |
| State | New, archived |
On Tue, Aug 02, 2011 at 05:38:31PM +0530, Kautuk Consul wrote:
> Hi,
>
> In the case where the total kernel memory is not aligned to the SECTION_SIZE_BITS I see a kernel crash.
>
> When I copy a huge file, then the kernel crashes at the following callstack:
>

The callstack should not be 80-column formatted as this is completely mangled and unreadable without manual editing. Also, why did you not include the full error message? With it, I'd have a better idea of which bug check you hit.

> Backtrace:
> <SNIP>
>
> The reason for this is that the CONFIG_HOLES_IN_ZONE configuration option is not automatically enabled when SPARSEMEM or ARCH_HAS_HOLES_MEMORYMODEL are enabled. Due to this, the pfn_valid_within() macro always returns 1, due to which the BUG_ON is encountered.
> This patch enables the CONFIG_HOLES_IN_ZONE config option if either ARCH_HAS_HOLES_MEMORYMODEL or SPARSEMEM is enabled.
>
> Although I tested this on an older kernel, i.e., 2.6.35.13, I see that this option has not been enabled as yet in linux-3.0, and this appears to be a logically correct change anyway with respect to pfn_valid_within() functionality.
>

There is a performance cost associated with HOLES_IN_ZONE which may be offset by memory savings, but not necessarily.

If the BUG_ON you are hitting is this one, BUG_ON(page_zone(start_page) != page_zone(end_page)), then I'd be wondering why the check in move_freepages_block() was insufficient.

If it's because holes are punched in the memmap then the option does need to be set.
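
For context on the mechanism being debated here: CONFIG_HOLES_IN_ZONE toggles what pfn_valid_within() does. A minimal sketch of the definition from that era (paraphrased from include/linux/mmzone.h; check the exact tree in use for the authoritative text):

```c
/*
 * Sketch of pfn_valid_within() as discussed in this thread
 * (paraphrased from include/linux/mmzone.h, 2.6.35/3.0 era).
 *
 * With CONFIG_HOLES_IN_ZONE every pfn inside a MAX_ORDER block is
 * re-validated against the memmap; without it the check compiles away
 * to "always valid", so callers such as move_freepages() assume the
 * whole block has initialised struct pages.
 */
#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn)	pfn_valid(pfn)
#else
#define pfn_valid_within(pfn)	(1)
#endif
```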
Hi Mel,

Sorry for the formatting.

I forgot to include the following entire backtrace:

#> cp test_huge_file nfsmnt
kernel BUG at mm/page_alloc.c:849!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ce9f0000
<SNIP>
Backtrace:
[<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>] (move_freepages_block+0xd4/0x158)
[<c008e7dc>] (move_freepages_block+0x0/0x158) from [<c008eb10>] (__rmqueue+0x1dc/0x32c)
 r8:c0481120 r7:c048107c r6:00000003 r5:00000001 r4:c04f8200 r3:00000000
[<c008e934>] (__rmqueue+0x0/0x32c) from [<c00906f8>] (get_page_from_freelist+0x12c/0x530)
[<c00905cc>] (get_page_from_freelist+0x0/0x530) from [<c0090bec>] (__alloc_pages_nodemask+0xf0/0x544)
[<c0090afc>] (__alloc_pages_nodemask+0x0/0x544) from [<c00b4da4>] (cache_alloc_refill+0x2d0/0x654)
[<c00b4ad4>] (cache_alloc_refill+0x0/0x654) from [<c00b5258>] (kmem_cache_alloc+0x58/0x9c)
[<c00b5200>] (kmem_cache_alloc+0x0/0x9c) from [<c01f0154>] (radix_tree_preload+0x58/0xbc)
 r7:00006741 r6:000000d0 r5:c04a98a0 r4:ce986000
[<c01f00fc>] (radix_tree_preload+0x0/0xbc) from [<c008ac94>] (add_to_page_cache_locked+0x20/0x1c4)
 r6:ce987d20 r5:ce346c1c r4:c04f8600 r3:000000d0
[<c008ac74>] (add_to_page_cache_locked+0x0/0x1c4) from [<c008ae84>] (add_to_page_cache_lru+0x4c/0x7c)
 r8:00000020 r7:ce7402a0 r6:ce987d20 r5:00000005 r4:c04f8600 r3:000000d0
[<c008ae38>] (add_to_page_cache_lru+0x0/0x7c) from [<c00e8428>] (mpage_readpages+0x7c/0x108)
 r5:00000005 r4:c04f8600
[<c00e83ac>] (mpage_readpages+0x0/0x108) from [<c010f450>] (fat_readpages+0x20/0x28)
[<c010f430>] (fat_readpages+0x0/0x28) from [<c0092ebc>] (__do_page_cache_readahead+0x1c4/0x27c)
[<c0092cf8>] (__do_page_cache_readahead+0x0/0x27c) from [<c0092fa0>] (ra_submit+0x2c/0x34)
[<c0092f74>] (ra_submit+0x0/0x34) from [<c0093294>] (ondemand_readahead+0x20c/0x21c)
[<c0093088>] (ondemand_readahead+0x0/0x21c) from [<c0093348>] (page_cache_async_readahead+0xa4/0xd8)
[<c00932a4>] (page_cache_async_readahead+0x0/0xd8) from [<c008c6f4>] (generic_file_aio_read+0x360/0x7f4)
 r8:00000000 r7:ce346c1c r6:ce88dba0 r5:0000671c r4:c04f9640
[<c008c394>] (generic_file_aio_read+0x0/0x7f4) from [<c00b7954>] (do_sync_read+0xa0/0xd8)
[<c00b78b4>] (do_sync_read+0x0/0xd8) from [<c00b84d8>] (vfs_read+0xb8/0x154)
 r6:bed53650 r5:ce88dba0 r4:00001000
[<c00b8420>] (vfs_read+0x0/0x154) from [<c00b863c>] (sys_read+0x44/0x70)
 r8:0671c000 r7:00000003 r6:00001000 r5:bed53650 r4:ce88dba0
[<c00b85f8>] (sys_read+0x0/0x70) from [<c0022f80>] (ret_fast_syscall+0x0/0x30)
 r9:ce986000 r8:c0023128 r6:bed53650 r5:00001000 r4:0013a4e0
Code: e59f0010 e1a01003 eb0d0852 e3a03000 (e5833000)
<SNIP>

Since I was testing on linux-2.6.35.9, line 849 in page_alloc.c is the same line as you have mentioned:
BUG_ON(page_zone(start_page) != page_zone(end_page))

I reproduce this crash by altering the memory banks' memory ranges such that they are not aligned to the SECTION_SIZE_BITS size.
For example, on my ARM system SECTION_SIZE_BITS is 23 (8 MB), so I change the code in arch/arm/mach-* so that the total kernel memory size in the memory banks is reduced by 1 MB, which makes the total kernel memory size no longer exactly divisible by 8 MB.

Thanks,
Kautuk.

On Wed, Aug 3, 2011 at 4:35 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Aug 02, 2011 at 05:38:31PM +0530, Kautuk Consul wrote:
>> Hi,
>>
>> In the case where the total kernel memory is not aligned to the SECTION_SIZE_BITS I see a kernel crash.
>>
>> When I copy a huge file, then the kernel crashes at the following callstack:
>>
>
> The callstack should not be 80-column formatted as this is completely mangled and unreadable without manual editing. Also, why did you not include the full error message? With it, I'd have a better idea of which bug check you hit.
>
>> Backtrace:
>> <SNIP>
>>
>> The reason for this is that the CONFIG_HOLES_IN_ZONE configuration option is not automatically enabled when SPARSEMEM or ARCH_HAS_HOLES_MEMORYMODEL are enabled. Due to this, the pfn_valid_within() macro always returns 1, due to which the BUG_ON is encountered.
>> This patch enables the CONFIG_HOLES_IN_ZONE config option if either ARCH_HAS_HOLES_MEMORYMODEL or SPARSEMEM is enabled.
>>
>> Although I tested this on an older kernel, i.e., 2.6.35.13, I see that this option has not been enabled as yet in linux-3.0, and this appears to be a logically correct change anyway with respect to pfn_valid_within() functionality.
>>
>
> There is a performance cost associated with HOLES_IN_ZONE which may be offset by memory savings, but not necessarily.
>
> If the BUG_ON you are hitting is this one, BUG_ON(page_zone(start_page) != page_zone(end_page)), then I'd be wondering why the check in move_freepages_block() was insufficient.
>
> If it's because holes are punched in the memmap then the option does need to be set.
>
> --
> Mel Gorman
> SUSE Labs
>
On Wed, Aug 03, 2011 at 05:59:03PM +0530, Kautuk Consul wrote:
> Hi Mel,
>
> Sorry for the formatting.
>
> I forgot to include the following entire backtrace:
>
> #> cp test_huge_file nfsmnt
> kernel BUG at mm/page_alloc.c:849!
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = ce9f0000
> <SNIP>
> Backtrace:
> [<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>] (move_freepages_block+0xd4/0x158)

It's still horribly mangled and pretty much unreadable, but at least we know where the bug is hitting.

> <SNIP>
>
> Since I was testing on linux-2.6.35.9, line 849 in page_alloc.c is the same line as you have mentioned:
> BUG_ON(page_zone(start_page) != page_zone(end_page))
>
> I reproduce this crash by altering the memory banks' memory ranges such that they are not aligned to the SECTION_SIZE_BITS size.

How are you altering the ranges? Are you somehow breaking the checks based on the information in struct zone that is in move_freepages_block()?

It no longer seems like a punching-hole-in-memmap problem. Can you investigate how and why the range of pages passed in to move_freepages() belong to different zones?

Thanks.
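
The check Mel refers to lives in move_freepages_block(), which rounds the incoming pfn to a pageblock boundary before handing the range to move_freepages(). A heavily simplified sketch of those 2.6.35-era paths, paraphrased from mm/page_alloc.c (free-list manipulation and error handling trimmed; consult the actual tree for the authoritative version):

```c
static int move_freepages(struct zone *zone,
			  struct page *start_page, struct page *end_page,
			  int migratetype)
{
	struct page *page;
	unsigned long order;
	int pages_moved = 0;

#ifndef CONFIG_HOLES_IN_ZONE
	/* The mm/page_alloc.c:849 BUG_ON from the report. */
	BUG_ON(page_zone(start_page) != page_zone(end_page));
#endif

	for (page = start_page; page <= end_page;) {
		/* With HOLES_IN_ZONE this really checks the memmap; without it, it is (1). */
		if (!pfn_valid_within(page_to_pfn(page))) {
			page++;
			continue;
		}
		/* Only pages sitting in the buddy allocator are moved. */
		if (!PageBuddy(page)) {
			page++;
			continue;
		}
		order = page_order(page);
		/* ... move the free page to the target migratetype free list ... */
		page += 1 << order;
		pages_moved += 1 << order;
	}
	return pages_moved;
}

static int move_freepages_block(struct zone *zone, struct page *page,
				int migratetype)
{
	unsigned long start_pfn, end_pfn;
	struct page *start_page, *end_page;

	/* Round down to the containing pageblock. */
	start_pfn = page_to_pfn(page) & ~(pageblock_nr_pages - 1);
	start_page = pfn_to_page(start_pfn);
	end_page = start_page + pageblock_nr_pages - 1;
	end_pfn = start_pfn + pageblock_nr_pages - 1;

	/* "The check in move_freepages_block()": do not cross zone boundaries. */
	if (start_pfn < zone->zone_start_pfn)
		start_page = page;
	if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
		return 0;

	return move_freepages(zone, start_page, end_page, migratetype);
}
```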
Hi Mel,

My ARM system has 2 memory banks which have the following 2 PFN ranges:
60000-62000 and 70000-7ce00.

My SECTION_SIZE_BITS is #defined to 23.

I am altering the ranges via the following kind of pseudo-code in the arch/arm/mach-*/mach-*.c file:
meminfo->bank[0].size -= (1 << 20)
meminfo->bank[1].size -= (1 << 20)

After altering the size of both memory banks, the PFN ranges now visible to the kernel are:
60000-61f00 and 70000-7cd00.

There is only one node and one ZONE_NORMAL zone on the system, and this zone accounts for both memory banks as CONFIG_SPARSEMEM is enabled in the kernel.

I put some printks in the move_freepages_block() function, compiled the kernel and then ran the following commands:
ifconfig eth0 107.109.39.102 up
mount /dev/sda1 /mnt                                       # Mounted the USB pen drive
mount -t nfs -o nolock 107.109.39.103:/home/kautuk nfsmnt  # NFS mounted an NFS share
cp test_huge_file nfsmnt/

I got the following output:
#> cp test_huge nfsmnt/
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c068a000 start_pfn=7c400 start_page=c0686000 end_page=c068dfe0 end_pfn=7c7ff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c0652000 start_pfn=7a800 start_page=c064e000 end_page=c0655fe0 end_pfn=7abff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c065a000 start_pfn=7ac00 start_page=c0656000 end_page=c065dfe0 end_pfn=7afff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c0695000 start_pfn=7c800 start_page=c068e000 end_page=c0695fe0 end_pfn=7cbff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c04f7c00 start_pfn=61c00 start_page=c04f6000 end_page=c04fdfe0 end_pfn=61fff
page_zone(start_page)=c048107c page_zone(end_page)=c0481358 page_zonenum(end_page) = 1
------------------------------------------------------------move_freepages_block_end----------------------------------------
kernel BUG at mm/page_alloc.c:849!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ce9fc000
[00000000] *pgd=7ca5a031, *pte=00000000, *ppte=00000000

As per the last log entry, we can clearly see that page_zone(start_page)=c048107c and page_zone(end_page)=c0481358, which are not equal to each other.
Since they do not match, the code in move_freepages() bugchecks because of the following BUG_ON() check:
page_zone(start_page) != page_zone(end_page)
The reason for this is that the page_zonenum(end_page) is equal to 1, which is different from the page_zonenum(start_page), which is 0.

On checking the code within page_zonenum(), I see that this code tries to retrieve the zone number from the end_page->flags.

The reason why we cannot expect the 0x61fff end_page->flags to contain a valid zone number is:
memmap_init_zone() initializes the zone number of all pages for a zone via the set_page_links() inline function.
For the end_page (whose PFN is 0x61fff), set_page_links() cannot possibly be called, as the zones are simply not aware of PFNs above 0x61f00 and below 0x70000.

The (end >= zone->zone_start_pfn + zone->spanned_pages) check in move_freepages_block() does not stop this crash from happening as both our memory banks are in the same zone and the empty space within them is accommodated into this zone via the CONFIG_SPARSEMEM config option.

When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as any other BUG_ONs in the loop in move_freepages(), as then the pfn_valid_within()/pfn_valid() function takes care of this functionality, especially in the case where the newly introduced CONFIG_HAVE_ARCH_PFN_VALID is enabled.

Thanks,
Kautuk.

On Wed, Aug 3, 2011 at 6:58 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Aug 03, 2011 at 05:59:03PM +0530, Kautuk Consul wrote:
>> Hi Mel,
>>
>> Sorry for the formatting.
>>
>> I forgot to include the following entire backtrace:
>> #> cp test_huge_file nfsmnt
>> kernel BUG at mm/page_alloc.c:849!
>> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>> pgd = ce9f0000
>> <SNIP>
>> Backtrace:
>> [<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>] (move_freepages_block+0xd4/0x158)
>
> It's still horribly mangled and pretty much unreadable, but at least we know where the bug is hitting.
>
>> <SNIP>
>>
>> Since I was testing on linux-2.6.35.9, line 849 in page_alloc.c is the same line as you have mentioned:
>> BUG_ON(page_zone(start_page) != page_zone(end_page))
>>
>> I reproduce this crash by altering the memory banks' memory ranges such that they are not aligned to the SECTION_SIZE_BITS size.
>
> How are you altering the ranges? Are you somehow breaking the checks based on the information in struct zone that is in move_freepages_block()?
>
> It no longer seems like a punching-hole-in-memmap problem. Can you investigate how and why the range of pages passed in to move_freepages() belong to different zones?
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
>
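
The failing entry in the log above can be reproduced with the same pageblock rounding that move_freepages_block() performs. A small stand-alone illustration (the 1024-page pageblock size and 4 KiB pages are assumptions matching what the reporter states later in the thread):

```c
#include <stdio.h>

/* Assumption: pageblock_nr_pages == 1024 on this platform, 4 KiB pages. */
#define PAGEBLOCK_NR_PAGES	1024UL

int main(void)
{
	unsigned long page_pfn  = 0x61c00; /* pfn handed to move_freepages_block() */
	unsigned long bank0_end = 0x61f00; /* first pfn past the shrunken bank 0 */
	unsigned long zone_end  = 0x7cd00; /* zone_start_pfn + spanned_pages */

	unsigned long start_pfn = page_pfn & ~(PAGEBLOCK_NR_PAGES - 1);
	unsigned long end_pfn   = start_pfn + PAGEBLOCK_NR_PAGES - 1;

	/* Matches the log: start_pfn=61c00 end_pfn=61fff. */
	printf("pageblock spans %#lx-%#lx\n", start_pfn, end_pfn);

	/* The zone-boundary check in move_freepages_block() still passes... */
	printf("end_pfn < zone end? %s\n", end_pfn < zone_end ? "yes" : "no");

	/* ...but the tail of the pageblock has no backing memory bank. */
	printf("pfns %#lx-%#lx lie past the bank, so their memmap is not kept\n"
	       "initialised and page_zone(end_page) reads back garbage.\n",
	       bank0_end, end_pfn);
	return 0;
}
```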
On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
> Hi Mel,
>
> My ARM system has 2 memory banks which have the following 2 PFN ranges:
> 60000-62000 and 70000-7ce00.
>
> My SECTION_SIZE_BITS is #defined to 23.
>

So bank 0 is 4 sections and bank 1 is 26 sections with the last section incomplete.

> I am altering the ranges via the following kind of pseudo-code in the arch/arm/mach-*/mach-*.c file:
> meminfo->bank[0].size -= (1 << 20)
> meminfo->bank[1].size -= (1 << 20)
>

Why are you taking 1M off each bank? I could understand aligning the banks to a section size at least.

That said, there is an assumption that pages within a MAX_ORDER-aligned block have valid memmap. If you are taking 1M off the end of bank 0, it is no longer MAX_ORDER aligned. This is the assumption move_freepages_block() is falling foul of. The problem can be avoided by ensuring that memmap is valid within MAX_ORDER-aligned ranges.

> After altering the size of both memory banks, the PFN ranges now visible to the kernel are:
> 60000-61f00 and 70000-7cd00.
>
> There is only one node and one ZONE_NORMAL zone on the system, and this zone accounts for both memory banks as CONFIG_SPARSEMEM is enabled in the kernel.
>

Ok.

> I put some printks in the move_freepages_block() function, compiled the kernel and then ran the following commands:
> ifconfig eth0 107.109.39.102 up
> mount /dev/sda1 /mnt                                       # Mounted the USB pen drive
> mount -t nfs -o nolock 107.109.39.103:/home/kautuk nfsmnt  # NFS mounted an NFS share
> cp test_huge_file nfsmnt/
>
> I got the following output:
> #> cp test_huge nfsmnt/
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c068a000 start_pfn=7c400 start_page=c0686000 end_page=c068dfe0 end_pfn=7c7ff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c0652000 start_pfn=7a800 start_page=c064e000 end_page=c0655fe0 end_pfn=7abff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c065a000 start_pfn=7ac00 start_page=c0656000 end_page=c065dfe0 end_pfn=7afff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c0695000 start_pfn=7c800 start_page=c068e000 end_page=c0695fe0 end_pfn=7cbff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c04f7c00 start_pfn=61c00 start_page=c04f6000 end_page=c04fdfe0 end_pfn=61fff
> page_zone(start_page)=c048107c page_zone(end_page)=c0481358 page_zonenum(end_page) = 1
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> kernel BUG at mm/page_alloc.c:849!
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = ce9fc000
> [00000000] *pgd=7ca5a031, *pte=00000000, *ppte=00000000
>
> As per the last log entry, we can clearly see that page_zone(start_page)=c048107c and page_zone(end_page)=c0481358, which are not equal to each other.
> Since they do not match, the code in move_freepages() bugchecks because of the following BUG_ON() check:
> page_zone(start_page) != page_zone(end_page)
> The reason for this is that the page_zonenum(end_page) is equal to 1, which is different from the page_zonenum(start_page), which is 0.
>

Because the MAX_ORDER alignment is gone.

> On checking the code within page_zonenum(), I see that this code tries to retrieve the zone number from the end_page->flags.
>

Yes. In the majority of cases a page's node and zone are stored in the page->flags.

> The reason why we cannot expect the 0x61fff end_page->flags to contain a valid zone number is:
> memmap_init_zone() initializes the zone number of all pages for a zone via the set_page_links() inline function.
> For the end_page (whose PFN is 0x61fff), set_page_links() cannot possibly be called, as the zones are simply not aware of PFNs above 0x61f00 and below 0x70000.
>

Can you ensure that the ranges passed into free_area_init_node() are MAX_ORDER aligned as this would initialise the struct pages. You may have already seen that care is taken when freeing memmap that it is aligned to MAX_ORDER in free_unused_memmap() in ARM.

> The (end >= zone->zone_start_pfn + zone->spanned_pages) check in move_freepages_block() does not stop this crash from happening as both our memory banks are in the same zone and the empty space within them is accommodated into this zone via the CONFIG_SPARSEMEM config option.
>
> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as any other BUG_ONs in the loop in move_freepages(), as then the pfn_valid_within()/pfn_valid() function takes care of this functionality, especially in the case where the newly introduced CONFIG_HAVE_ARCH_PFN_VALID is enabled.
>

This is an expensive option in terms of performance. If Russell wants to pick it up, I won't object, but I would strongly suggest that you solve this problem by ensuring that memmap is initialised on MAX_ORDER-aligned boundaries as it'll perform better.

Thanks.
Hi Mel,

Please find my comments inline to the email below.

2 general questions:
i) If an email chain such as this leads to another kernel patch for the same problem, do I need to create a new email chain for that?
ii) Sorry about my formatting problems. However, text such as backtraces and logs tends to wrap irrespective of whatever gmail settings/browser I try. Any pointers here?

Thanks,
Kautuk.

On Thu, Aug 4, 2011 at 3:39 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
>> Hi Mel,
>>
>> My ARM system has 2 memory banks which have the following 2 PFN ranges:
>> 60000-62000 and 70000-7ce00.
>>
>> My SECTION_SIZE_BITS is #defined to 23.
>>
>
> So bank 0 is 4 sections and bank 1 is 26 sections with the last section incomplete.
>
>> I am altering the ranges via the following kind of pseudo-code in the arch/arm/mach-*/mach-*.c file:
>> meminfo->bank[0].size -= (1 << 20)
>> meminfo->bank[1].size -= (1 << 20)
>>
>
> Why are you taking 1M off each bank? I could understand aligning the banks to a section size at least.

The reason I am doing this is that one of our embedded boards actually has this problem, due to which we see this kernel crash. I am merely reproducing this problem by performing this step.

>
> That said, there is an assumption that pages within a MAX_ORDER-aligned block have valid memmap. If you are taking 1M off the end of bank 0, it is no longer MAX_ORDER aligned. This is the assumption move_freepages_block() is falling foul of. The problem can be avoided by ensuring that memmap is valid within MAX_ORDER-aligned ranges.
>
>> After altering the size of both memory banks, the PFN ranges now visible to the kernel are:
>> 60000-61f00 and 70000-7cd00.
>>
>> There is only one node and one ZONE_NORMAL zone on the system, and this zone accounts for both memory banks as CONFIG_SPARSEMEM is enabled in the kernel.
>>
>
> Ok.
>
>> I put some printks in the move_freepages_block() function, compiled the kernel and then ran the following commands:
>> ifconfig eth0 107.109.39.102 up
>> mount /dev/sda1 /mnt                                       # Mounted the USB pen drive
>> mount -t nfs -o nolock 107.109.39.103:/home/kautuk nfsmnt  # NFS mounted an NFS share
>> cp test_huge_file nfsmnt/
>>
>> I got the following output:
>> #> cp test_huge nfsmnt/
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c068a000 start_pfn=7c400 start_page=c0686000 end_page=c068dfe0 end_pfn=7c7ff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c0652000 start_pfn=7a800 start_page=c064e000 end_page=c0655fe0 end_pfn=7abff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c065a000 start_pfn=7ac00 start_page=c0656000 end_page=c065dfe0 end_pfn=7afff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c0695000 start_pfn=7c800 start_page=c068e000 end_page=c0695fe0 end_pfn=7cbff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c04f7c00 start_pfn=61c00 start_page=c04f6000 end_page=c04fdfe0 end_pfn=61fff
>> page_zone(start_page)=c048107c page_zone(end_page)=c0481358 page_zonenum(end_page) = 1
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> kernel BUG at mm/page_alloc.c:849!
>> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>> pgd = ce9fc000
>> [00000000] *pgd=7ca5a031, *pte=00000000, *ppte=00000000
>>
>> As per the last log entry, we can clearly see that page_zone(start_page)=c048107c and page_zone(end_page)=c0481358, which are not equal to each other.
>> Since they do not match, the code in move_freepages() bugchecks because of the following BUG_ON() check:
>> page_zone(start_page) != page_zone(end_page)
>> The reason for this is that the page_zonenum(end_page) is equal to 1, which is different from the page_zonenum(start_page), which is 0.
>>
>
> Because the MAX_ORDER alignment is gone.
>
>> On checking the code within page_zonenum(), I see that this code tries to retrieve the zone number from the end_page->flags.
>>
>
> Yes. In the majority of cases a page's node and zone are stored in the page->flags.
>
>> The reason why we cannot expect the 0x61fff end_page->flags to contain a valid zone number is:
>> memmap_init_zone() initializes the zone number of all pages for a zone via the set_page_links() inline function.
>> For the end_page (whose PFN is 0x61fff), set_page_links() cannot possibly be called, as the zones are simply not aware of PFNs above 0x61f00 and below 0x70000.
>>
>
> Can you ensure that the ranges passed into free_area_init_node() are MAX_ORDER aligned as this would initialise the struct pages. You may have already seen that care is taken when freeing memmap that it is aligned to MAX_ORDER in free_unused_memmap() in ARM.
>

Will this work? My doubt arises from the fact that there is only one zone on the entire system which contains both memory banks.
The crash arises at the PFN 0x61fff, which will not be covered by such a check, as this function will try to act on the entire zone, which is the PFN range 60000-7cd00, including the holes within, as all of this RAM falls into the same node and zone.
(Please correct me if I am wrong about this.)

I tried aligning the end parameter in the memory_present() function, which is called separately for each memory bank.
I tried the following change in memory_present() as well as mminit_validate_memmodel_limits():
end &= ~(pageblock_nr_pages-1);
But, in this case, the board simply does not boot up. I think that will then require some change in the arch/arm code, which I think would be an arch-specific solution to a possibly generic problem.

>> The (end >= zone->zone_start_pfn + zone->spanned_pages) check in move_freepages_block() does not stop this crash from happening as both our memory banks are in the same zone and the empty space within them is accommodated into this zone via the CONFIG_SPARSEMEM config option.
>>
>> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as any other BUG_ONs in the loop in move_freepages(), as then the pfn_valid_within()/pfn_valid() function takes care of this functionality, especially in the case where the newly introduced CONFIG_HAVE_ARCH_PFN_VALID is enabled.
>>
>
> This is an expensive option in terms of performance. If Russell wants to pick it up, I won't object, but I would strongly suggest that you solve this problem by ensuring that memmap is initialised on MAX_ORDER-aligned boundaries as it'll perform better.
>

I couldn't really locate a method in the kernel wherein we can validate a pageblock (1024 pages for my platform) with respect to the memory banks on that system.

How about this: we implement an arch_is_valid_pageblock() function, controlled by a new config option CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK.
This arch function will simply check whether this pageblock is valid or not, in terms of arch-specific memory banks or by using the memblock APIs, depending on CONFIG_HAVE_MEMBLOCK.
We can modify the memmap_init_zone() function so that an outer loop works in units of pageblocks, thus enabling us to avoid invalid pageblocks (see the sketch following this mail).
We then wouldn't need to go for the HOLES_IN_ZONE option, as all PFN ranges will be aligned to pageblock_nr_pages, thus removing the possibility of this crash in move_freepages().

> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
>
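
The proposal above is only described in prose. Purely as an illustration of what is being suggested (these interfaces do not exist in the kernel; arch_is_valid_pageblock() and CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK are hypothetical names from the proposal, and Mel rejects the idea below as overkill), a rough sketch might look like this:

```c
/*
 * Hypothetical sketch of the proposal above -- NOT existing kernel code.
 * arch_is_valid_pageblock() and CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK are
 * invented names; the loop below is a stripped-down stand-in for the
 * real memmap initialiser.
 */
#ifdef CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK
/* Arch decides, e.g. by walking its memory banks or memblock regions. */
extern int arch_is_valid_pageblock(unsigned long start_pfn);
#else
#define arch_is_valid_pageblock(start_pfn)	(1)
#endif

static void memmap_init_zone_by_pageblock(unsigned long start_pfn,
					  unsigned long end_pfn,
					  unsigned long zone, int nid)
{
	unsigned long block, pfn, block_end;

	/* Outer loop walks the zone one pageblock at a time... */
	for (block = start_pfn; block < end_pfn; block += pageblock_nr_pages) {
		/* ...and skips pageblocks the architecture reports as holes. */
		if (!arch_is_valid_pageblock(block))
			continue;

		block_end = min(block + pageblock_nr_pages, end_pfn);
		for (pfn = block; pfn < block_end; pfn++) {
			/* initialise the struct page: set_page_links(), etc. */
		}
	}
}
```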
On Fri, Aug 05, 2011 at 11:27:21AM +0530, Kautuk Consul wrote:
> Hi Mel,
>
> Please find my comments inline to the email below.
>
> 2 general questions:
> i) If an email chain such as this leads to another kernel patch for the same problem, do I need to create a new email chain for that?

Yes. I believe it's easier on the maintainer if a new thread is started with the new patch leader summarising relevant information from the discussion. A happy maintainer makes the day easier.

> ii) Sorry about my formatting problems. However, text such as backtraces and logs tends to wrap irrespective of whatever gmail settings/browser I try. Any pointers here?
>

I don't use gmail so I do not have any suggestions other than using a different mailer. There are no suggestions in Documentation/email-clients.txt on how to deal with gmail.

> On Thu, Aug 4, 2011 at 3:39 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
> >> Hi Mel,
> >>
> >> My ARM system has 2 memory banks which have the following 2 PFN ranges:
> >> 60000-62000 and 70000-7ce00.
> >>
> >> My SECTION_SIZE_BITS is #defined to 23.
> >>
> >
> > So bank 0 is 4 sections and bank 1 is 26 sections with the last section incomplete.
> >
> >> I am altering the ranges via the following kind of pseudo-code in the arch/arm/mach-*/mach-*.c file:
> >> meminfo->bank[0].size -= (1 << 20)
> >> meminfo->bank[1].size -= (1 << 20)
> >>
> >
> > Why are you taking 1M off each bank? I could understand aligning the banks to a section size at least.
>
> The reason I am doing this is that one of our embedded boards actually has this problem, due to which we see this kernel crash. I am merely reproducing this problem by performing this step.
>

Ah, that makes sense.

> >> <SNIP>
> >> The reason why we cannot expect the 0x61fff end_page->flags to contain a valid zone number is:
> >> memmap_init_zone() initializes the zone number of all pages for a zone via the set_page_links() inline function.
> >> For the end_page (whose PFN is 0x61fff), set_page_links() cannot possibly be called, as the zones are simply not aware of PFNs above 0x61f00 and below 0x70000.
> >>
> >
> > Can you ensure that the ranges passed into free_area_init_node() are MAX_ORDER aligned as this would initialise the struct pages. You may have already seen that care is taken when freeing memmap that it is aligned to MAX_ORDER in free_unused_memmap() in ARM.
> >
>
> Will this work? My doubt arises from the fact that there is only one zone on the entire system which contains both memory banks.

That is a common situation. It's why present_pages and spanned_pages in a zone can differ. As long as valid memmap is MAX_ORDER-aligned, it's fine.

> The crash arises at the PFN 0x61fff, which will not be covered by such a check, as this function

No, it won't be covered by the range check. However, the memmap will be initialised, so even though the page is outside a valid bank of memory, it'll still resolve to the correct zone. The struct page will be marked Reserved so it'll never be used.

> will try to act on the entire zone, which is the PFN range 60000-7cd00, including the holes within, as all of this RAM falls into the same node and zone.
> (Please correct me if I am wrong about this.)
>
> I tried aligning the end parameter in the memory_present() function, which is called separately for each memory bank.
> I tried the following change in memory_present() as well as mminit_validate_memmodel_limits():
> end &= ~(pageblock_nr_pages-1);
> But, in this case, the board simply does not boot up. I think that will then require some change in the arch/arm code, which I think would be an arch-specific solution to a possibly generic problem.
>

I do not believe this is a generic problem. Machines have holes in the zone all the time and this bug does not trigger. mminit_validate_memmodel_limits() is the wrong place to make a change. Look more towards where free_area_init_node() gets called to initialise memmap.

> >> The (end >= zone->zone_start_pfn + zone->spanned_pages) check in move_freepages_block() does not stop this crash from happening as both our memory banks are in the same zone and the empty space within them is accommodated into this zone via the CONFIG_SPARSEMEM config option.
> >>
> >> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as any other BUG_ONs in the loop in move_freepages(), as then the pfn_valid_within()/pfn_valid() function takes care of this functionality, especially in the case where the newly introduced CONFIG_HAVE_ARCH_PFN_VALID is enabled.
> >>
> >
> > This is an expensive option in terms of performance. If Russell wants to pick it up, I won't object, but I would strongly suggest that you solve this problem by ensuring that memmap is initialised on MAX_ORDER-aligned boundaries as it'll perform better.
> >
>
> I couldn't really locate a method in the kernel wherein we can validate a pageblock (1024 pages for my platform) with respect to the memory banks on that system.
>

I suspect what you're looking for is somewhere in arch/arm/mm/init.c. I'm afraid I didn't dig through the ARM memory initialisation code to see where it could be done, but if it was me, I'd be looking at how free_area_init_node() is called.

> How about this: we implement an arch_is_valid_pageblock() function, controlled by a new config option CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK.
> This arch function will simply check whether this pageblock is valid or not, in terms of arch-specific memory banks or by using the memblock APIs, depending on CONFIG_HAVE_MEMBLOCK.
> We can modify the memmap_init_zone() function so that an outer loop works in units of pageblocks, thus enabling us to avoid invalid pageblocks.

That seems like massive overkill for what should be an alignment problem when initialising memmap. It's adding complexity that is similar to HOLES_IN_ZONE with very little gain.

When it gets down to it, I'd even prefer deleting the BUG_ON and relying on the PageBuddy check over such a solution. However, the BUG_ON is there because alignment to MAX_ORDER is expected, so it is a valid sanity check. There would need to be good evidence that initialising memmap to MAX_ORDER alignment was somehow impossible.
Ok, I analyzed the code and it seems that this alignment problem has been solved by the changes made to the free_unused_memmap() code in arch/arm/mm/init.c (see the sketch after the quoted mail below). I backported those changes to free_unused_memmap_node() in linux-2.6.35.9 and I don't see any more crashes.

This solves my problem. Thanks for all the help.

On Fri, Aug 5, 2011 at 2:17 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Aug 05, 2011 at 11:27:21AM +0530, Kautuk Consul wrote:
>> Hi Mel,
>>
>> Please find my comments inline to the email below.
>>
>> 2 general questions:
>> i) If an email chain such as this leads to another kernel patch for the same problem, do I need to create a new email chain for that?
>
> Yes. I believe it's easier on the maintainer if a new thread is started with the new patch leader summarising relevant information from the discussion. A happy maintainer makes the day easier.
>
>> ii) Sorry about my formatting problems. However, text such as backtraces and logs tends to wrap irrespective of whatever gmail settings/browser I try. Any pointers here?
>>
>
> I don't use gmail so I do not have any suggestions other than using a different mailer. There are no suggestions in Documentation/email-clients.txt on how to deal with gmail.
>
>> On Thu, Aug 4, 2011 at 3:39 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
>> >> Hi Mel,
>> >>
>> >> My ARM system has 2 memory banks which have the following 2 PFN ranges:
>> >> 60000-62000 and 70000-7ce00.
>> >>
>> >> My SECTION_SIZE_BITS is #defined to 23.
>> >>
>> >
>> > So bank 0 is 4 sections and bank 1 is 26 sections with the last section incomplete.
>> >
>> >> I am altering the ranges via the following kind of pseudo-code in the arch/arm/mach-*/mach-*.c file:
>> >> meminfo->bank[0].size -= (1 << 20)
>> >> meminfo->bank[1].size -= (1 << 20)
>> >>
>> >
>> > Why are you taking 1M off each bank? I could understand aligning the banks to a section size at least.
>>
>> The reason I am doing this is that one of our embedded boards actually has this problem, due to which we see this kernel crash. I am merely reproducing this problem by performing this step.
>>
>
> Ah, that makes sense.
>
>> >> <SNIP>
>> >> The reason why we cannot expect the 0x61fff end_page->flags to contain a valid zone number is:
>> >> memmap_init_zone() initializes the zone number of all pages for a zone via the set_page_links() inline function.
>> >> For the end_page (whose PFN is 0x61fff), set_page_links() cannot possibly be called, as the zones are simply not aware of PFNs above 0x61f00 and below 0x70000.
>> >>
>> >
>> > Can you ensure that the ranges passed into free_area_init_node() are MAX_ORDER aligned as this would initialise the struct pages. You may have already seen that care is taken when freeing memmap that it is aligned to MAX_ORDER in free_unused_memmap() in ARM.
>> >
>>
>> Will this work? My doubt arises from the fact that there is only one zone on the entire system which contains both memory banks.
>
> That is a common situation. It's why present_pages and spanned_pages in a zone can differ. As long as valid memmap is MAX_ORDER-aligned, it's fine.
>
>> The crash arises at the PFN 0x61fff, which will not be covered by such a check, as this function
>
> No, it won't be covered by the range check. However, the memmap will be initialised, so even though the page is outside a valid bank of memory, it'll still resolve to the correct zone. The struct page will be marked Reserved so it'll never be used.
>
>> will try to act on the entire zone, which is the PFN range 60000-7cd00, including the holes within, as all of this RAM falls into the same node and zone.
>> (Please correct me if I am wrong about this.)
>>
>> I tried aligning the end parameter in the memory_present() function, which is called separately for each memory bank.
>> I tried the following change in memory_present() as well as mminit_validate_memmodel_limits():
>> end &= ~(pageblock_nr_pages-1);
>> But, in this case, the board simply does not boot up. I think that will then require some change in the arch/arm code, which I think would be an arch-specific solution to a possibly generic problem.
>>
>
> I do not believe this is a generic problem. Machines have holes in the zone all the time and this bug does not trigger. mminit_validate_memmodel_limits() is the wrong place to make a change. Look more towards where free_area_init_node() gets called to initialise memmap.
>
>> >> The (end >= zone->zone_start_pfn + zone->spanned_pages) check in move_freepages_block() does not stop this crash from happening as both our memory banks are in the same zone and the empty space within them is accommodated into this zone via the CONFIG_SPARSEMEM config option.
>> >>
>> >> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as any other BUG_ONs in the loop in move_freepages(), as then the pfn_valid_within()/pfn_valid() function takes care of this functionality, especially in the case where the newly introduced CONFIG_HAVE_ARCH_PFN_VALID is enabled.
>> >>
>> >
>> > This is an expensive option in terms of performance. If Russell wants to pick it up, I won't object, but I would strongly suggest that you solve this problem by ensuring that memmap is initialised on MAX_ORDER-aligned boundaries as it'll perform better.
>> >
>>
>> I couldn't really locate a method in the kernel wherein we can validate a pageblock (1024 pages for my platform) with respect to the memory banks on that system.
>>
>
> I suspect what you're looking for is somewhere in arch/arm/mm/init.c. I'm afraid I didn't dig through the ARM memory initialisation code to see where it could be done, but if it was me, I'd be looking at how free_area_init_node() is called.
>
>> How about this: we implement an arch_is_valid_pageblock() function, controlled by a new config option CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK.
>> This arch function will simply check whether this pageblock is valid or not, in terms of arch-specific memory banks or by using the memblock APIs, depending on CONFIG_HAVE_MEMBLOCK.
>> We can modify the memmap_init_zone() function so that an outer loop works in units of pageblocks, thus enabling us to avoid invalid pageblocks.
>
> That seems like massive overkill for what should be an alignment problem when initialising memmap. It's adding complexity that is similar to HOLES_IN_ZONE with very little gain.
>
> When it gets down to it, I'd even prefer deleting the BUG_ON and relying on the PageBuddy check over such a solution. However, the BUG_ON is there because alignment to MAX_ORDER is expected, so it is a valid sanity check. There would need to be good evidence that initialising memmap to MAX_ORDER alignment was somehow impossible.
>
> --
> Mel Gorman
> SUSE Labs
>
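
For readers wanting the shape of the fix Kautuk refers to: later versions of arch/arm/mm/init.c round the end of each memory bank up to a MAX_ORDER boundary before freeing the "unused" memmap between banks, so memmap stays valid across every MAX_ORDER-aligned block. A hedged sketch of that rounding (paraphrased; the meminfo/bank helpers and exact function layout are assumptions about the ARM code of that era, so verify against the tree you are backporting to):

```c
/*
 * Sketch of the MAX_ORDER rounding in ARM's free_unused_memmap()
 * (paraphrased from later arch/arm/mm/init.c; helper names such as
 * bank_pfn_start()/bank_pfn_end() are assumed from the meminfo API).
 */
static void __init free_unused_memmap(struct meminfo *mi)
{
	unsigned long bank_start, prev_bank_end = 0;
	unsigned int i;

	/* Relies on the banks being sorted in address order. */
	for_each_bank(i, mi) {
		struct membank *bank = &mi->bank[i];

		bank_start = bank_pfn_start(bank);

		/* Free the memmap covering the hole before this bank, if any. */
		if (prev_bank_end && prev_bank_end < bank_start)
			free_memmap(prev_bank_end, bank_start);

		/*
		 * The point of the fix discussed in this thread: round the
		 * bank end up to MAX_ORDER_NR_PAGES so the memmap (and its
		 * zone links) stays initialised to the end of the MAX_ORDER
		 * block, which is what move_freepages_block() assumes.
		 */
		prev_bank_end = ALIGN(bank_pfn_end(bank), MAX_ORDER_NR_PAGES);
	}
}
```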
--- a/arch/arm/Kconfig	2011-08-02 16:51:32.109207996 +0530
+++ b/arch/arm/Kconfig	2011-08-02 16:56:19.036207994 +0530
@@ -1508,6 +1508,9 @@ config ARCH_SELECT_MEMORY_MODEL
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
 
+config HOLES_IN_ZONE
+	def_bool ARCH_HAS_HOLES_MEMORYMODEL || SPARSEMEM
+
 config HIGHMEM
 	bool "High Memory Support"
 	depends on MMU
Hi,

In the case where the total kernel memory is not aligned to the SECTION_SIZE_BITS I see a kernel crash.

When I copy a huge file, then the kernel crashes at the following callstack:

Backtrace:
[<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>] (move_freepages_block+0xd4/0x158)
[<c008e7dc>] (move_freepages_block+0x0/0x158) from [<c008eb10>] (__rmqueue+0x1dc/0x32c)
 r8:c0481120 r7:c048107c r6:00000003 r5:00000001 r4:c04f8200 r3:00000000
[<c008e934>] (__rmqueue+0x0/0x32c) from [<c00906f8>] (get_page_from_freelist+0x12c/0x530)
[<c00905cc>] (get_page_from_freelist+0x0/0x530) from [<c0090bec>] (__alloc_pages_nodemask+0xf0/0x544)
[<c0090afc>] (__alloc_pages_nodemask+0x0/0x544) from [<c00b4da4>] (cache_alloc_refill+0x2d0/0x654)
[<c00b4ad4>] (cache_alloc_refill+0x0/0x654) from [<c00b5258>] (kmem_cache_alloc+0x58/0x9c)
[<c00b5200>] (kmem_cache_alloc+0x0/0x9c) from [<c01f0154>] (radix_tree_preload+0x58/0xbc)
 r7:00006741 r6:000000d0 r5:c04a98a0 r4:ce986000
[<c01f00fc>] (radix_tree_preload+0x0/0xbc) from [<c008ac94>] (add_to_page_cache_locked+0x20/0x1c4)
 r6:ce987d20 r5:ce346c1c r4:c04f8600 r3:000000d0
[<c008ac74>] (add_to_page_cache_locked+0x0/0x1c4) from [<c008ae84>] (add_to_page_cache_lru+0x4c/0x7c)
 r8:00000020 r7:ce7402a0 r6:ce987d20 r5:00000005 r4:c04f8600 r3:000000d0
[<c008ae38>] (add_to_page_cache_lru+0x0/0x7c) from [<c00e8428>] (mpage_readpages+0x7c/0x108)
 r5:00000005 r4:c04f8600
[<c00e83ac>] (mpage_readpages+0x0/0x108) from [<c010f450>] (fat_readpages+0x20/0x28)
[<c010f430>] (fat_readpages+0x0/0x28) from [<c0092ebc>] (__do_page_cache_readahead+0x1c4/0x27c)
[<c0092cf8>] (__do_page_cache_readahead+0x0/0x27c) from [<c0092fa0>] (ra_submit+0x2c/0x34)
[<c0092f74>] (ra_submit+0x0/0x34) from [<c0093294>] (ondemand_readahead+0x20c/0x21c)
[<c0093088>] (ondemand_readahead+0x0/0x21c) from [<c0093348>] (page_cache_async_readahead+0xa4/0xd8)
[<c00932a4>] (page_cache_async_readahead+0x0/0xd8) from [<c008c6f4>] (generic_file_aio_read+0x360/0x7f4)
 r8:00000000 r7:ce346c1c r6:ce88dba0 r5:0000671c r4:c04f9640
[<c008c394>] (generic_file_aio_read+0x0/0x7f4) from [<c00b7954>] (do_sync_read+0xa0/0xd8)
[<c00b78b4>] (do_sync_read+0x0/0xd8) from [<c00b84d8>] (vfs_read+0xb8/0x154)
 r6:bed53650 r5:ce88dba0 r4:00001000
[<c00b8420>] (vfs_read+0x0/0x154) from [<c00b863c>] (sys_read+0x44/0x70)
 r8:0671c000 r7:00000003 r6:00001000 r5:bed53650 r4:ce88dba0
[<c00b85f8>] (sys_read+0x0/0x70) from [<c0022f80>] (ret_fast_syscall+0x0/0x30)
 r9:ce986000 r8:c0023128 r6:bed53650 r5:00001000 r4:0013a4e0
Code: e59f0010 e1a01003 eb0d0852 e3a03000 (e5833000)

The reason for this is that the CONFIG_HOLES_IN_ZONE configuration option is not automatically enabled when SPARSEMEM or ARCH_HAS_HOLES_MEMORYMODEL are enabled. Due to this, the pfn_valid_within() macro always returns 1, due to which the BUG_ON is encountered.
This patch enables the CONFIG_HOLES_IN_ZONE config option if either ARCH_HAS_HOLES_MEMORYMODEL or SPARSEMEM is enabled.

Although I tested this on an older kernel, i.e., 2.6.35.13, I see that this option has not been enabled as yet in linux-3.0, and this appears to be a logically correct change anyway with respect to pfn_valid_within() functionality.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
---

Thanks,
Kautuk.