
bad page flags booting 32bit dom0 on 64bit hypervisor using dom0_mem (kernel >=4.2)

Message ID 5732E89A.1040501@canonical.com (mailing list archive)
State New, archived

Commit Message

Stefan Bader May 11, 2016, 8:08 a.m. UTC
On 02.05.2016 16:24, Stefan Bader wrote:
> On 02.05.2016 13:41, Juergen Gross wrote:
>> On 02/05/16 12:47, Stefan Bader wrote:
>>> I recently tried to boot a 32bit dom0 on a 64bit Xen host which I configured to
>>> run with a limited, fixed amount of memory for dom0. It seems that somewhere
>>> between kernel versions 3.19 and 4.2 (sorry, that is still a wide range) the
>>> Linux kernel started to report bad page flags for a range of pages (which seem
>>> to be around the end of the guest pfn range). For a 4.2 kernel that was easily
>>> missed as the boot finished ok and dom0 was accessible. However, starting with
>>> 4.4 (tested 4.5 and a 4.6-rc) the serial console output freezes after some of
>>> those bad page flag messages and then (unfortunately without any further
>>> helpful output) the host reboots (I assume there is a panic that triggers a reset).
>>>
>>> I suspect the problem is more of a kernel-side one. It is possible to
>>> influence things by varying dom0_mem=#,max:#. 512M seems ok; 1024M, 2048M,
>>> and 3072M cause bad page flags starting around kernel 4.2 and reboots around
>>> 4.4. Then 4096M and not clamping dom0 memory at all seem to be ok again
>>> (though not limiting dom0 memory seems to cause trouble on 32bit dom0 later
>>> when a domU tries to balloon memory, but I think that is a different problem).
>>>
>>> I have not seen this on a 64bit dom0. Below is an example of those bad page
>>> flag errors. It looks like a page marked as reserved. Initially I wondered
>>> whether this could be a problem of not clearing page flags when moving mappings
>>> to match the e820, but I never looked into i386 memory setup in that much
>>> detail. So I am posting this, hoping that someone may have an idea from the
>>> details about where to look next. PAE is enabled there. Usually it's the bpf
>>> init that gets hit, but that is likely just because it does the first vmallocs.
>>
>> Could you please post the kernel config and the Xen and dom0 boot parameters?
>> I'm quite sure this is not a common problem, as there are standard tests
>> running for each kernel version, including 32 bit dom0 with limited
>> memory size.
> 
> Hi Jürgen,
> 
> sure. Though by doing that I realized where I actually messed the whole thing
> up: I got the max limit syntax completely wrong. :( Instead of the correct
> "dom0_mem=1024M,max:1024M" I was using "dom0_mem=1024M:max=1024M", which I guess
> is like not having max set at all. Not sure whether that is a valid use case.
> 
> When I actually get the dom0_mem argument right, there are no bad page flag
> errors even in 4.4 with a 1024M limit. I was at least consistent in my
> mis-configuration, so doing the same stupid thing on 64bit seems to be handled
> more gracefully.
> 
> Likely a false alarm. But at least cut&pasting the config into the mail made
> me spot the problem...
> 

Ok, thinking that "dom0_mem=x" (without a max or min) is still a valid case, I
went ahead and did a bisect for when the bad page flag issue started. I ended up at:

  92923ca "mm: meminit: only set page reserved in the memblock region"

And with a few more printks in the new functions I finally realized why this
goes wrong. The new reserve_bootmem_region() takes start and end addresses as
unsigned long, which is only 32 bits wide on 32bit and so cannot hold physical
addresses at or above 4GB.
For Xen dom0 the problem is just easier to trigger. When dom0 memory is limited
to a small size but allowed to balloon up, the additional system memory is put
into reserved regions.
In my case (a host with 8G of memory and, say, 1G of initial dom0 memory) this
created (apart from others) one reserved region which started at 4GB and covered
the remaining 4G of host memory. reserve_bootmem_region() got that as 0-4G due
to the truncating conversion to unsigned long, which basically marked *all*
memory below 4G as reserved.
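To make the truncation concrete, here is a standalone sketch (plain userspace C,
not the actual kernel code; the region boundaries are made-up stand-ins for that
4GB-8GB region, and the exact values depend on the host's e820 map):

#include <stdio.h>
#include <inttypes.h>
#include <stdint.h>

int main(void)
{
	/* 64-bit physical addresses, as a PAE phys_addr_t would hold them. */
	uint64_t start = 0x100000000ULL;	/* 4 GiB */
	uint64_t end   = 0x1fffff000ULL;	/* just below 8 GiB (stand-in) */

	/* What 32-bit unsigned long parameters actually receive: only the
	 * low 32 bits of each address survive the implicit conversion. */
	uint32_t trunc_start = (uint32_t)start;	/* 0x0 */
	uint32_t trunc_end   = (uint32_t)end;	/* 0xfffff000, i.e. ~4 GiB */

	printf("region %#" PRIx64 "-%#" PRIx64 " arrives as %#x-%#x\n",
	       start, end, trunc_start, trunc_end);
	return 0;
}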
The fix is relatively simple: just use phys_addr_t for start and end. I tested
this on 4.2 and 4.4 kernels. Both now boot without errors, and the 4.4 kernel no
longer crashes. This may still not be 100% safe when running on very large
memory systems (if I did not get the math wrong, above 16T), but it is at least
some improvement...
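For reference, a minimal sketch of where that 16T figure comes from (assuming
4 KiB pages; start_pfn/end_pfn remain unsigned long inside the function, so a
32bit kernel can only hold pfns up to 2^32 - 1):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t max_pfns  = 1ULL << 32;		/* 2^32 representable pfns */
	uint64_t page_size = 4096;			/* 4 KiB */
	uint64_t limit     = max_pfns * page_size;	/* 2^44 bytes */

	/* 2^44 bytes shifted down by 40 gives the size in TiB. */
	printf("pfn-addressable physical memory: %llu TiB\n",
	       (unsigned long long)(limit >> 40));	/* prints: 16 */
	return 0;
}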

-Stefan

Patch

From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Tue, 10 May 2016 19:05:16 +0200
Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments

Since 92923ca the reserved bit is set on reserved memblock regions.
However, the start and end addresses are passed as unsigned long, which
is only 32 bits wide on i386, so it can end up marking the wrong pages
reserved for ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with its initial
memory set to a value below 4G but allowed to balloon in more memory
(dom0_mem=1024M for example). This defines a reserved bootmem region
for the additional memory (for example, on an 8GB system there was a
reserved region covering the 4GB-8GB range). But since the addresses
were passed as unsigned long, this actually marked all pages from 0 to
4GB as reserved.

Fixes: 92923ca ("mm: meminit: only set page reserved in the memblock region")
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@kernel.org> # 4.2+
---
 include/linux/mm.h | 2 +-
 mm/page_alloc.c    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b56ff72..4c1ff62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1715,7 +1715,7 @@  extern void free_highmem_page(struct page *page);
 extern void adjust_managed_page_count(struct page *page, long count);
 extern void mem_init_print_info(const char *str);
 
-extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
 
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void __free_reserved_page(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c69531a..eb66f89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -951,7 +951,7 @@  static inline void init_reserved_page(unsigned long pfn)
  * marks the pages PageReserved. The remaining valid pages are later
  * sent to the buddy page allocator.
  */
-void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
+void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long end_pfn = PFN_UP(end);
-- 
1.9.1