Message ID | 20210130071025.65258-9-chenzhou10@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | support reserving crashkernel above 4G on arm64 kdump | expand |
On Sat, Jan 30, 2021 at 03:10:22PM +0800, Chen Zhou wrote: > There are following issues in arm64 kdump: > 1. We use crashkernel=X to reserve crashkernel below 4G, which > will fail when there is no enough low memory. > 2. If reserving crashkernel above 4G, in this case, crash dump > kernel will boot failure because there is no low memory available > for allocation. > > To solve these issues, change the behavior of crashkernel=X and > introduce crashkernel=X,[high,low]. crashkernel=X tries low allocation > in DMA zone, and fall back to high allocation if it fails. > We can also use "crashkernel=X,high" to select a region above DMA zone, > which also tries to allocate at least 256M in DMA zone automatically. > "crashkernel=Y,low" can be used to allocate specified size low memory. > > Another minor change, there may be two regions reserved for crash > dump kernel, in order to distinct from the high region and make no > effect to the use of existing kexec-tools, rename the low region as > "Crash kernel (low)". I think we discussed this but I don't remember the conclusion. Is this only renamed conditionally so that we don't break current kexec-tools? IOW, assuming that the full crashkernel region is reserved below 4GB, does the "(low)" suffix still appear or it's only if a high region is additionally reserved? > diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h > index 3f6ecae0bc68..f0caed0cb5e1 100644 > --- a/arch/arm64/include/asm/kexec.h > +++ b/arch/arm64/include/asm/kexec.h > @@ -96,6 +96,10 @@ static inline void crash_prepare_suspend(void) {} > static inline void crash_post_resume(void) {} > #endif > > +#ifdef CONFIG_KEXEC_CORE > +extern void __init reserve_crashkernel(void); > +#endif Why not have this in some generic header? > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c > index c18aacde8bb0..69c592c546de 100644 > --- a/arch/arm64/kernel/setup.c > +++ b/arch/arm64/kernel/setup.c > @@ -238,7 +238,18 @@ static void __init request_standard_resources(void) > kernel_data.end <= res->end) > request_resource(res, &kernel_data); > #ifdef CONFIG_KEXEC_CORE > - /* Userspace will find "Crash kernel" region in /proc/iomem. */ > + /* > + * Userspace will find "Crash kernel" or "Crash kernel (low)" > + * region in /proc/iomem. > + * In order to distinct from the high region and make no effect > + * to the use of existing kexec-tools, rename the low region as > + * "Crash kernel (low)". > + */ > + if (crashk_low_res.end && crashk_low_res.start >= res->start && > + crashk_low_res.end <= res->end) { > + crashk_low_res.name = "Crash kernel (low)"; > + request_resource(res, &crashk_low_res); > + } > if (crashk_res.end && crashk_res.start >= res->start && > crashk_res.end <= res->end) > request_resource(res, &crashk_res); My reading of the new generic reserve_crashkernel() is that crashk_low_res will only be populated if crask_res is above 4GB. If that's correct, I'm fine with the renaming here since current systems would not get a renamed low reservation (as long as they don't change the kernel cmdline). > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 912f64f505f7..d20f5c444ebf 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -35,6 +35,7 @@ > #include <asm/fixmap.h> > #include <asm/kasan.h> > #include <asm/kernel-pgtable.h> > +#include <asm/kexec.h> > #include <asm/memory.h> > #include <asm/numa.h> > #include <asm/sections.h> > @@ -61,66 +62,11 @@ EXPORT_SYMBOL(memstart_addr); > */ > phys_addr_t arm64_dma_phys_limit __ro_after_init; > > -#ifdef CONFIG_KEXEC_CORE > -/* > - * reserve_crashkernel() - reserves memory for crash kernel > - * > - * This function reserves memory area given in "crashkernel=" kernel command > - * line parameter. The memory reserved is used by dump capture kernel when > - * primary kernel is crashing. > - */ > +#ifndef CONFIG_KEXEC_CORE > static void __init reserve_crashkernel(void) > { [...] > } > +#endif Can we not have the dummy reserve_crashkernel() in the generic code as well and avoid the #ifndef here? > #ifdef CONFIG_CRASH_DUMP > static int __init early_init_dt_scan_elfcorehdr(unsigned long node, > @@ -446,6 +392,14 @@ void __init bootmem_init(void) > * reserved, so do it here. > */ > reserve_crashkernel(); > +#ifdef CONFIG_KEXEC_CORE > + /* > + * The low region is intended to be used for crash dump kernel devices, > + * just mark the low region as "nomap" simply. > + */ > + if (crashk_low_res.end) > + memblock_mark_nomap(crashk_low_res.start, resource_size(&crashk_low_res)); > +#endif Do we do something similar for crashk_res? Also, I can see we call crash_exclude_mem_range() only for crashk_res. Do we need to do this for crashk_low_res as well?
On 2021/2/25 0:04, Catalin Marinas wrote: > On Sat, Jan 30, 2021 at 03:10:22PM +0800, Chen Zhou wrote: >> There are following issues in arm64 kdump: >> 1. We use crashkernel=X to reserve crashkernel below 4G, which >> will fail when there is no enough low memory. >> 2. If reserving crashkernel above 4G, in this case, crash dump >> kernel will boot failure because there is no low memory available >> for allocation. >> >> To solve these issues, change the behavior of crashkernel=X and >> introduce crashkernel=X,[high,low]. crashkernel=X tries low allocation >> in DMA zone, and fall back to high allocation if it fails. >> We can also use "crashkernel=X,high" to select a region above DMA zone, >> which also tries to allocate at least 256M in DMA zone automatically. >> "crashkernel=Y,low" can be used to allocate specified size low memory. >> >> Another minor change, there may be two regions reserved for crash >> dump kernel, in order to distinct from the high region and make no >> effect to the use of existing kexec-tools, rename the low region as >> "Crash kernel (low)". > I think we discussed this but I don't remember the conclusion. Is this > only renamed conditionally so that we don't break current kexec-tools? Yes. > > IOW, assuming that the full crashkernel region is reserved below 4GB, > does the "(low)" suffix still appear or it's only if a high region is > additionally reserved? Suffix "low" only appear if a high region is additionally reserved. > >> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h >> index 3f6ecae0bc68..f0caed0cb5e1 100644 >> --- a/arch/arm64/include/asm/kexec.h >> +++ b/arch/arm64/include/asm/kexec.h >> @@ -96,6 +96,10 @@ static inline void crash_prepare_suspend(void) {} >> static inline void crash_post_resume(void) {} >> #endif >> >> +#ifdef CONFIG_KEXEC_CORE >> +extern void __init reserve_crashkernel(void); >> +#endif > Why not have this in some generic header? > >> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c >> index c18aacde8bb0..69c592c546de 100644 >> --- a/arch/arm64/kernel/setup.c >> +++ b/arch/arm64/kernel/setup.c >> @@ -238,7 +238,18 @@ static void __init request_standard_resources(void) >> kernel_data.end <= res->end) >> request_resource(res, &kernel_data); >> #ifdef CONFIG_KEXEC_CORE >> - /* Userspace will find "Crash kernel" region in /proc/iomem. */ >> + /* >> + * Userspace will find "Crash kernel" or "Crash kernel (low)" >> + * region in /proc/iomem. >> + * In order to distinct from the high region and make no effect >> + * to the use of existing kexec-tools, rename the low region as >> + * "Crash kernel (low)". >> + */ >> + if (crashk_low_res.end && crashk_low_res.start >= res->start && >> + crashk_low_res.end <= res->end) { >> + crashk_low_res.name = "Crash kernel (low)"; >> + request_resource(res, &crashk_low_res); >> + } >> if (crashk_res.end && crashk_res.start >= res->start && >> crashk_res.end <= res->end) >> request_resource(res, &crashk_res); > My reading of the new generic reserve_crashkernel() is that > crashk_low_res will only be populated if crask_res is above 4GB. If > that's correct, I'm fine with the renaming here since current systems > would not get a renamed low reservation (as long as they don't change > the kernel cmdline). > >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c >> index 912f64f505f7..d20f5c444ebf 100644 >> --- a/arch/arm64/mm/init.c >> +++ b/arch/arm64/mm/init.c >> @@ -35,6 +35,7 @@ >> #include <asm/fixmap.h> >> #include <asm/kasan.h> >> #include <asm/kernel-pgtable.h> >> +#include <asm/kexec.h> >> #include <asm/memory.h> >> #include <asm/numa.h> >> #include <asm/sections.h> >> @@ -61,66 +62,11 @@ EXPORT_SYMBOL(memstart_addr); >> */ >> phys_addr_t arm64_dma_phys_limit __ro_after_init; >> >> -#ifdef CONFIG_KEXEC_CORE >> -/* >> - * reserve_crashkernel() - reserves memory for crash kernel >> - * >> - * This function reserves memory area given in "crashkernel=" kernel command >> - * line parameter. The memory reserved is used by dump capture kernel when >> - * primary kernel is crashing. >> - */ >> +#ifndef CONFIG_KEXEC_CORE >> static void __init reserve_crashkernel(void) >> { > [...] >> } >> +#endif > Can we not have the dummy reserve_crashkernel() in the generic code as > well and avoid the #ifndef here? You mean put the dummy reserve_crashkernel() and the relate function declaration in some generic header? Baoquan also mentioned about this. Now all the arch that support kdump have the dummy reserve_crashkernel() and function declaration, such as arm/arm64/ppc/s390.. But currently different arch may have different CONFIG and different function declaration about this, for example, for s390, static void __init reserve_crashkernel(void) { #ifdef CONFIG_CRASH_DUMP ... #endif } for ppc, #ifdef CONFIG_KEXEC_CORE extern void reserve_crashkernel(void); #else static inline void reserve_crashkernel(void) { ; } #endif If we move these to generic header we need think about: 1. the related config in different arch 2. function declaration(static/non static) As Baoquan said in patch 9, how about leave with it for now and i try to solve this later? > >> #ifdef CONFIG_CRASH_DUMP >> static int __init early_init_dt_scan_elfcorehdr(unsigned long node, >> @@ -446,6 +392,14 @@ void __init bootmem_init(void) >> * reserved, so do it here. >> */ >> reserve_crashkernel(); >> +#ifdef CONFIG_KEXEC_CORE >> + /* >> + * The low region is intended to be used for crash dump kernel devices, >> + * just mark the low region as "nomap" simply. >> + */ >> + if (crashk_low_res.end) >> + memblock_mark_nomap(crashk_low_res.start, resource_size(&crashk_low_res)); >> +#endif > Do we do something similar for crashk_res? Not. In the primary kernel(production kernel), we need to use crashk_res memory for crash kernel elf core header, initrd... Different with this, the crashk_low_res is only for crash dump kernel devices. > > Also, I can see we call crash_exclude_mem_range() only for crashk_res. > Do we need to do this for crashk_low_res as well? You are right, i missed about this. Will do in next version. Thanks, Chen Zhou >
On 2021/2/26 18:31, chenzhou wrote: > > On 2021/2/25 0:04, Catalin Marinas wrote: >> On Sat, Jan 30, 2021 at 03:10:22PM +0800, Chen Zhou wrote: >>> There are following issues in arm64 kdump: >>> 1. We use crashkernel=X to reserve crashkernel below 4G, which >>> will fail when there is no enough low memory. >>> 2. If reserving crashkernel above 4G, in this case, crash dump >>> kernel will boot failure because there is no low memory available >>> for allocation. >>> >>> To solve these issues, change the behavior of crashkernel=X and >>> introduce crashkernel=X,[high,low]. crashkernel=X tries low allocation >>> in DMA zone, and fall back to high allocation if it fails. >>> We can also use "crashkernel=X,high" to select a region above DMA zone, >>> which also tries to allocate at least 256M in DMA zone automatically. >>> "crashkernel=Y,low" can be used to allocate specified size low memory. >>> >>> Another minor change, there may be two regions reserved for crash >>> dump kernel, in order to distinct from the high region and make no >>> effect to the use of existing kexec-tools, rename the low region as >>> "Crash kernel (low)". >> I think we discussed this but I don't remember the conclusion. Is this >> only renamed conditionally so that we don't break current kexec-tools? > Yes. >> IOW, assuming that the full crashkernel region is reserved below 4GB, >> does the "(low)" suffix still appear or it's only if a high region is >> additionally reserved? > Suffix "low" only appear if a high region is additionally reserved. >>> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h >>> index 3f6ecae0bc68..f0caed0cb5e1 100644 >>> --- a/arch/arm64/include/asm/kexec.h >>> +++ b/arch/arm64/include/asm/kexec.h >>> @@ -96,6 +96,10 @@ static inline void crash_prepare_suspend(void) {} >>> static inline void crash_post_resume(void) {} >>> #endif >>> >>> +#ifdef CONFIG_KEXEC_CORE >>> +extern void __init reserve_crashkernel(void); >>> +#endif >> Why not have this in some generic header? >> >>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c >>> index c18aacde8bb0..69c592c546de 100644 >>> --- a/arch/arm64/kernel/setup.c >>> +++ b/arch/arm64/kernel/setup.c >>> @@ -238,7 +238,18 @@ static void __init request_standard_resources(void) >>> kernel_data.end <= res->end) >>> request_resource(res, &kernel_data); >>> #ifdef CONFIG_KEXEC_CORE >>> - /* Userspace will find "Crash kernel" region in /proc/iomem. */ >>> + /* >>> + * Userspace will find "Crash kernel" or "Crash kernel (low)" >>> + * region in /proc/iomem. >>> + * In order to distinct from the high region and make no effect >>> + * to the use of existing kexec-tools, rename the low region as >>> + * "Crash kernel (low)". >>> + */ >>> + if (crashk_low_res.end && crashk_low_res.start >= res->start && >>> + crashk_low_res.end <= res->end) { >>> + crashk_low_res.name = "Crash kernel (low)"; >>> + request_resource(res, &crashk_low_res); >>> + } >>> if (crashk_res.end && crashk_res.start >= res->start && >>> crashk_res.end <= res->end) >>> request_resource(res, &crashk_res); >> My reading of the new generic reserve_crashkernel() is that >> crashk_low_res will only be populated if crask_res is above 4GB. If >> that's correct, I'm fine with the renaming here since current systems >> would not get a renamed low reservation (as long as they don't change >> the kernel cmdline). >> >>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c >>> index 912f64f505f7..d20f5c444ebf 100644 >>> --- a/arch/arm64/mm/init.c >>> +++ b/arch/arm64/mm/init.c >>> @@ -35,6 +35,7 @@ >>> #include <asm/fixmap.h> >>> #include <asm/kasan.h> >>> #include <asm/kernel-pgtable.h> >>> +#include <asm/kexec.h> >>> #include <asm/memory.h> >>> #include <asm/numa.h> >>> #include <asm/sections.h> >>> @@ -61,66 +62,11 @@ EXPORT_SYMBOL(memstart_addr); >>> */ >>> phys_addr_t arm64_dma_phys_limit __ro_after_init; >>> >>> -#ifdef CONFIG_KEXEC_CORE >>> -/* >>> - * reserve_crashkernel() - reserves memory for crash kernel >>> - * >>> - * This function reserves memory area given in "crashkernel=" kernel command >>> - * line parameter. The memory reserved is used by dump capture kernel when >>> - * primary kernel is crashing. >>> - */ >>> +#ifndef CONFIG_KEXEC_CORE >>> static void __init reserve_crashkernel(void) >>> { >> [...] >>> } >>> +#endif >> Can we not have the dummy reserve_crashkernel() in the generic code as >> well and avoid the #ifndef here? > You mean put the dummy reserve_crashkernel() and the relate function declaration in some generic header? > > Baoquan also mentioned about this. > Now all the arch that support kdump have the dummy reserve_crashkernel() and > function declaration, such as arm/arm64/ppc/s390.. > > But currently different arch may have different CONFIG and different function declaration about this, > for example, > > for s390, > static void __init reserve_crashkernel(void) > { > #ifdef CONFIG_CRASH_DUMP > ... > #endif > } > > for ppc, > #ifdef CONFIG_KEXEC_CORE > extern void reserve_crashkernel(void); > #else > static inline void reserve_crashkernel(void) { ; } > #endif > > If we move these to generic header we need think about: > 1. the related config in different arch > 2. function declaration(static/non static) > > As Baoquan said in patch 9, how about leave with it for now and i try to solve this later? > >>> #ifdef CONFIG_CRASH_DUMP >>> static int __init early_init_dt_scan_elfcorehdr(unsigned long node, >>> @@ -446,6 +392,14 @@ void __init bootmem_init(void) >>> * reserved, so do it here. >>> */ >>> reserve_crashkernel(); >>> +#ifdef CONFIG_KEXEC_CORE >>> + /* >>> + * The low region is intended to be used for crash dump kernel devices, >>> + * just mark the low region as "nomap" simply. >>> + */ >>> + if (crashk_low_res.end) >>> + memblock_mark_nomap(crashk_low_res.start, resource_size(&crashk_low_res)); >>> +#endif >> Do we do something similar for crashk_res? > Not. In the primary kernel(production kernel), we need to use crashk_res memory for crash kernel > elf core header, initrd... Sorry, missed one comma after crash kernel. Not. In the primary kernel(production kernel), we need to use crashk_res memory for crash kernel, elf core header, initrd and so on. > > Different with this, the crashk_low_res is only for crash dump kernel devices. >> Also, I can see we call crash_exclude_mem_range() only for crashk_res. >> Do we need to do this for crashk_low_res as well? > You are right, i missed about this. Will do in next version. > > Thanks, > Chen Zhou
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h index 3f6ecae0bc68..f0caed0cb5e1 100644 --- a/arch/arm64/include/asm/kexec.h +++ b/arch/arm64/include/asm/kexec.h @@ -96,6 +96,10 @@ static inline void crash_prepare_suspend(void) {} static inline void crash_post_resume(void) {} #endif +#ifdef CONFIG_KEXEC_CORE +extern void __init reserve_crashkernel(void); +#endif + #ifdef CONFIG_KEXEC_FILE #define ARCH_HAS_KIMAGE_ARCH diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index c18aacde8bb0..69c592c546de 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -238,7 +238,18 @@ static void __init request_standard_resources(void) kernel_data.end <= res->end) request_resource(res, &kernel_data); #ifdef CONFIG_KEXEC_CORE - /* Userspace will find "Crash kernel" region in /proc/iomem. */ + /* + * Userspace will find "Crash kernel" or "Crash kernel (low)" + * region in /proc/iomem. + * In order to distinct from the high region and make no effect + * to the use of existing kexec-tools, rename the low region as + * "Crash kernel (low)". + */ + if (crashk_low_res.end && crashk_low_res.start >= res->start && + crashk_low_res.end <= res->end) { + crashk_low_res.name = "Crash kernel (low)"; + request_resource(res, &crashk_low_res); + } if (crashk_res.end && crashk_res.start >= res->start && crashk_res.end <= res->end) request_resource(res, &crashk_res); diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 912f64f505f7..d20f5c444ebf 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -35,6 +35,7 @@ #include <asm/fixmap.h> #include <asm/kasan.h> #include <asm/kernel-pgtable.h> +#include <asm/kexec.h> #include <asm/memory.h> #include <asm/numa.h> #include <asm/sections.h> @@ -61,66 +62,11 @@ EXPORT_SYMBOL(memstart_addr); */ phys_addr_t arm64_dma_phys_limit __ro_after_init; -#ifdef CONFIG_KEXEC_CORE -/* - * reserve_crashkernel() - reserves memory for crash kernel - * - * This function reserves memory area given in "crashkernel=" kernel command - * line parameter. The memory reserved is used by dump capture kernel when - * primary kernel is crashing. - */ +#ifndef CONFIG_KEXEC_CORE static void __init reserve_crashkernel(void) { - unsigned long long crash_base, crash_size; - int ret; - - ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(), - &crash_size, &crash_base); - /* no crashkernel= or invalid value specified */ - if (ret || !crash_size) - return; - - crash_size = PAGE_ALIGN(crash_size); - - if (crash_base == 0) { - /* Current arm64 boot protocol requires 2MB alignment */ - crash_base = memblock_find_in_range(CRASH_ALIGN, CRASH_ADDR_LOW_MAX, - crash_size, CRASH_ALIGN); - if (crash_base == 0) { - pr_warn("cannot allocate crashkernel (size:0x%llx)\n", - crash_size); - return; - } - } else { - /* User specifies base address explicitly. */ - if (!memblock_is_region_memory(crash_base, crash_size)) { - pr_warn("cannot reserve crashkernel: region is not memory\n"); - return; - } - - if (memblock_is_region_reserved(crash_base, crash_size)) { - pr_warn("cannot reserve crashkernel: region overlaps reserved memory\n"); - return; - } - - if (!IS_ALIGNED(crash_base, CRASH_ALIGN)) { - pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n"); - return; - } - } - memblock_reserve(crash_base, crash_size); - - pr_info("crashkernel reserved: 0x%016llx - 0x%016llx (%lld MB)\n", - crash_base, crash_base + crash_size, crash_size >> 20); - - crashk_res.start = crash_base; - crashk_res.end = crash_base + crash_size - 1; } -#else -static void __init reserve_crashkernel(void) -{ -} -#endif /* CONFIG_KEXEC_CORE */ +#endif #ifdef CONFIG_CRASH_DUMP static int __init early_init_dt_scan_elfcorehdr(unsigned long node, @@ -446,6 +392,14 @@ void __init bootmem_init(void) * reserved, so do it here. */ reserve_crashkernel(); +#ifdef CONFIG_KEXEC_CORE + /* + * The low region is intended to be used for crash dump kernel devices, + * just mark the low region as "nomap" simply. + */ + if (crashk_low_res.end) + memblock_mark_nomap(crashk_low_res.start, resource_size(&crashk_low_res)); +#endif memblock_dump_all(); } diff --git a/kernel/crash_core.c b/kernel/crash_core.c index a0e790d6ea0f..8479be270c0b 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -322,10 +322,10 @@ int __init parse_crashkernel_low(char *cmdline, #ifdef CONFIG_KEXEC_CORE -#ifdef CONFIG_X86 +#if defined(CONFIG_X86) || defined(CONFIG_ARM64) static int __init reserve_crashkernel_low(void) { -#ifdef CONFIG_X86_64 +#ifdef CONFIG_64BIT unsigned long long base, low_base = 0, low_size = 0; unsigned long low_mem_limit; int ret; @@ -450,7 +450,7 @@ void __init reserve_crashkernel(void) crashk_res.start = crash_base; crashk_res.end = crash_base + crash_size - 1; } -#endif /* CONFIG_X86 */ +#endif #endif /* CONFIG_KEXEC_CORE */ Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,