| Message ID | 20201103173159.27570-2-nsaenzjulienne@suse.de |
|---|---|
| State | New, archived |
| Series | arm64: Default to 32-bit wide ZONE_DMA |
Hi!

On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> boot table initialization, so move it later in the boot process.
> Specifically into mem_init(), this is the last place crashkernel will be
> able to reserve the memory before the page allocator kicks in.

> There isn't any apparent reason for doing this earlier.

It's so that map_mem() can carve it out of the linear/direct map.
This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
kernel. We depend on this if we continue with kdump, but failed to offline all the other
CPUs. We also depend on this when skipping the checksum code in purgatory, which can be
exceedingly slow.

Grepping around, the current order is:

start_kernel()
  -> setup_arch()
    -> arm64_memblock_init()		/* reserve */
    -> paging_init()
      -> map_mem()			/* carve out reservation */
  [...]
  -> mm_init()
    -> mem_init()

I agree we should add comments to make this apparent!

Thanks,

James

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 095540667f0f..fc4ab0d6d5d2 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
>  	else
>  		arm64_dma32_phys_limit = PHYS_MASK + 1;
>
> -	reserve_crashkernel();
> -
>  	reserve_elfcorehdr();
>
>  	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> @@ -508,6 +506,8 @@ void __init mem_init(void)
>  	else
>  		swiotlb_force = SWIOTLB_NO_FORCE;
>
> +	reserve_crashkernel();
> +
>  	set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);
>
>  #ifndef CONFIG_SPARSEMEM_VMEMMAP
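For readers without the tree at hand: the ordering problem exists because
reserve_crashkernel() places its reservation below a low-memory limit that is
only final once the DMA zones are known. An abridged sketch of the function as
it looked around v5.9 (arch/arm64/mm/init.c; error paths and the explicit-base
case trimmed for brevity):

	static void __init reserve_crashkernel(void)
	{
		unsigned long long crash_base, crash_size;
		int ret;

		ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
					&crash_size, &crash_base);
		/* no crashkernel= on the command line, or an invalid value */
		if (ret || !crash_size)
			return;

		crash_size = PAGE_ALIGN(crash_size);

		/* This limit is why the DMA zone layout has to be known
		 * before the reservation can be placed. */
		crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
						    crash_size, SZ_2M);
		if (crash_base == 0)
			return;

		memblock_reserve(crash_base, crash_size);
		crashk_res.start = crash_base;
		crashk_res.end = crash_base + crash_size - 1;
	}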
Hi James, thanks for the review. Some comments/questions below.

On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> Hi!
>
> On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > boot table initialization, so move it later in the boot process.
> > Specifically into mem_init(), this is the last place crashkernel will be
> > able to reserve the memory before the page allocator kicks in.
> > There isn't any apparent reason for doing this earlier.
>
> It's so that map_mem() can carve it out of the linear/direct map.
> This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> kernel. We depend on this if we continue with kdump, but failed to offline all the other
> CPUs.

I presume here you refer to arch_kexec_protect_crashkres(); IIUC this will only
happen further down the line, after having loaded the kdump kernel image. But
it also depends on the mappings being PAGE sized (flags == NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS).

> We also depend on this when skipping the checksum code in purgatory, which can be
> exceedingly slow.

This one I don't fully understand, so I'll lazily assume the prerequisite is
the same WRT how memory is mapped. :)

Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
prerequisite.

Keep in mind that acpi_table_upgrade() and unflatten_device_tree() depend on
having the linear mappings available. I don't see any simple way of solving
this. Both moving the firmware description routines to use fixmap and
correcting the linear mapping further down the line so as to include kdump's
regions seem excessive/impossible (feel free to correct me here). I'd be happy
to hear suggestions. Otherwise we're back to hard-coding the information as we
initially did.

Let me stress that knowing the DMA constraints in the system before reserving
crashkernel's regions is necessary if we ever want it to work seamlessly on all
platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
memory.

Regards,
Nicolas
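For context on the page-size requirement mentioned above, the arm64
arch_kexec_protect_crashkres() of that era (arch/arm64/kernel/machine_kexec.c,
lightly abridged) flips the loaded segments to invalid with set_memory_valid(),
which cannot split block mappings:

	void arch_kexec_protect_crashkres(void)
	{
		int i;

		/* Mark every loaded segment invalid in the linear map; this
		 * only works if the range was mapped with PAGE_SIZE pages. */
		for (i = 0; i < kexec_crash_image->nr_segments; i++)
			set_memory_valid(
				__phys_to_virt(kexec_crash_image->segment[i].mem),
				kexec_crash_image->segment[i].memsz >> PAGE_SHIFT, 0);
	}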
On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > boot table initialization, so move it later in the boot process.
> > > Specifically into mem_init(), this is the last place crashkernel will be
> > > able to reserve the memory before the page allocator kicks in.
> > > There isn't any apparent reason for doing this earlier.
> >
> > It's so that map_mem() can carve it out of the linear/direct map.
> > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > CPUs.
>
> I presume here you refer to arch_kexec_protect_crashkres(); IIUC this will only
> happen further down the line, after having loaded the kdump kernel image. But
> it also depends on the mappings being PAGE sized (flags == NO_BLOCK_MAPPINGS |
> NO_CONT_MAPPINGS).

IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
not the whole reserved memory that the crashkernel will use. For the
latter, we avoid the linear map by marking it as nomap in map_mem().

> > We also depend on this when skipping the checksum code in purgatory, which can be
> > exceedingly slow.
>
> This one I don't fully understand, so I'll lazily assume the prerequisite is
> the same WRT how memory is mapped. :)
>
> Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> prerequisite.
>
> Keep in mind that acpi_table_upgrade() and unflatten_device_tree() depend on
> having the linear mappings available.

So it looks like reserve_crashkernel() wants to reserve memory before
setting up the linear map with the information about the DMA zones in
place, but that comes later when we can parse the firmware tables.

I wonder, instead of not mapping the crashkernel reservation, can we not
do an arch_kexec_protect_crashkres() for the whole reservation after we
created the linear map?

> Let me stress that knowing the DMA constraints in the system before reserving
> crashkernel's regions is necessary if we ever want it to work seamlessly on all
> platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> memory.

Indeed. So we have 3 options (so far):

1. Allow the crashkernel reservation to go into the linear map but set
   it to invalid once allocated.

2. Parse the flattened DT (not sure what we do with ACPI) before
   creating the linear map. We may have to rely on some SoC ID here
   instead of actual DMA ranges.

3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
   reservations and not rely on arm64_dma_phys_limit in
   reserve_crashkernel().

I think (2) we tried hard to avoid. Option (3) brings us back to the
issues we had on large crashkernel reservations regressing on some
platforms (though it's been a while since; they mostly went quiet ;)).
However, with Chen's crashkernel patches we end up with two
reservations, one in the low DMA zone and one higher, potentially above
4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
reservations than what we have now.

If (1) works, I'd go for it (James knows this part better than me),
otherwise we can go for (3).
Hi Catalin,

On Tue, 2020-11-10 at 18:17 +0000, Catalin Marinas wrote:
> On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> > On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > > boot table initialization, so move it later in the boot process.
> > > > Specifically into mem_init(), this is the last place crashkernel will be
> > > > able to reserve the memory before the page allocator kicks in.
> > > > There isn't any apparent reason for doing this earlier.
> > >
> > > It's so that map_mem() can carve it out of the linear/direct map.
> > > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > > CPUs.
> >
> > I presume here you refer to arch_kexec_protect_crashkres(); IIUC this will only
> > happen further down the line, after having loaded the kdump kernel image. But
> > it also depends on the mappings being PAGE sized (flags == NO_BLOCK_MAPPINGS |
> > NO_CONT_MAPPINGS).
>
> IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
> not the whole reserved memory that the crashkernel will use. For the
> latter, we avoid the linear map by marking it as nomap in map_mem().

I'm not sure we're on the same page here, so sorry if this was already implied.

The crashkernel memory mapping is bypassed while preparing the linear mappings
but it is then mapped right away, with page granularity and !MTE.
See paging_init()->map_mem():

	/*
	 * Use page-level mappings here so that we can shrink the region
	 * in page granularity and put back unused memory to buddy system
	 * through /sys/kernel/kexec_crash_size interface.
	 */
	if (crashk_res.end) {
		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
			       PAGE_KERNEL,
			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
		memblock_clear_nomap(crashk_res.start,
				     resource_size(&crashk_res));
	}

IIUC the inconvenience here is that we need special mapping options for
crashkernel, and updating those after having mapped that memory as regular
memory isn't possible/easy to do.

> > > We also depend on this when skipping the checksum code in purgatory, which can be
> > > exceedingly slow.
> >
> > This one I don't fully understand, so I'll lazily assume the prerequisite is
> > the same WRT how memory is mapped. :)
> >
> > Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> > prerequisite.
> >
> > Keep in mind that acpi_table_upgrade() and unflatten_device_tree() depend on
> > having the linear mappings available.
>
> So it looks like reserve_crashkernel() wants to reserve memory before
> setting up the linear map with the information about the DMA zones in
> place, but that comes later when we can parse the firmware tables.
>
> I wonder, instead of not mapping the crashkernel reservation, can we not
> do an arch_kexec_protect_crashkres() for the whole reservation after we
> created the linear map?

arch_kexec_protect_crashkres() depends on __change_memory_common(), which
ultimately depends on the memory being mapped with PAGE_SIZE pages. As I
comment above, the trick would work as long as there is a way to update the
linear mappings with whatever crashkernel needs later in the boot process.
> > Let me stress that knowing the DMA constraints in the system before reserving
> > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > memory.
>
> Indeed. So we have 3 options (so far):
>
> 1. Allow the crashkernel reservation to go into the linear map but set
>    it to invalid once allocated.
>
> 2. Parse the flattened DT (not sure what we do with ACPI) before
>    creating the linear map. We may have to rely on some SoC ID here
>    instead of actual DMA ranges.
>
> 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
>    reservations and not rely on arm64_dma_phys_limit in
>    reserve_crashkernel().
>
> I think (2) we tried hard to avoid. Option (3) brings us back to the
> issues we had on large crashkernel reservations regressing on some
> platforms (though it's been a while since; they mostly went quiet ;)).
> However, with Chen's crashkernel patches we end up with two
> reservations, one in the low DMA zone and one higher, potentially above
> 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> reservations than what we have now.
>
> If (1) works, I'd go for it (James knows this part better than me),
> otherwise we can go for (3).

Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not,
I'll append (3) to this series.

Regards,
Nicolas
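For context on the __change_memory_common() point above: the arm64 helper
(arch/arm64/mm/pageattr.c, lightly abridged) walks the range with
apply_to_page_range(), which only understands pte-level entries, hence the
page-mapping prerequisite:

	static int __change_memory_common(unsigned long start, unsigned long size,
					  pgprot_t set_mask, pgprot_t clear_mask)
	{
		struct page_change_data data;
		int ret;

		data.set_mask = set_mask;
		data.clear_mask = clear_mask;

		/* Walks pte entries only: block (pmd/pud) mappings in the
		 * range cannot be modified this way. */
		ret = apply_to_page_range(&init_mm, start, size,
					  change_page_range, &data);

		flush_tlb_kernel_range(start, start + size);
		return ret;
	}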
Hi Nicolas,

On Thu, Nov 12, 2020 at 04:56:38PM +0100, Nicolas Saenz Julienne wrote:
> On Tue, 2020-11-10 at 18:17 +0000, Catalin Marinas wrote:
> > On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> > > On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > > > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > > > boot table initialization, so move it later in the boot process.
> > > > > Specifically into mem_init(), this is the last place crashkernel will be
> > > > > able to reserve the memory before the page allocator kicks in.
> > > > > There isn't any apparent reason for doing this earlier.
> > > >
> > > > It's so that map_mem() can carve it out of the linear/direct map.
> > > > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > > > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > > > CPUs.
> > >
> > > I presume here you refer to arch_kexec_protect_crashkres(); IIUC this will only
> > > happen further down the line, after having loaded the kdump kernel image. But
> > > it also depends on the mappings being PAGE sized (flags == NO_BLOCK_MAPPINGS |
> > > NO_CONT_MAPPINGS).
> >
> > IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
> > not the whole reserved memory that the crashkernel will use. For the
> > latter, we avoid the linear map by marking it as nomap in map_mem().
>
> I'm not sure we're on the same page here, so sorry if this was already implied.
>
> The crashkernel memory mapping is bypassed while preparing the linear mappings
> but it is then mapped right away, with page granularity and !MTE.
> See paging_init()->map_mem():
>
> 	/*
> 	 * Use page-level mappings here so that we can shrink the region
> 	 * in page granularity and put back unused memory to buddy system
> 	 * through /sys/kernel/kexec_crash_size interface.
> 	 */
> 	if (crashk_res.end) {
> 		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> 			       PAGE_KERNEL,
> 			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> 		memblock_clear_nomap(crashk_res.start,
> 				     resource_size(&crashk_res));
> 	}
>
> IIUC the inconvenience here is that we need special mapping options for
> crashkernel, and updating those after having mapped that memory as regular
> memory isn't possible/easy to do.

You are right, it still gets mapped, but with page granularity. However,
to James' point, we still need to know the crashkernel range in
map_mem() as arch_kexec_protect_crashkres() relies on having page rather
than block mappings.

> > > > We also depend on this when skipping the checksum code in purgatory, which can be
> > > > exceedingly slow.
> > >
> > > This one I don't fully understand, so I'll lazily assume the prerequisite is
> > > the same WRT how memory is mapped. :)
> > >
> > > Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> > > prerequisite.
> > >
> > > Keep in mind that acpi_table_upgrade() and unflatten_device_tree() depend on
> > > having the linear mappings available.
> >
> > So it looks like reserve_crashkernel() wants to reserve memory before
> > setting up the linear map with the information about the DMA zones in
> > place, but that comes later when we can parse the firmware tables.
> >
> > I wonder, instead of not mapping the crashkernel reservation, can we not
> > do an arch_kexec_protect_crashkres() for the whole reservation after we
> > created the linear map?
>
> arch_kexec_protect_crashkres() depends on __change_memory_common(), which
> ultimately depends on the memory being mapped with PAGE_SIZE pages. As I
> comment above, the trick would work as long as there is a way to update the
> linear mappings with whatever crashkernel needs later in the boot process.

Breaking block mappings into pages is a lot more difficult later. OTOH,
the default these days is rodata_full==true, so I don't think we have
block mappings anyway. We could add NO_BLOCK_MAPPINGS if KEXEC_CORE is
enabled.

> > > Let me stress that knowing the DMA constraints in the system before reserving
> > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > memory.
> >
> > Indeed. So we have 3 options (so far):
> >
> > 1. Allow the crashkernel reservation to go into the linear map but set
> >    it to invalid once allocated.
> >
> > 2. Parse the flattened DT (not sure what we do with ACPI) before
> >    creating the linear map. We may have to rely on some SoC ID here
> >    instead of actual DMA ranges.
> >
> > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> >    reservations and not rely on arm64_dma_phys_limit in
> >    reserve_crashkernel().
> >
> > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > issues we had on large crashkernel reservations regressing on some
> > platforms (though it's been a while since; they mostly went quiet ;)).
> > However, with Chen's crashkernel patches we end up with two
> > reservations, one in the low DMA zone and one higher, potentially above
> > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > reservations than what we have now.
> >
> > If (1) works, I'd go for it (James knows this part better than me),
> > otherwise we can go for (3).
>
> Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not,
> I'll append (3) to this series.

I think for 1 we could also remove the additional KEXEC_CORE checks,
something like below, untested:

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3e5a6913acc8..27ab609c1c0c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || debug_pagealloc_enabled())
+	if (rodata_full || debug_pagealloc_enabled() ||
+	    IS_ENABLED(CONFIG_KEXEC_CORE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
 	 * the following for-loop
 	 */
 	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
-#ifdef CONFIG_KEXEC_CORE
-	if (crashk_res.end)
-		memblock_mark_nomap(crashk_res.start,
-				    resource_size(&crashk_res));
-#endif
 
 	/* map all the memory banks */
 	for_each_mem_range(i, &start, &end) {
@@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
 	__map_memblock(pgdp, kernel_start, kernel_end,
 		       PAGE_KERNEL, NO_CONT_MAPPINGS);
 	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
-
-#ifdef CONFIG_KEXEC_CORE
-	/*
-	 * Use page-level mappings here so that we can shrink the region
-	 * in page granularity and put back unused memory to buddy system
-	 * through /sys/kernel/kexec_crash_size interface.
-	 */
-	if (crashk_res.end) {
-		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
-			       PAGE_KERNEL,
-			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-		memblock_clear_nomap(crashk_res.start,
-				     resource_size(&crashk_res));
-	}
-#endif
 }
 
 void mark_rodata_ro(void)
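A note on the rodata_full remark above: arm64 selects page-granular linear-map
permissions by default through CONFIG_RODATA_FULL_DEFAULT_ENABLED (default y),
so on most configurations map_mem() already uses NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS and the extra KEXEC_CORE condition changes nothing in
practice; it only matters when rodata= has been used to relax the default.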
Hi Catalin, James, sorry for the late reply but I got sidetracked.

On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
[...]
> > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > memory.
> > >
> > > Indeed. So we have 3 options (so far):
> > >
> > > 1. Allow the crashkernel reservation to go into the linear map but set
> > >    it to invalid once allocated.
> > >
> > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > >    creating the linear map. We may have to rely on some SoC ID here
> > >    instead of actual DMA ranges.
> > >
> > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > >    reservations and not rely on arm64_dma_phys_limit in
> > >    reserve_crashkernel().
> > >
> > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > issues we had on large crashkernel reservations regressing on some
> > > platforms (though it's been a while since; they mostly went quiet ;)).
> > > However, with Chen's crashkernel patches we end up with two
> > > reservations, one in the low DMA zone and one higher, potentially above
> > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > reservations than what we have now.
> > >
> > > If (1) works, I'd go for it (James knows this part better than me),
> > > otherwise we can go for (3).
> >
> > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not,
> > I'll append (3) to this series.
>
> I think for 1 we could also remove the additional KEXEC_CORE checks,
> something like below, untested:
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 3e5a6913acc8..27ab609c1c0c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
>  	int flags = 0;
>  	u64 i;
>  
> -	if (rodata_full || debug_pagealloc_enabled())
> +	if (rodata_full || debug_pagealloc_enabled() ||
> +	    IS_ENABLED(CONFIG_KEXEC_CORE))
>  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
>  	/*
> @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
>  	 * the following for-loop
>  	 */
>  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> -#ifdef CONFIG_KEXEC_CORE
> -	if (crashk_res.end)
> -		memblock_mark_nomap(crashk_res.start,
> -				    resource_size(&crashk_res));
> -#endif
>  
>  	/* map all the memory banks */
>  	for_each_mem_range(i, &start, &end) {
> @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
>  	__map_memblock(pgdp, kernel_start, kernel_end,
>  		       PAGE_KERNEL, NO_CONT_MAPPINGS);
>  	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> -
> -#ifdef CONFIG_KEXEC_CORE
> -	/*
> -	 * Use page-level mappings here so that we can shrink the region
> -	 * in page granularity and put back unused memory to buddy system
> -	 * through /sys/kernel/kexec_crash_size interface.
> -	 */
> -	if (crashk_res.end) {
> -		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> -			       PAGE_KERNEL,
> -			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> -		memblock_clear_nomap(crashk_res.start,
> -				     resource_size(&crashk_res));
> -	}
> -#endif
> }
>
> void mark_rodata_ro(void)

So as far as I'm concerned this is good enough for me. I took the time to
properly test crashkernel on RPi4 using the series, this patch, and another
small fix to properly update /proc/iomem.
I'll send v7 soon, but before that: James (or anyone for that matter), any
obvious push-back to Catalin's solution?

Regards,
Nicolas
On Thu, Nov 19, 2020 at 03:09:58PM +0100, Nicolas Saenz Julienne wrote:
> On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
> [...]
> > > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > > memory.
> > > >
> > > > Indeed. So we have 3 options (so far):
> > > >
> > > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > >    it to invalid once allocated.
> > > >
> > > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > >    creating the linear map. We may have to rely on some SoC ID here
> > > >    instead of actual DMA ranges.
> > > >
> > > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > >    reservations and not rely on arm64_dma_phys_limit in
> > > >    reserve_crashkernel().
> > > >
> > > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > > issues we had on large crashkernel reservations regressing on some
> > > > platforms (though it's been a while since; they mostly went quiet ;)).
> > > > However, with Chen's crashkernel patches we end up with two
> > > > reservations, one in the low DMA zone and one higher, potentially above
> > > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > > reservations than what we have now.
> > > >
> > > > If (1) works, I'd go for it (James knows this part better than me),
> > > > otherwise we can go for (3).
> > >
> > > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not,
> > > I'll append (3) to this series.
> >
> > I think for 1 we could also remove the additional KEXEC_CORE checks,
> > something like below, untested:
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 3e5a6913acc8..27ab609c1c0c 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> >  	int flags = 0;
> >  	u64 i;
> >  
> > -	if (rodata_full || debug_pagealloc_enabled())
> > +	if (rodata_full || debug_pagealloc_enabled() ||
> > +	    IS_ENABLED(CONFIG_KEXEC_CORE))
> >  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >  
> >  	/*
> > @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> >  	 * the following for-loop
> >  	 */
> >  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > -#ifdef CONFIG_KEXEC_CORE
> > -	if (crashk_res.end)
> > -		memblock_mark_nomap(crashk_res.start,
> > -				    resource_size(&crashk_res));
> > -#endif
> >  
> >  	/* map all the memory banks */
> >  	for_each_mem_range(i, &start, &end) {
> > @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> >  	__map_memblock(pgdp, kernel_start, kernel_end,
> >  		       PAGE_KERNEL, NO_CONT_MAPPINGS);
> >  	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> > -
> > -#ifdef CONFIG_KEXEC_CORE
> > -	/*
> > -	 * Use page-level mappings here so that we can shrink the region
> > -	 * in page granularity and put back unused memory to buddy system
> > -	 * through /sys/kernel/kexec_crash_size interface.
> > -	 */
> > -	if (crashk_res.end) {
> > -		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> > -			       PAGE_KERNEL,
> > -			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > -		memblock_clear_nomap(crashk_res.start,
> > -				     resource_size(&crashk_res));
> > -	}
> > -#endif
> > }
> >
> > void mark_rodata_ro(void)
>
> So as far as I'm concerned this is good enough for me. I took the time to
> properly test crashkernel on RPi4 using the series, this patch, and another
> small fix to properly update /proc/iomem.
>
> I'll send v7 soon, but before that: James (or anyone for that matter), any
> obvious push-back to Catalin's solution?

I talked to James earlier and he was suggesting that we check the
command line for any crashkernel reservations and only disable block
mappings in that case; see the diff below on top of the one I already
sent (still testing it).

If you don't have any other changes for v7, I'm happy to pick v6 up on
top of the no-block-mapping fix.

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ed71b1c305d7..acdec0c67d3b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -469,6 +469,21 @@ void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
+static bool crash_mem_map __initdata;
+
+static int __init enable_crash_mem_map(char *arg)
+{
+	/*
+	 * Proper parameter parsing is done by reserve_crashkernel(). We only
+	 * need to know if the linear map has to avoid block mappings so that
+	 * the crashkernel reservations can be unmapped later.
+	 */
+	crash_mem_map = false;
+
+	return 0;
+}
+early_param("crashkernel", enable_crash_mem_map);
+
 static void __init map_mem(pgd_t *pgdp)
 {
 	phys_addr_t kernel_start = __pa_symbol(_stext);
@@ -477,8 +492,7 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KEXEC_CORE))
+	if (rodata_full || debug_pagealloc_enabled() || crash_mem_map)
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
On Thu, Nov 19, 2020 at 05:10:49PM +0000, Catalin Marinas wrote:
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ed71b1c305d7..acdec0c67d3b 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -469,6 +469,21 @@ void __init mark_linear_text_alias_ro(void)
>  			    PAGE_KERNEL_RO);
>  }
>  
> +static bool crash_mem_map __initdata;
> +
> +static int __init enable_crash_mem_map(char *arg)
> +{
> +	/*
> +	 * Proper parameter parsing is done by reserve_crashkernel(). We only
> +	 * need to know if the linear map has to avoid block mappings so that
> +	 * the crashkernel reservations can be unmapped later.
> +	 */
> +	crash_mem_map = false;

It should be set to true.
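For clarity, with that one-line fix folded in, the helper from the diff above
would read:

	static bool crash_mem_map __initdata;

	static int __init enable_crash_mem_map(char *arg)
	{
		/*
		 * Proper parameter parsing is done by reserve_crashkernel().
		 * We only need to know if the linear map has to avoid block
		 * mappings so that the crashkernel reservations can be
		 * unmapped later.
		 */
		crash_mem_map = true;	/* was mistakenly 'false' above */

		return 0;
	}
	early_param("crashkernel", enable_crash_mem_map);

That way, the mere presence of crashkernel= on the command line is what forces
map_mem() to drop block mappings.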
On Thu, 2020-11-19 at 17:10 +0000, Catalin Marinas wrote:
> On Thu, Nov 19, 2020 at 03:09:58PM +0100, Nicolas Saenz Julienne wrote:
> > On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
> > [...]
> > > > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > > > memory.
> > > > >
> > > > > Indeed. So we have 3 options (so far):
> > > > >
> > > > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > > >    it to invalid once allocated.
> > > > >
> > > > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > > >    creating the linear map. We may have to rely on some SoC ID here
> > > > >    instead of actual DMA ranges.
> > > > >
> > > > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > > >    reservations and not rely on arm64_dma_phys_limit in
> > > > >    reserve_crashkernel().
> > > > >
> > > > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > > > issues we had on large crashkernel reservations regressing on some
> > > > > platforms (though it's been a while since; they mostly went quiet ;)).
> > > > > However, with Chen's crashkernel patches we end up with two
> > > > > reservations, one in the low DMA zone and one higher, potentially above
> > > > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > > > reservations than what we have now.
> > > > >
> > > > > If (1) works, I'd go for it (James knows this part better than me),
> > > > > otherwise we can go for (3).
> > > >
> > > > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not,
> > > > I'll append (3) to this series.
> > >
> > > I think for 1 we could also remove the additional KEXEC_CORE checks,
> > > something like below, untested:
> > >
> > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > index 3e5a6913acc8..27ab609c1c0c 100644
> > > --- a/arch/arm64/mm/mmu.c
> > > +++ b/arch/arm64/mm/mmu.c
> > > @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> > >  	int flags = 0;
> > >  	u64 i;
> > >  
> > > -	if (rodata_full || debug_pagealloc_enabled())
> > > +	if (rodata_full || debug_pagealloc_enabled() ||
> > > +	    IS_ENABLED(CONFIG_KEXEC_CORE))
> > >  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > >  
> > >  	/*
> > > @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> > >  	 * the following for-loop
> > >  	 */
> > >  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > > -#ifdef CONFIG_KEXEC_CORE
> > > -	if (crashk_res.end)
> > > -		memblock_mark_nomap(crashk_res.start,
> > > -				    resource_size(&crashk_res));
> > > -#endif
> > >  
> > >  	/* map all the memory banks */
> > >  	for_each_mem_range(i, &start, &end) {
> > > @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> > >  	__map_memblock(pgdp, kernel_start, kernel_end,
> > >  		       PAGE_KERNEL, NO_CONT_MAPPINGS);
> > >  	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> > > -
> > > -#ifdef CONFIG_KEXEC_CORE
> > > -	/*
> > > -	 * Use page-level mappings here so that we can shrink the region
> > > -	 * in page granularity and put back unused memory to buddy system
> > > -	 * through /sys/kernel/kexec_crash_size interface.
> > > -	 */
> > > -	if (crashk_res.end) {
> > > -		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> > > -			       PAGE_KERNEL,
> > > -			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > > -		memblock_clear_nomap(crashk_res.start,
> > > -				     resource_size(&crashk_res));
> > > -	}
> > > -#endif
> > > }
> > >
> > > void mark_rodata_ro(void)
> >
> > So as far as I'm concerned this is good enough for me. I took the time to
> > properly test crashkernel on RPi4 using the series, this patch, and another
> > small fix to properly update /proc/iomem.
> >
> > I'll send v7 soon, but before that: James (or anyone for that matter), any
> > obvious push-back to Catalin's solution?
>
> I talked to James earlier and he was suggesting that we check the
> command line for any crashkernel reservations and only disable block
> mappings in that case; see the diff below on top of the one I already
> sent (still testing it).

That's even better :)

> If you don't have any other changes for v7, I'm happy to pick v6 up on
> top of the no-block-mapping fix.

Yes, I've got a small change in patch #1: the crashkernel reservation has to be
performed before request_standard_resources() is called. That's OK, since we're
all set up by then; I moved the crashkernel reservation to the end of
bootmem_init(). I attached the patch. If it's easier for you, I'll send v7.

Regards,
Nicolas
On Thu, Nov 19, 2020 at 06:25:29PM +0100, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-19 at 17:10 +0000, Catalin Marinas wrote:
> > On Thu, Nov 19, 2020 at 03:09:58PM +0100, Nicolas Saenz Julienne wrote:
> > > On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
> > > [...]
> > > > > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > > > > memory.
> > > > > >
> > > > > > Indeed. So we have 3 options (so far):
> > > > > >
> > > > > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > > > >    it to invalid once allocated.
> > > > > >
> > > > > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > > > >    creating the linear map. We may have to rely on some SoC ID here
> > > > > >    instead of actual DMA ranges.
> > > > > >
> > > > > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > > > >    reservations and not rely on arm64_dma_phys_limit in
> > > > > >    reserve_crashkernel().
> > > > > >
> > > > > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > > > > issues we had on large crashkernel reservations regressing on some
> > > > > > platforms (though it's been a while since; they mostly went quiet ;)).
> > > > > > However, with Chen's crashkernel patches we end up with two
> > > > > > reservations, one in the low DMA zone and one higher, potentially above
> > > > > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > > > > reservations than what we have now.
> > > > > >
> > > > > > If (1) works, I'd go for it (James knows this part better than me),
> > > > > > otherwise we can go for (3).
> > > > >
> > > > > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not,
> > > > > I'll append (3) to this series.
> > > >
> > > > I think for 1 we could also remove the additional KEXEC_CORE checks,
> > > > something like below, untested:
> > > >
> > > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > > index 3e5a6913acc8..27ab609c1c0c 100644
> > > > --- a/arch/arm64/mm/mmu.c
> > > > +++ b/arch/arm64/mm/mmu.c
> > > > @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> > > >  	int flags = 0;
> > > >  	u64 i;
> > > >  
> > > > -	if (rodata_full || debug_pagealloc_enabled())
> > > > +	if (rodata_full || debug_pagealloc_enabled() ||
> > > > +	    IS_ENABLED(CONFIG_KEXEC_CORE))
> > > >  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > > >  
> > > >  	/*
> > > > @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> > > >  	 * the following for-loop
> > > >  	 */
> > > >  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > > > -#ifdef CONFIG_KEXEC_CORE
> > > > -	if (crashk_res.end)
> > > > -		memblock_mark_nomap(crashk_res.start,
> > > > -				    resource_size(&crashk_res));
> > > > -#endif
> > > >  
> > > >  	/* map all the memory banks */
> > > >  	for_each_mem_range(i, &start, &end) {
> > > > @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> > > >  	__map_memblock(pgdp, kernel_start, kernel_end,
> > > >  		       PAGE_KERNEL, NO_CONT_MAPPINGS);
> > > >  	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> > > > -
> > > > -#ifdef CONFIG_KEXEC_CORE
> > > > -	/*
> > > > -	 * Use page-level mappings here so that we can shrink the region
> > > > -	 * in page granularity and put back unused memory to buddy system
> > > > -	 * through /sys/kernel/kexec_crash_size interface.
> > > > -	 */
> > > > -	if (crashk_res.end) {
> > > > -		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> > > > -			       PAGE_KERNEL,
> > > > -			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > > > -		memblock_clear_nomap(crashk_res.start,
> > > > -				     resource_size(&crashk_res));
> > > > -	}
> > > > -#endif
> > > > }
> > > >
> > > > void mark_rodata_ro(void)
> > >
> > > So as far as I'm concerned this is good enough for me. I took the time to
> > > properly test crashkernel on RPi4 using the series, this patch, and another
> > > small fix to properly update /proc/iomem.
> > >
> > > I'll send v7 soon, but before that: James (or anyone for that matter), any
> > > obvious push-back to Catalin's solution?
> >
> > I talked to James earlier and he was suggesting that we check the
> > command line for any crashkernel reservations and only disable block
> > mappings in that case; see the diff below on top of the one I already
> > sent (still testing it).
>
> That's even better :)
>
> > If you don't have any other changes for v7, I'm happy to pick v6 up on
> > top of the no-block-mapping fix.
>
> Yes, I've got a small change in patch #1: the crashkernel reservation has to be
> performed before request_standard_resources() is called. That's OK, since we're
> all set up by then; I moved the crashkernel reservation to the end of
> bootmem_init(). I attached the patch. If it's easier for you, I'll send v7.

Please send a v7, otherwise b4 gets confused. Thanks.
Hi,

(sorry for the late response)

On 06/11/2020 18:46, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
>> We also depend on this when skipping the checksum code in purgatory, which can be
>> exceedingly slow.
>
> This one I don't fully understand, so I'll lazily assume the prerequisite is
> the same WRT how memory is mapped. :)

The aim is that it's never normally mapped by the kernel. This is so that if we can't
get rid of the secondary CPUs (e.g. they have IRQs masked), but they are busy
scribbling all over memory, we have a rough guarantee that they aren't scribbling over
the kdump kernel. We can skip the checksum in purgatory, as there is very little risk
of the memory having been corrupted.

> Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> prerequisite.

Yeah, this lets you release PAGE_SIZEs back to the allocator, which means the
marked-invalid page tables we have hidden there need to be PAGE_SIZE mappings.

Thanks,

James

> Keep in mind that acpi_table_upgrade() and unflatten_device_tree() depend on
> having the linear mappings available. I don't see any simple way of solving
> this. Both moving the firmware description routines to use fixmap and
> correcting the linear mapping further down the line so as to include kdump's
> regions seem excessive/impossible (feel free to correct me here). I'd be happy
> to hear suggestions. Otherwise we're back to hard-coding the information as we
> initially did.
>
> Let me stress that knowing the DMA constraints in the system before reserving
> crashkernel's regions is necessary if we ever want it to work seamlessly on all
> platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> memory.
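To illustrate the shrink path James refers to: writing a smaller value to
/sys/kernel/kexec_crash_size hands the released tail of the reservation back
to the buddy allocator one page at a time. A hypothetical sketch (the
release_crash_pages name is made up; the real work lives in the generic kexec
core), showing why the region must be mappable and unmappable at page
granularity:

	static void release_crash_pages(unsigned long begin, unsigned long end)
	{
		unsigned long addr;

		/* Each page goes back to the buddy allocator individually,
		 * so the region cannot be covered by a block mapping. */
		for (addr = begin; addr < end; addr += PAGE_SIZE)
			free_reserved_page(phys_to_page(addr));
	}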
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 095540667f0f..fc4ab0d6d5d2 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
 	else
 		arm64_dma32_phys_limit = PHYS_MASK + 1;
 
-	reserve_crashkernel();
-
 	reserve_elfcorehdr();
 
 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
@@ -508,6 +506,8 @@ void __init mem_init(void)
 	else
 		swiotlb_force = SWIOTLB_NO_FORCE;
 
+	reserve_crashkernel();
+
 	set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);
 
 #ifndef CONFIG_SPARSEMEM_VMEMMAP