diff mbox series

[1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

Message ID 20220828005545.94389-2-bhe@redhat.com (mailing list archive)
State New
Headers show
Series arm64, kdump: enforce to take 4G as the crashkernel low memory end | expand

Commit Message

Baoquan He Aug. 28, 2022, 12:55 a.m. UTC
Problem:
=======
On arm64, block and section mapping is supported to build page tables.
However, currently it enforces to take base page mapping for the whole
linear mapping if CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled and
crashkernel kernel parameter is set. This will cause longer time of the
linear mapping process during bootup and severe performance degradation
during running time.

Root cause:
==========
On arm64, crashkernel reservation relies on knowing the upper limit of
low memory zone because it needs to reserve memory in the zone so that
devices' DMA addressing in kdump kernel can be satisfied. However, the
limit on arm64 is variant. And the upper limit can only be decided late
till bootmem_init() is called.

And we need to map the crashkernel region with base page granularity when
doing linear mapping, because kdump needs to protect the crashkernel region
via set_memory_valid(,0) after kdump kernel loading. However, arm64 doesn't
support well on splitting the built block or section mapping due to some
cpu reststriction [1]. And unfortunately, the linear mapping is done before
bootmem_init().

To resolve the above conflict on arm64, the compromise is enforcing to
take base page mapping for the entire linear mapping if crashkernel is
set, and CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabed. Hence
performance is sacrificed.

Solution:
=========
To fix the problem, we should always take 4G as the crashkernel low
memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
With this, we don't need to defer the crashkernel reservation till
bootmem_init() is called to set the arm64_dma_phys_limit. As long as
memblock init is done, we can conclude what is the upper limit of low
memory zone.

1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or memblock_start_of_DRAM() > 4G
  limit = PHYS_ADDR_MAX+1  (Corner cases)
2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
   limit = 4G  (generic case)

[1]
https://lore.kernel.org/all/YrIIJkhKWSuAqkCx@arm.com/T/#u

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 arch/arm64/mm/init.c | 24 ++++++++++++++----------
 arch/arm64/mm/mmu.c  | 38 ++++++++++++++++++++++----------------
 2 files changed, 36 insertions(+), 26 deletions(-)

Comments

Leizhen (ThunderTown) Aug. 31, 2022, 1:50 a.m. UTC | #1
On 2022/8/28 8:55, Baoquan He wrote:
> Problem:
> =======
> On arm64, block and section mapping is supported to build page tables.
> However, currently it enforces to take base page mapping for the whole
> linear mapping if CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled and
> crashkernel kernel parameter is set. This will cause longer time of the
> linear mapping process during bootup and severe performance degradation
> during running time.
> 
> Root cause:
> ==========
> On arm64, crashkernel reservation relies on knowing the upper limit of
> low memory zone because it needs to reserve memory in the zone so that
> devices' DMA addressing in kdump kernel can be satisfied. However, the
> limit on arm64 is variant. And the upper limit can only be decided late
> till bootmem_init() is called.
> 
> And we need to map the crashkernel region with base page granularity when
> doing linear mapping, because kdump needs to protect the crashkernel region
> via set_memory_valid(,0) after kdump kernel loading. However, arm64 doesn't
> support well on splitting the built block or section mapping due to some
> cpu reststriction [1]. And unfortunately, the linear mapping is done before
> bootmem_init().
> 
> To resolve the above conflict on arm64, the compromise is enforcing to
> take base page mapping for the entire linear mapping if crashkernel is
> set, and CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabed. Hence
> performance is sacrificed.
> 
> Solution:
> =========
> To fix the problem, we should always take 4G as the crashkernel low
> memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> With this, we don't need to defer the crashkernel reservation till
> bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> memblock init is done, we can conclude what is the upper limit of low
> memory zone.
> 
> 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or memblock_start_of_DRAM() > 4G
>   limit = PHYS_ADDR_MAX+1  (Corner cases)
> 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
>    limit = 4G  (generic case)

If Raspberry Pi 4 doesn't need to be considered:
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>

In fact, I also think, for the sake of a niche scene, sacrificing the mass scene,
the cart is put before the horse.


> 
> [1]
> https://lore.kernel.org/all/YrIIJkhKWSuAqkCx@arm.com/T/#u
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  arch/arm64/mm/init.c | 24 ++++++++++++++----------
>  arch/arm64/mm/mmu.c  | 38 ++++++++++++++++++++++----------------
>  2 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index b9af30be813e..8ae55afdd11c 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -90,10 +90,22 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
>  phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>  #endif
>  
> +static phys_addr_t __init crash_addr_low_max(void)
> +{
> +	phys_addr_t low_mem_mask = U32_MAX;
> +	phys_addr_t phys_start = memblock_start_of_DRAM();
> +
> +	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> +	     (phys_start > U32_MAX))
> +		low_mem_mask = PHYS_ADDR_MAX;
> +
> +	return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;
> +}
> +
>  /* Current arm64 boot protocol requires 2MB alignment */
>  #define CRASH_ALIGN			SZ_2M
>  
> -#define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
> +#define CRASH_ADDR_LOW_MAX		crash_addr_low_max()
>  #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
>  
>  static int __init reserve_crashkernel_low(unsigned long long low_size)
> @@ -389,8 +401,7 @@ void __init arm64_memblock_init(void)
>  
>  	early_init_fdt_scan_reserved_mem();
>  
> -	if (!defer_reserve_crashkernel())
> -		reserve_crashkernel();
> +	reserve_crashkernel();
>  
>  	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
>  }
> @@ -434,13 +445,6 @@ void __init bootmem_init(void)
>  	 */
>  	dma_contiguous_reserve(arm64_dma_phys_limit);
>  
> -	/*
> -	 * request_standard_resources() depends on crashkernel's memory being
> -	 * reserved, so do it here.
> -	 */
> -	if (defer_reserve_crashkernel())
> -		reserve_crashkernel();
> -
>  	memblock_dump_all();
>  }
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e7ad44585f40..cdd338fa2115 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -547,13 +547,12 @@ static void __init map_mem(pgd_t *pgdp)
>  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
>  
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map) {
> -		if (defer_reserve_crashkernel())
> -			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> -		else if (crashk_res.end)
> -			memblock_mark_nomap(crashk_res.start,
> -			    resource_size(&crashk_res));
> -	}
> +	if (crashk_res.end)
> +		memblock_mark_nomap(crashk_res.start,
> +				    resource_size(&crashk_res));
> +	if (crashk_low_res.end)
> +		memblock_mark_nomap(crashk_low_res.start,
> +				    resource_size(&crashk_low_res));
>  #endif
>  
>  	/* map all the memory banks */
> @@ -589,16 +588,23 @@ static void __init map_mem(pgd_t *pgdp)
>  	 * through /sys/kernel/kexec_crash_size interface.
>  	 */
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map && !defer_reserve_crashkernel()) {
> -		if (crashk_res.end) {
> -			__map_memblock(pgdp, crashk_res.start,
> -				       crashk_res.end + 1,
> -				       PAGE_KERNEL,
> -				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> -			memblock_clear_nomap(crashk_res.start,
> -					     resource_size(&crashk_res));
> -		}
> +	if (crashk_res.end) {
> +		__map_memblock(pgdp, crashk_res.start,
> +			       crashk_res.end + 1,
> +			       PAGE_KERNEL,
> +			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> +		memblock_clear_nomap(crashk_res.start,
> +				     resource_size(&crashk_res));
>  	}
> +	if (crashk_low_res.end) {
> +		__map_memblock(pgdp, crashk_low_res.start,
> +			       crashk_low_res.end + 1,
> +			       PAGE_KERNEL,
> +			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> +		memblock_clear_nomap(crashk_low_res.start,
> +				     resource_size(&crashk_low_res));
> +	}
> +
>  #endif
>  }
>  
>
Mike Rapoport Aug. 31, 2022, 7:37 a.m. UTC | #2
On Sun, Aug 28, 2022 at 08:55:44AM +0800, Baoquan He wrote:
> Problem:
> =======
> On arm64, block and section mapping is supported to build page tables.
> However, currently it enforces to take base page mapping for the whole
> linear mapping if CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled and
> crashkernel kernel parameter is set. This will cause longer time of the
> linear mapping process during bootup and severe performance degradation
> during running time.
> 
> Root cause:
> ==========
> On arm64, crashkernel reservation relies on knowing the upper limit of
> low memory zone because it needs to reserve memory in the zone so that
> devices' DMA addressing in kdump kernel can be satisfied. However, the
> limit on arm64 is variant. And the upper limit can only be decided late
> till bootmem_init() is called.
> 
> And we need to map the crashkernel region with base page granularity when
> doing linear mapping, because kdump needs to protect the crashkernel region
> via set_memory_valid(,0) after kdump kernel loading. However, arm64 doesn't
> support well on splitting the built block or section mapping due to some
> cpu reststriction [1]. And unfortunately, the linear mapping is done before
> bootmem_init().
> 
> To resolve the above conflict on arm64, the compromise is enforcing to
> take base page mapping for the entire linear mapping if crashkernel is
> set, and CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabed. Hence
> performance is sacrificed.
> 
> Solution:
> =========
> To fix the problem, we should always take 4G as the crashkernel low
> memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> With this, we don't need to defer the crashkernel reservation till
> bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> memblock init is done, we can conclude what is the upper limit of low
> memory zone.
> 
> 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or memblock_start_of_DRAM() > 4G
>   limit = PHYS_ADDR_MAX+1  (Corner cases)

Why these are corner cases? 
The case when CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled is the
simplest one because it does not require the whole dancing around
arm64_dma_phys_limit initialization.

And AFAIK, memblock_start_of_DRAM() > 4G is not uncommon on arm64, but it
does not matter for device DMA addressing.

The actual corner cases are systems with ZONE_DMA/DMA32 and with <32 bits
limit for device DMA addressing (e.g RPi 4). I think the changelog should
mention that to use kdump on these devices user must specify
crashkernel=X@Y 

> 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
>    limit = 4G  (generic case)
> 
> [1]
> https://lore.kernel.org/all/YrIIJkhKWSuAqkCx@arm.com/T/#u
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  arch/arm64/mm/init.c | 24 ++++++++++++++----------
>  arch/arm64/mm/mmu.c  | 38 ++++++++++++++++++++++----------------
>  2 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index b9af30be813e..8ae55afdd11c 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -90,10 +90,22 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
>  phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>  #endif

Please also update the comment above this hunk.

> +static phys_addr_t __init crash_addr_low_max(void)
> +{
> +	phys_addr_t low_mem_mask = U32_MAX;
> +	phys_addr_t phys_start = memblock_start_of_DRAM();
> +
> +	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> +	     (phys_start > U32_MAX))
> +		low_mem_mask = PHYS_ADDR_MAX;
> +
> +	return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;

Since RAM frequently starts on non-zero address the limit for systems with
ZONE_DMA/DMA32 should be memblock_start_of_DRAM() + 4G. There is no need to
take into the account the end of DRAM, memblock allocation will take care
of that. I'd suggest to simplify crash_addr_low_max() to be

static phys_addr_t __init crash_addr_low_max(void)
{
	if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
		return memblock_start_of_DRAM() + SZ_4G;

	return PHYS_ADDR_MAX;
}

> +}
> +
>  /* Current arm64 boot protocol requires 2MB alignment */
>  #define CRASH_ALIGN			SZ_2M
>  
> -#define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
> +#define CRASH_ADDR_LOW_MAX		crash_addr_low_max()

With introduction of crash_addr_low_max() I think it's better to get rid of
the CRASH_ADDR_LOW_MAX and use local variables in reserve_crashkernel() and
reserve_crashkernel_low() that would get initialized to
crash_addr_low_max().

Besides, #ifdef around arm64_dma_phys_limit declaration can go away because
this variable will be used only after it is initialized in
zone_sizes_init().

>  #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
>  
>  static int __init reserve_crashkernel_low(unsigned long long low_size)
> @@ -389,8 +401,7 @@ void __init arm64_memblock_init(void)
>  
>  	early_init_fdt_scan_reserved_mem();
>  
> -	if (!defer_reserve_crashkernel())
> -		reserve_crashkernel();
> +	reserve_crashkernel();
>  
>  	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
>  }
> @@ -434,13 +445,6 @@ void __init bootmem_init(void)
>  	 */
>  	dma_contiguous_reserve(arm64_dma_phys_limit);
>  
> -	/*
> -	 * request_standard_resources() depends on crashkernel's memory being
> -	 * reserved, so do it here.
> -	 */
> -	if (defer_reserve_crashkernel())
> -		reserve_crashkernel();
> -
>  	memblock_dump_all();
>  }
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e7ad44585f40..cdd338fa2115 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -547,13 +547,12 @@ static void __init map_mem(pgd_t *pgdp)
>  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
>  
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map) {
> -		if (defer_reserve_crashkernel())
> -			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> -		else if (crashk_res.end)
> -			memblock_mark_nomap(crashk_res.start,
> -			    resource_size(&crashk_res));
> -	}
> +	if (crashk_res.end)
> +		memblock_mark_nomap(crashk_res.start,
> +				    resource_size(&crashk_res));
> +	if (crashk_low_res.end)
> +		memblock_mark_nomap(crashk_low_res.start,
> +				    resource_size(&crashk_low_res));
>  #endif
>  
>  	/* map all the memory banks */
> @@ -589,16 +588,23 @@ static void __init map_mem(pgd_t *pgdp)
>  	 * through /sys/kernel/kexec_crash_size interface.
>  	 */
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map && !defer_reserve_crashkernel()) {
> -		if (crashk_res.end) {
> -			__map_memblock(pgdp, crashk_res.start,
> -				       crashk_res.end + 1,
> -				       PAGE_KERNEL,
> -				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> -			memblock_clear_nomap(crashk_res.start,
> -					     resource_size(&crashk_res));
> -		}
> +	if (crashk_res.end) {
> +		__map_memblock(pgdp, crashk_res.start,
> +			       crashk_res.end + 1,
> +			       PAGE_KERNEL,
> +			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> +		memblock_clear_nomap(crashk_res.start,
> +				     resource_size(&crashk_res));
>  	}
> +	if (crashk_low_res.end) {
> +		__map_memblock(pgdp, crashk_low_res.start,
> +			       crashk_low_res.end + 1,
> +			       PAGE_KERNEL,
> +			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> +		memblock_clear_nomap(crashk_low_res.start,
> +				     resource_size(&crashk_low_res));
> +	}
> +
>  #endif
>  }
>  
> -- 
> 2.34.1
> 
>
Baoquan He Aug. 31, 2022, 2:29 p.m. UTC | #3
On 08/31/22 at 10:37am, Mike Rapoport wrote:
> On Sun, Aug 28, 2022 at 08:55:44AM +0800, Baoquan He wrote:
> > Problem:
> > =======
> > On arm64, block and section mapping is supported to build page tables.
> > However, currently it enforces to take base page mapping for the whole
> > linear mapping if CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled and
> > crashkernel kernel parameter is set. This will cause longer time of the
> > linear mapping process during bootup and severe performance degradation
> > during running time.
> > 
> > Root cause:
> > ==========
> > On arm64, crashkernel reservation relies on knowing the upper limit of
> > low memory zone because it needs to reserve memory in the zone so that
> > devices' DMA addressing in kdump kernel can be satisfied. However, the
> > limit on arm64 is variant. And the upper limit can only be decided late
> > till bootmem_init() is called.
> > 
> > And we need to map the crashkernel region with base page granularity when
> > doing linear mapping, because kdump needs to protect the crashkernel region
> > via set_memory_valid(,0) after kdump kernel loading. However, arm64 doesn't
> > support well on splitting the built block or section mapping due to some
> > cpu reststriction [1]. And unfortunately, the linear mapping is done before
> > bootmem_init().
> > 
> > To resolve the above conflict on arm64, the compromise is enforcing to
> > take base page mapping for the entire linear mapping if crashkernel is
> > set, and CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabed. Hence
> > performance is sacrificed.
> > 
> > Solution:
> > =========
> > To fix the problem, we should always take 4G as the crashkernel low
> > memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> > With this, we don't need to defer the crashkernel reservation till
> > bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> > memblock init is done, we can conclude what is the upper limit of low
> > memory zone.
> > 
> > 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or memblock_start_of_DRAM() > 4G
> >   limit = PHYS_ADDR_MAX+1  (Corner cases)
> 
> Why these are corner cases? 
> The case when CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled is the
> simplest one because it does not require the whole dancing around
> arm64_dma_phys_limit initialization.
> 
> And AFAIK, memblock_start_of_DRAM() > 4G is not uncommon on arm64, but it
> does not matter for device DMA addressing.

Thanks for reviewing.

I could be wrong and have misunderstanding about corner case.

With my understanding, both ZONE_DMA and ZONE_DMA32 are enabled by
default in kernel. And on distros, I believe they are on too. The both
ZONE_DMA and ZONE_DMA32 disabled case should only exist on one specific
product, and the memblock_start_of_DRAM() > 4G case too. At least, I
haven't seen one in our LAB. What I thought the non generic as corner
case could be wrong. I will change that phrasing.

mm/Kconfig:
config ZONE_DMA
        bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
        default y if ARM64 || X86

config ZONE_DMA32
        bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
        depends on !X86_32
        default y if ARM64

> 
> The actual corner cases are systems with ZONE_DMA/DMA32 and with <32 bits
> limit for device DMA addressing (e.g RPi 4). I think the changelog should

Right, RPi4's 30bit DMA addressing device is corner case.

> mention that to use kdump on these devices user must specify
> crashkernel=X@Y 

Makes sense. I will add words in log, and add sentences to
mention that in code comment or some place of document.
Thanks for advice.

> 
> > 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
> >    limit = 4G  (generic case)
> > 
> > [1]
> > https://lore.kernel.org/all/YrIIJkhKWSuAqkCx@arm.com/T/#u
> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  arch/arm64/mm/init.c | 24 ++++++++++++++----------
> >  arch/arm64/mm/mmu.c  | 38 ++++++++++++++++++++++----------------
> >  2 files changed, 36 insertions(+), 26 deletions(-)
> > 
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index b9af30be813e..8ae55afdd11c 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -90,10 +90,22 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
> >  phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
> >  #endif
> 
> Please also update the comment above this hunk.

Sure, will do.

> 
> > +static phys_addr_t __init crash_addr_low_max(void)
> > +{
> > +	phys_addr_t low_mem_mask = U32_MAX;
> > +	phys_addr_t phys_start = memblock_start_of_DRAM();
> > +
> > +	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > +	     (phys_start > U32_MAX))
> > +		low_mem_mask = PHYS_ADDR_MAX;
> > +
> > +	return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;
> 
> Since RAM frequently starts on non-zero address the limit for systems with
> ZONE_DMA/DMA32 should be memblock_start_of_DRAM() + 4G. There is no need to

It may not be right for memblock_start_of_DRAM(). On most of arm64
servers I ever tested, their memblock usually starts from a higher
address, but not zero which is like x86. E.g below memory ranges printed
on an ampere-mtsnow-altra system, the starting addr is 0x83000000. With
my understanding, DMA addressing bits correspond to the cpu logical
address range devices can address. So memblock_start_of_DRAM() + 4G
seems not right for normal system, and not right for system which
starting physical address is above 4G. I refer to max_zone_phys() of
arch/arm64/mm/init.c when implementing crash_addr_low_max(). Please
correct me if I am wrong.

[  +0.000000] Zone ranges:
[  +0.000000]   DMA      [mem 0x0000000088300000-0x00000000ffffffff]
[  +0.000000]   DMA32    empty
[  +0.000000]   Normal   [mem 0x0000000100000000-0x00000817ffffffff]
[  +0.000000]   Device   empty
[  +0.000000] Movable zone start for each node
[  +0.000000] Early memory node ranges
[  +0.000000]   node   0: [mem 0x0000000088300000-0x00000000883fffff]
[  +0.000000]   node   0: [mem 0x0000000090000000-0x0000000091ffffff]
[  +0.000000]   node   0: [mem 0x0000000092000000-0x0000000093ffffff]
[  +0.000000]   node   0: [mem 0x0000000094000000-0x00000000ebc18fff]
[  +0.000000]   node   0: [mem 0x00000000ebc19000-0x00000000ebfbcfff]
[  +0.000000]   node   0: [mem 0x00000000ebfbd000-0x00000000ebfbdfff]
[  +0.000000]   node   0: [mem 0x00000000ebfbe000-0x00000000ebfbffff]
[  +0.000000]   node   0: [mem 0x00000000ebfc0000-0x00000000ec1dffff]
[  +0.000000]   node   0: [mem 0x00000000ec1e0000-0x00000000ec1effff]
[  +0.000000]   node   0: [mem 0x00000000ec1f0000-0x00000000ee5effff]
[  +0.000000]   node   0: [mem 0x00000000ee5f0000-0x00000000f765ffff]
[  +0.000000]   node   0: [mem 0x00000000f7660000-0x00000000f784ffff]
[  +0.000000]   node   0: [mem 0x00000000f7850000-0x00000000f7fdffff]
[  +0.000000]   node   0: [mem 0x00000000f7fe0000-0x00000000ffc8efff]
[  +0.000000]   node   0: [mem 0x00000000ffc8f000-0x00000000ffc8ffff]
[  +0.000000]   node   0: [mem 0x00000000ffc90000-0x00000000ffffffff]
[  +0.000000]   node   0: [mem 0x0000080000000000-0x000008007fffffff]
[  +0.000000]   node   0: [mem 0x0000080100000000-0x00000817ffffffff]

> take into the account the end of DRAM, memblock allocation will take care
> of that. I'd suggest to simplify crash_addr_low_max() to be
> 
> static phys_addr_t __init crash_addr_low_max(void)
> {
>       if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
>               return memblock_start_of_DRAM() + SZ_4G;
> 
>       return PHYS_ADDR_MAX;
> }
> 
> > +}
> > +
> >  /* Current arm64 boot protocol requires 2MB alignment */
> >  #define CRASH_ALIGN                        SZ_2M
> >  
> > -#define CRASH_ADDR_LOW_MAX         arm64_dma_phys_limit
> > +#define CRASH_ADDR_LOW_MAX         crash_addr_low_max()
> 
> With introduction of crash_addr_low_max() I think it's better to get rid of
> the CRASH_ADDR_LOW_MAX and use local variables in reserve_crashkernel() and
> reserve_crashkernel_low() that would get initialized to
> crash_addr_low_max().

CRASH_ADDR_LOW_MAX is introduced because we expected to make the
crashkernel reservation code generic and move into kernel/crash_core.c.
The original plan is to deduplicate the x86_64 and arm64 part, seems
it's hard to do now since arm64 has specific handling different than
x86. I think we can remove it now and can add it back if possible.

> 
> Besides, #ifdef around arm64_dma_phys_limit declaration can go away because
> this variable will be used only after it is initialized in
> zone_sizes_init().

Right, I will clean it up. Thanks again for careful reviewing.

> 
> >  #define CRASH_ADDR_HIGH_MAX                (PHYS_MASK + 1)
> >  
> >  static int __init reserve_crashkernel_low(unsigned long long low_size)
> > @@ -389,8 +401,7 @@ void __init arm64_memblock_init(void)
> >  
> >     early_init_fdt_scan_reserved_mem();
> >  
> > -   if (!defer_reserve_crashkernel())
> > -           reserve_crashkernel();
> > +   reserve_crashkernel();
> >  
> >     high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> >  }
> > @@ -434,13 +445,6 @@ void __init bootmem_init(void)
> >      */
> >     dma_contiguous_reserve(arm64_dma_phys_limit);
> >  
> > -   /*
> > -    * request_standard_resources() depends on crashkernel's memory being
> > -    * reserved, so do it here.
> > -    */
> > -   if (defer_reserve_crashkernel())
> > -           reserve_crashkernel();
> > -
> >     memblock_dump_all();
> >  }
> >  
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index e7ad44585f40..cdd338fa2115 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -547,13 +547,12 @@ static void __init map_mem(pgd_t *pgdp)
> >     memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> >  
> >  #ifdef CONFIG_KEXEC_CORE
> > -   if (crash_mem_map) {
> > -           if (defer_reserve_crashkernel())
> > -                   flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > -           else if (crashk_res.end)
> > -                   memblock_mark_nomap(crashk_res.start,
> > -                       resource_size(&crashk_res));
> > -   }
> > +   if (crashk_res.end)
> > +           memblock_mark_nomap(crashk_res.start,
> > +                               resource_size(&crashk_res));
> > +   if (crashk_low_res.end)
> > +           memblock_mark_nomap(crashk_low_res.start,
> > +                               resource_size(&crashk_low_res));
> >  #endif
> >  
> >     /* map all the memory banks */
> > @@ -589,16 +588,23 @@ static void __init map_mem(pgd_t *pgdp)
> >      * through /sys/kernel/kexec_crash_size interface.
> >      */
> >  #ifdef CONFIG_KEXEC_CORE
> > -   if (crash_mem_map && !defer_reserve_crashkernel()) {
> > -           if (crashk_res.end) {
> > -                   __map_memblock(pgdp, crashk_res.start,
> > -                                  crashk_res.end + 1,
> > -                                  PAGE_KERNEL,
> > -                                  NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > -                   memblock_clear_nomap(crashk_res.start,
> > -                                        resource_size(&crashk_res));
> > -           }
> > +   if (crashk_res.end) {
> > +           __map_memblock(pgdp, crashk_res.start,
> > +                          crashk_res.end + 1,
> > +                          PAGE_KERNEL,
> > +                          NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > +           memblock_clear_nomap(crashk_res.start,
> > +                                resource_size(&crashk_res));
> >     }
> > +
> >  #endif
> >  }
> >  
> > -- 
> > 2.34.1
> > 
> > 
> 
> -- 
> Sincerely yours,
> Mike.
>
Mike Rapoport Sept. 1, 2022, 7:24 a.m. UTC | #4
On Wed, Aug 31, 2022 at 10:29:39PM +0800, Baoquan He wrote:
> On 08/31/22 at 10:37am, Mike Rapoport wrote:
> > On Sun, Aug 28, 2022 at 08:55:44AM +0800, Baoquan He wrote:
> > > 
> > > Solution:
> > > =========
> > > To fix the problem, we should always take 4G as the crashkernel low
> > > memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> > > With this, we don't need to defer the crashkernel reservation till
> > > bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> > > memblock init is done, we can conclude what is the upper limit of low
> > > memory zone.
> > > 
> > > 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or memblock_start_of_DRAM() > 4G
> > >   limit = PHYS_ADDR_MAX+1  (Corner cases)
> > 
> > Why these are corner cases? 
> > The case when CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled is the
> > simplest one because it does not require the whole dancing around
> > arm64_dma_phys_limit initialization.
> > 
> > And AFAIK, memblock_start_of_DRAM() > 4G is not uncommon on arm64, but it
> > does not matter for device DMA addressing.
> 
> Thanks for reviewing.
> 
> I could be wrong and have misunderstanding about corner case.
> 
> With my understanding, both ZONE_DMA and ZONE_DMA32 are enabled by
> default in kernel. And on distros, I believe they are on too. The both
> ZONE_DMA and ZONE_DMA32 disabled case should only exist on one specific
> product, and the memblock_start_of_DRAM() > 4G case too. At least, I
> haven't seen one in our LAB. What I thought the non generic as corner
> case could be wrong. I will change that phrasing.
> 
> mm/Kconfig:
> config ZONE_DMA
>         bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
>         default y if ARM64 || X86
> 
> config ZONE_DMA32
>         bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
>         depends on !X86_32
>         default y if ARM64

My point was that the cases with ZONE_DMA/DMA32 disabled or with RAM above
4G do not require detection of arm64_dma_phys_limit before reserving the
crash kernel, can use predefined constants and are simple to handle.
 
> > The actual corner cases are systems with ZONE_DMA/DMA32 and with <32 bits
> > limit for device DMA addressing (e.g RPi 4). I think the changelog should
> 
> Right, RPi4's 30bit DMA addressing device is corner case.
> 
> > mention that to use kdump on these devices user must specify
> > crashkernel=X@Y 
> 
> Makes sense. I will add words in log, and add sentences to
> mention that in code comment or some place of document.
> Thanks for advice.
> 
> > 
> > > 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
> > >    limit = 4G  (generic case)
> > > 

...

> > > +static phys_addr_t __init crash_addr_low_max(void)
> > > +{
> > > +	phys_addr_t low_mem_mask = U32_MAX;
> > > +	phys_addr_t phys_start = memblock_start_of_DRAM();
> > > +
> > > +	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > > +	     (phys_start > U32_MAX))
> > > +		low_mem_mask = PHYS_ADDR_MAX;
> > > +
> > > +	return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;
> > 
> > Since RAM frequently starts on non-zero address the limit for systems with
> > ZONE_DMA/DMA32 should be memblock_start_of_DRAM() + 4G. There is no need to
> 
> It may not be right for memblock_start_of_DRAM(). On most of arm64
> servers I ever tested, their memblock usually starts from a higher
> address, but not zero which is like x86. E.g below memory ranges printed
> on an ampere-mtsnow-altra system, the starting addr is 0x83000000. With
> my understanding, DMA addressing bits correspond to the cpu logical
> address range devices can address. So memblock_start_of_DRAM() + 4G
> seems not right for normal system, and not right for system which
> starting physical address is above 4G. I refer to max_zone_phys() of
> arch/arm64/mm/init.c when implementing crash_addr_low_max(). Please
> correct me if I am wrong.

My understanding was that no matter where DRAM starts, the first 4G would
be accessible by 32-bit devices, but I maybe wrong as well :)

I haven't notice you used max_zone_phys() as a reference. Wouldn't it be
simpler to just call it from crash_addr_low_max():

static phys_addr_t __init crash_addr_low_max(void)
{
	return max_zone_phys(32);
}
Baoquan He Sept. 1, 2022, 12:25 p.m. UTC | #5
On 09/01/22 at 10:24am, Mike Rapoport wrote:
> On Wed, Aug 31, 2022 at 10:29:39PM +0800, Baoquan He wrote:
> > On 08/31/22 at 10:37am, Mike Rapoport wrote:
> > > On Sun, Aug 28, 2022 at 08:55:44AM +0800, Baoquan He wrote:
> > > > 
> > > > Solution:
> > > > =========
> > > > To fix the problem, we should always take 4G as the crashkernel low
> > > > memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> > > > With this, we don't need to defer the crashkernel reservation till
> > > > bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> > > > memblock init is done, we can conclude what is the upper limit of low
> > > > memory zone.
> > > > 
> > > > 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or memblock_start_of_DRAM() > 4G
> > > >   limit = PHYS_ADDR_MAX+1  (Corner cases)
> > > 
> > > Why these are corner cases? 
> > > The case when CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled is the
> > > simplest one because it does not require the whole dancing around
> > > arm64_dma_phys_limit initialization.
> > > 
> > > And AFAIK, memblock_start_of_DRAM() > 4G is not uncommon on arm64, but it
> > > does not matter for device DMA addressing.
> > 
> > Thanks for reviewing.
> > 
> > I could be wrong and have misunderstanding about corner case.
> > 
> > With my understanding, both ZONE_DMA and ZONE_DMA32 are enabled by
> > default in kernel. And on distros, I believe they are on too. The both
> > ZONE_DMA and ZONE_DMA32 disabled case should only exist on one specific
> > product, and the memblock_start_of_DRAM() > 4G case too. At least, I
> > haven't seen one in our LAB. What I thought the non generic as corner
> > case could be wrong. I will change that phrasing.
> > 
> > mm/Kconfig:
> > config ZONE_DMA
> >         bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
> >         default y if ARM64 || X86
> > 
> > config ZONE_DMA32
> >         bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
> >         depends on !X86_32
> >         default y if ARM64
> 
> My point was that the cases with ZONE_DMA/DMA32 disabled or with RAM above
> 4G do not require detection of arm64_dma_phys_limit before reserving the
> crash kernel, can use predefined constants and are simple to handle.

I see.

>  
> > > The actual corner cases are systems with ZONE_DMA/DMA32 and with <32 bits
> > > limit for device DMA addressing (e.g RPi 4). I think the changelog should
> > 
> > Right, RPi4's 30bit DMA addressing device is corner case.
> > 
> > > mention that to use kdump on these devices user must specify
> > > crashkernel=X@Y 
> > 
> > Makes sense. I will add words in log, and add sentences to
> > mention that in code comment or some place of document.
> > Thanks for advice.
> > 
> > > 
> > > > 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
> > > >    limit = 4G  (generic case)
> > > > 
> 
> ...
> 
> > > > +static phys_addr_t __init crash_addr_low_max(void)
> > > > +{
> > > > +	phys_addr_t low_mem_mask = U32_MAX;
> > > > +	phys_addr_t phys_start = memblock_start_of_DRAM();
> > > > +
> > > > +	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > > > +	     (phys_start > U32_MAX))
> > > > +		low_mem_mask = PHYS_ADDR_MAX;
> > > > +
> > > > +	return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;
> > > 
> > > Since RAM frequently starts on non-zero address the limit for systems with
> > > ZONE_DMA/DMA32 should be memblock_start_of_DRAM() + 4G. There is no need to
> > 
> > It may not be right for memblock_start_of_DRAM(). On most of arm64
> > servers I ever tested, their memblock usually starts from a higher
> > address, but not zero which is like x86. E.g below memory ranges printed
> > on an ampere-mtsnow-altra system, the starting addr is 0x83000000. With
> > my understanding, DMA addressing bits correspond to the cpu logical
> > address range devices can address. So memblock_start_of_DRAM() + 4G
> > seems not right for normal system, and not right for system which
> > starting physical address is above 4G. I refer to max_zone_phys() of
> > arch/arm64/mm/init.c when implementing crash_addr_low_max(). Please
> > correct me if I am wrong.
> 
> My understanding was that no matter where DRAM starts, the first 4G would
> be accessible by 32-bit devices, but I maybe wrong as well :)
> 
> I haven't notice you used max_zone_phys() as a reference. Wouldn't it be
> simpler to just call it from crash_addr_low_max():
> 
> static phys_addr_t __init crash_addr_low_max(void)
> {
> 	return max_zone_phys(32);
> }

max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
it like:

static phys_addr_t __init crash_addr_low_max(void)
{
        phys_addr_t low_mem_mask = U32_MAX;
        phys_addr_t phys_start = memblock_start_of_DRAM();

        if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
             (phys_start > U32_MAX))
                low_mem_mask = PHYS_ADDR_MAX;

        return low_mem_mast + 1;
}

or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
as you suggested. Which one do you like better?

static phys_addr_t __init crash_addr_low_max(void)
{
        if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
		return PHYS_ADDR_MAX + 1;

        return max_zone_phys(32);
}
Mike Rapoport Sept. 5, 2022, 10:28 a.m. UTC | #6
On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> On 09/01/22 at 10:24am, Mike Rapoport wrote:
> 
> max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
> it like:
> 
> static phys_addr_t __init crash_addr_low_max(void)
> {
>         phys_addr_t low_mem_mask = U32_MAX;
>         phys_addr_t phys_start = memblock_start_of_DRAM();
> 
>         if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
>              (phys_start > U32_MAX))
>                 low_mem_mask = PHYS_ADDR_MAX;
> 
>         return low_mem_mast + 1;
> }
> 
> or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> as you suggested. Which one do you like better?
> 
> static phys_addr_t __init crash_addr_low_max(void)
> {
>         if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> 		return PHYS_ADDR_MAX + 1;
> 
>         return max_zone_phys(32);
> }
 
I like the second variant better.
Baoquan He Sept. 5, 2022, 12:08 p.m. UTC | #7
On 09/05/22 at 01:28pm, Mike Rapoport wrote:
> On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> > On 09/01/22 at 10:24am, Mike Rapoport wrote:
> > 
> > max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> > the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
> > it like:
> > 
> > static phys_addr_t __init crash_addr_low_max(void)
> > {
> >         phys_addr_t low_mem_mask = U32_MAX;
> >         phys_addr_t phys_start = memblock_start_of_DRAM();
> > 
> >         if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> >              (phys_start > U32_MAX))
> >                 low_mem_mask = PHYS_ADDR_MAX;
> > 
> >         return low_mem_mast + 1;
> > }
> > 
> > or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> > as you suggested. Which one do you like better?
> > 
> > static phys_addr_t __init crash_addr_low_max(void)
> > {
> >         if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> > 		return PHYS_ADDR_MAX + 1;
> > 
> >         return max_zone_phys(32);
> > }
>  
> I like the second variant better.

Sure, will change to use the 2nd one . Thanks.
Ard Biesheuvel Sept. 6, 2022, 1:05 p.m. UTC | #8
On Mon, 5 Sept 2022 at 14:08, Baoquan He <bhe@redhat.com> wrote:
>
> On 09/05/22 at 01:28pm, Mike Rapoport wrote:
> > On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> > > On 09/01/22 at 10:24am, Mike Rapoport wrote:
> > >
> > > max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> > > the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
> > > it like:
> > >
> > > static phys_addr_t __init crash_addr_low_max(void)
> > > {
> > >         phys_addr_t low_mem_mask = U32_MAX;
> > >         phys_addr_t phys_start = memblock_start_of_DRAM();
> > >
> > >         if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > >              (phys_start > U32_MAX))
> > >                 low_mem_mask = PHYS_ADDR_MAX;
> > >
> > >         return low_mem_mast + 1;
> > > }
> > >
> > > or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> > > as you suggested. Which one do you like better?
> > >
> > > static phys_addr_t __init crash_addr_low_max(void)
> > > {
> > >         if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> > >             return PHYS_ADDR_MAX + 1;
> > >
> > >         return max_zone_phys(32);
> > > }
> >
> > I like the second variant better.
>
> Sure, will change to use the 2nd one . Thanks.
>

While I appreciate the effort that has gone into solving this problem,
I don't think there is any consensus that an elaborate fix is required
to ensure that the crash kernel can be unmapped from the linear map at
all cost. In fact, I personally think we shouldn't bother, and IIRC,
Will made a remark along the same lines back when the Huawei engineers
were still driving this effort.

So perhaps we could align on that before doing yet another version of this?
Baoquan He Sept. 8, 2022, 1:33 p.m. UTC | #9
On 09/06/22 at 03:05pm, Ard Biesheuvel wrote:
> On Mon, 5 Sept 2022 at 14:08, Baoquan He <bhe@redhat.com> wrote:
> >
> > On 09/05/22 at 01:28pm, Mike Rapoport wrote:
> > > On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> > > > On 09/01/22 at 10:24am, Mike Rapoport wrote:
> > > >
> > > > max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> > > > the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
> > > > it like:
> > > >
> > > > static phys_addr_t __init crash_addr_low_max(void)
> > > > {
> > > >         phys_addr_t low_mem_mask = U32_MAX;
> > > >         phys_addr_t phys_start = memblock_start_of_DRAM();
> > > >
> > > >         if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > > >              (phys_start > U32_MAX))
> > > >                 low_mem_mask = PHYS_ADDR_MAX;
> > > >
> > > >         return low_mem_mast + 1;
> > > > }
> > > >
> > > > or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> > > > as you suggested. Which one do you like better?
> > > >
> > > > static phys_addr_t __init crash_addr_low_max(void)
> > > > {
> > > >         if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> > > >             return PHYS_ADDR_MAX + 1;
> > > >
> > > >         return max_zone_phys(32);
> > > > }
> > >
> > > I like the second variant better.
> >
> > Sure, will change to use the 2nd one . Thanks.
> >
> 
> While I appreciate the effort that has gone into solving this problem,
> I don't think there is any consensus that an elaborate fix is required
> to ensure that the crash kernel can be unmapped from the linear map at
> all cost. In fact, I personally think we shouldn't bother, and IIRC,
> Will made a remark along the same lines back when the Huawei engineers
> were still driving this effort.
>
> So perhaps we could align on that before doing yet another version of this?

Yes, certainly. That can save everybody's effort if there's different
opinion. Thanks for looking into this and the suggestion. 

About Will's remark, I checked those discussing threads, guess you are
mentioning the words in link [1]. I copy them at bottom for better
reference. Pleasae correct me if I am wrong.

With my understanding, Will said so because the patch is too complex,
and there's risk that page table kernel data itself is using could share
the same block/section mapping as crashkernel region. With these
two cons, I agree with Will that we would rather take off the protection
on crashkernel region which is done by mapping or unmapping the region,
even though the protection enhances kdump's ronusness.

Crashkernel reservation needs to know the low meory end so that DMA
buffer can be addressed by the dumping target, e.g storage disk. On the
current arm64, we have facts:
1)Currently, except of Raspberry Pi 4, all arm64 systems can support
  32bit DMA addressing. So, except of RPi4, the low memory end can be
  decided after memblock init is done, namely at the end of
  arm64_memblock_init(). We don't need to defer the crashkernel
  reservation until zone_sizes_init() is done. Those cases can be checked
  in patch code.
2)For RPi4, if its storage disk is 30bit DMA addressing, then we can
  use crashkernel=xM@yM to specify reservation location under 1G to
  work around this.

***
Based on above facts, with my patch applied:
pros:
1) Performance issue is resolved;
2) As you can see, the code with this patch applied will much 
  simpler, more straightforward and clearer;
3) The protection can be kept;
4) Crashkernel reservation can be easier to succeed on small memory
   system, e.g virt guest system. The earlier the reservation is done,
   it's more likely to get the whole chunk of meomry.
cons:
1) Only RPi4 is put in inconvenience for crashkernel reservation. It
   needs to use crashkernel=xM@yM to work around.

***
Take off the protection which is done by mapping or unmapping
crashkernel region as you and Will suggested:
pros:
1) Performance issue is resolved;
2) RPi4 will have the same convenience to set crashkernel;

cons:
1) No protection is taken on crashkernel region;
2) Code logic is twisting. There are two places to separately reserve
   crashkernel, one is at the end of arm64_memblock_init(), one is at
   the end of bootmem_init(). 
3) Except of both CONFIG_ZONE_DMA|DMA32 disabled case, crashkernel
   reservation is deferred. On small memory system, e.g virt guest system,
   it increases risk that the resrevation could fail very possibly caused
   by memory fragmentation.

Besides, comparing the above two solutions, I also want to say kdump
is developed for enterprise level of system. We need combine with
reality when considering reasonable solution. E.g on x86_64, it has DMA
zone of 16M and DMA32 zone from 16M to 4G always in normal kernel. For
kdump, we ignore DMA zone directly because it's for ISA style devices. 
Kdump doesn't support ISA style device with only 24bit DMA addressing
capability at the beginning, because it doesn't make sense, we never
hear that an enterprise level of x86_64 system needs to arm with kdump.

Hi Ard, Will, Catalin and other reviewers,

Above is my understaning and thinking about the encountered issue,
plesae help check and point out what's missing or incorrect.

Hi Nicolas,

If it's convenient to you, please help make clear if the storage disk or
network card can only address 32bit DMA buffer on RPi4. Really
appreciate that.

***
[1]Will's remark on Huawei's patch
https://lore.kernel.org/all/20220718131005.GA12406@willie-the-truck/T/#u

====quote Will's remark here
I do not think that this complexity is justified. As I have stated on
numerous occasions already, I would prefer that we leave the crashkernel
mapped when rodata is not "full". That fixes your performance issue and
matches what we do for module code, so I do not see a security argument
against it.

I do not plan to merge this patch as-is.
===
Baoquan He Sept. 8, 2022, 10:55 p.m. UTC | #10
On 09/08/22 at 09:33pm, Baoquan He wrote:
> On 09/06/22 at 03:05pm, Ard Biesheuvel wrote:
> > On Mon, 5 Sept 2022 at 14:08, Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 09/05/22 at 01:28pm, Mike Rapoport wrote:
> > > > On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> > > > > On 09/01/22 at 10:24am, Mike Rapoport wrote:
> > > > >
> > > > > max_zone_phys() only handles cases when CONFIG_ZONE_DMA/DMA32 enabled,
> > > > > the disabledCONFIG_ZONE_DMA/DMA32 case is not included. I can change
> > > > > it like:
> > > > >
> > > > > static phys_addr_t __init crash_addr_low_max(void)
> > > > > {
> > > > >         phys_addr_t low_mem_mask = U32_MAX;
> > > > >         phys_addr_t phys_start = memblock_start_of_DRAM();
> > > > >
> > > > >         if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > > > >              (phys_start > U32_MAX))
> > > > >                 low_mem_mask = PHYS_ADDR_MAX;
> > > > >
> > > > >         return low_mem_mast + 1;
> > > > > }
> > > > >
> > > > > or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> > > > > as you suggested. Which one do you like better?
> > > > >
> > > > > static phys_addr_t __init crash_addr_low_max(void)
> > > > > {
> > > > >         if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> > > > >             return PHYS_ADDR_MAX + 1;
> > > > >
> > > > >         return max_zone_phys(32);
> > > > > }
> > > >
> > > > I like the second variant better.
> > >
> > > Sure, will change to use the 2nd one . Thanks.
> > >
> > 
> > While I appreciate the effort that has gone into solving this problem,
> > I don't think there is any consensus that an elaborate fix is required
> > to ensure that the crash kernel can be unmapped from the linear map at
> > all cost. In fact, I personally think we shouldn't bother, and IIRC,
> > Will made a remark along the same lines back when the Huawei engineers
> > were still driving this effort.
> >
> > So perhaps we could align on that before doing yet another version of this?
> 
> Yes, certainly. That can save everybody's effort if there's different
> opinion. Thanks for looking into this and the suggestion. 
> 
> About Will's remark, I checked those discussing threads, guess you are
> mentioning the words in link [1]. I copy them at bottom for better
> reference. Pleasae correct me if I am wrong.
> 
> With my understanding, Will said so because the patch is too complex,
> and there's risk that page table kernel data itself is using could share
> the same block/section mapping as crashkernel region. With these
> two cons, I agree with Will that we would rather take off the protection
> on crashkernel region which is done by mapping or unmapping the region,
> even though the protection enhances kdump's ronusness.
> 
> Crashkernel reservation needs to know the low meory end so that DMA
> buffer can be addressed by the dumping target, e.g storage disk. On the
> current arm64, we have facts:
> 1)Currently, except of Raspberry Pi 4, all arm64 systems can support
>   32bit DMA addressing. So, except of RPi4, the low memory end can be
>   decided after memblock init is done, namely at the end of
>   arm64_memblock_init(). We don't need to defer the crashkernel
>   reservation until zone_sizes_init() is done. Those cases can be checked
>   in patch code.
> 2)For RPi4, if its storage disk is 30bit DMA addressing, then we can
>   use crashkernel=xM@yM to specify reservation location under 1G to
>   work around this.
> 
> ***
> Based on above facts, with my patch applied:
> pros:
> 1) Performance issue is resolved;
> 2) As you can see, the code with this patch applied will much 
>   simpler, more straightforward and clearer;
> 3) The protection can be kept;
> 4) Crashkernel reservation can be easier to succeed on small memory
>    system, e.g virt guest system. The earlier the reservation is done,
>    it's more likely to get the whole chunk of meomry.
> cons:
> 1) Only RPi4 is put in inconvenience for crashkernel reservation. It
>    needs to use crashkernel=xM@yM to work around.
> 
> ***
> Take off the protection which is done by mapping or unmapping
> crashkernel region as you and Will suggested:
> pros:
> 1) Performance issue is resolved;
> 2) RPi4 will have the same convenience to set crashkernel;
> 
> cons:
> 1) No protection is taken on crashkernel region;
> 2) Code logic is twisting. There are two places to separately reserve
>    crashkernel, one is at the end of arm64_memblock_init(), one is at
>    the end of bootmem_init(). 
> 3) Except of both CONFIG_ZONE_DMA|DMA32 disabled case, crashkernel
>    reservation is deferred. On small memory system, e.g virt guest system,
>    it increases risk that the resrevation could fail very possibly caused
>    by memory fragmentation.
> 
> Besides, comparing the above two solutions, I also want to say kdump
> is developed for enterprise level of system. We need combine with
> reality when considering reasonable solution. E.g on x86_64, it has DMA
> zone of 16M and DMA32 zone from 16M to 4G always in normal kernel. For
> kdump, we ignore DMA zone directly because it's for ISA style devices. 
> Kdump doesn't support ISA style device with only 24bit DMA addressing
> capability at the beginning, because it doesn't make sense, we never
> hear that an enterprise level of x86_64 system needs to arm with kdump.

Sorry, here I mean we never hear that an enterprise level of x86_64
system owns ISA storage disk and needs to arm with kdump.

> 
> Hi Ard, Will, Catalin and other reviewers,
> 
> Above is my understaning and thinking about the encountered issue,
> plesae help check and point out what's missing or incorrect.
> 
> Hi Nicolas,
> 
> If it's convenient to you, please help make clear if the storage disk or
> network card can only address 32bit DMA buffer on RPi4. Really
                                ~~30bit, typo                              
> appreciate that.
> 
> ***
> [1]Will's remark on Huawei's patch
> https://lore.kernel.org/all/20220718131005.GA12406@willie-the-truck/T/#u
> 
> ====quote Will's remark here
> I do not think that this complexity is justified. As I have stated on
> numerous occasions already, I would prefer that we leave the crashkernel
> mapped when rodata is not "full". That fixes your performance issue and
> matches what we do for module code, so I do not see a security argument
> against it.
> 
> I do not plan to merge this patch as-is.
> ===
>
Mike Rapoport Sept. 21, 2022, 7:45 a.m. UTC | #11
On Tue, Sep 06, 2022 at 03:05:57PM +0200, Ard Biesheuvel wrote:
> 
> While I appreciate the effort that has gone into solving this problem,
> I don't think there is any consensus that an elaborate fix is required
> to ensure that the crash kernel can be unmapped from the linear map at
> all cost. In fact, I personally think we shouldn't bother, and IIRC,
> Will made a remark along the same lines back when the Huawei engineers
> were still driving this effort.
> 
> So perhaps we could align on that before doing yet another version of this?

I suggest to start with disabling crash kernel protection when its memory
reservation is deferred and then Baoquan and kdump folks can take it from
here.

From 6430407f784f3571da9b4d79340487f2647a44ab Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Wed, 21 Sep 2022 10:14:46 +0300
Subject: [PATCH] arm64/mm: don't protect crash kernel memory with
 CONFIG_ZONE_DMA/DMA32

Currently, in order to allow protection of crash kernel memory when
CONFIG_ZONE_DMA/DMA32 is enabled, the block mappings in the linear map are
disabled and the entire linear map uses base size pages.

This results in performance degradation because of higher TLB pressure for
kernel memory accesses, so there is a trade off between performance and
ability to protect the crash kernel memory.

Baoquan He said [1]:

	In fact, panic is a small probability event, and accidental
	corruption on kdump kernel data is a much smaller probability
	event.

With this, it makes sense to only protect crash kernel memory only when it
can be reserved before creation of the linear map.

Simplify the logic around crash kernel protection in map_mem() so that it
will use base pages only if crash kernel memory is already reserved and
introduce crashkres_protection_possible variable to ensure that
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres() won't
try to modify page table if crash kernel is not mapped with base pages.

[1] https://lore.kernel.org/all/Yw2C9ahluhX4Mg3G@MiWiFi-R3L-srv

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/arm64/include/asm/mmu.h      |  1 +
 arch/arm64/kernel/machine_kexec.c |  6 ++++
 arch/arm64/mm/init.c              | 30 +++++++++-----------
 arch/arm64/mm/mmu.c               | 46 ++++++++-----------------------
 4 files changed, 32 insertions(+), 51 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466a4be9..975607843548 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
 extern bool kaslr_requires_kpti(void);
+extern bool crashkres_protection_possible;
 
 #define INIT_MM_CONTEXT(name)	\
 	.pgd = init_pg_dir,
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 19c2d487cb08..68295403aa40 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -272,6 +272,9 @@ void arch_kexec_protect_crashkres(void)
 {
 	int i;
 
+	if (!crashkres_protection_possible)
+		return;
+
 	for (i = 0; i < kexec_crash_image->nr_segments; i++)
 		set_memory_valid(
 			__phys_to_virt(kexec_crash_image->segment[i].mem),
@@ -282,6 +285,9 @@ void arch_kexec_unprotect_crashkres(void)
 {
 	int i;
 
+	if (!crashkres_protection_possible)
+		return;
+
 	for (i = 0; i < kexec_crash_image->nr_segments; i++)
 		set_memory_valid(
 			__phys_to_virt(kexec_crash_image->segment[i].mem),
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index b9af30be813e..220d45655918 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -62,27 +62,21 @@ EXPORT_SYMBOL(memstart_addr);
  * In such case, ZONE_DMA32 covers the rest of the 32-bit addressable memory,
  * otherwise it is empty.
  *
- * Memory reservation for crash kernel either done early or deferred
- * depending on DMA memory zones configs (ZONE_DMA) --
+ * Memory reservation for crash kernel must know the upper limit of low
+ * memory in order to allow DMA access for devices with kdump kernel. When
+ * ZONE_DMA/DMA32 is enabled, this limit is determined after DT/ACPI is
+ * parsed, and crash kernel reservation happens afterwards. In this case,
+ * the crash kernel memory is reserved after linear map is created, there
+ * is no guarantee that crash kernel memory will be mapped with the base
+ * pages in the linear map, and thus the protection if the crash kernel
+ * memory is disabled.
  *
  * In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
  * here instead of max_zone_phys().  This lets early reservation of
  * crash kernel memory which has a dependency on arm64_dma_phys_limit.
- * Reserving memory early for crash kernel allows linear creation of block
- * mappings (greater than page-granularity) for all the memory bank rangs.
- * In this scheme a comparatively quicker boot is observed.
- *
- * If ZONE_DMA configs are defined, crash kernel memory reservation
- * is delayed until DMA zone memory range size initialization performed in
- * zone_sizes_init().  The defer is necessary to steer clear of DMA zone
- * memory range to avoid overlap allocation.  So crash kernel memory boundaries
- * are not known when mapping all bank memory ranges, which otherwise means
- * not possible to exclude crash kernel range from creating block mappings
- * so page-granularity mappings are created for the entire memory range.
- * Hence a slightly slower boot is observed.
- *
- * Note: Page-granularity mappings are necessary for crash kernel memory
- * range for shrinking its size via /sys/kernel/kexec_crash_size interface.
+ * Reserving crash kernel memory early allows mapping it with base pages in
+ * the linear map so that it can be protected, without preventing usage of
+ * block mappings for creation of the linear map.
  */
 #if IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)
 phys_addr_t __ro_after_init arm64_dma_phys_limit;
@@ -90,6 +84,8 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
 phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
 #endif
 
+bool __ro_after_init crashkres_protection_possible;
+
 /* Current arm64 boot protocol requires 2MB alignment */
 #define CRASH_ALIGN			SZ_2M
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index c5065abec55a..7b40f38dd3ee 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -502,21 +502,6 @@ void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
-static bool crash_mem_map __initdata;
-
-static int __init enable_crash_mem_map(char *arg)
-{
-	/*
-	 * Proper parameter parsing is done by reserve_crashkernel(). We only
-	 * need to know if the linear map has to avoid block mappings so that
-	 * the crashkernel reservations can be unmapped later.
-	 */
-	crash_mem_map = true;
-
-	return 0;
-}
-early_param("crashkernel", enable_crash_mem_map);
-
 static void __init map_mem(pgd_t *pgdp)
 {
 	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
@@ -547,13 +532,9 @@ static void __init map_mem(pgd_t *pgdp)
 	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
 
 #ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map) {
-		if (defer_reserve_crashkernel())
-			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
-		else if (crashk_res.end)
-			memblock_mark_nomap(crashk_res.start,
-			    resource_size(&crashk_res));
-	}
+	if (crashk_res.end)
+		memblock_mark_nomap(crashk_res.start,
+				    resource_size(&crashk_res));
 #endif
 
 	/* map all the memory banks */
@@ -584,20 +565,17 @@ static void __init map_mem(pgd_t *pgdp)
 	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
 
 	/*
-	 * Use page-level mappings here so that we can shrink the region
-	 * in page granularity and put back unused memory to buddy system
-	 * through /sys/kernel/kexec_crash_size interface.
+	 * Use page-level mappings here so that we can protect crash kernel
+	 * memory to allow post-mortem analysis despite memory errors in
+	 * the main kernel.
 	 */
 #ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map && !defer_reserve_crashkernel()) {
-		if (crashk_res.end) {
-			__map_memblock(pgdp, crashk_res.start,
-				       crashk_res.end + 1,
-				       PAGE_KERNEL,
-				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-			memblock_clear_nomap(crashk_res.start,
-					     resource_size(&crashk_res));
-		}
+	if (crashk_res.end) {
+		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
+			       PAGE_KERNEL, NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
+		memblock_clear_nomap(crashk_res.start,
+				     resource_size(&crashk_res));
+		crashkres_protection_possible = true;
 	}
 #endif
 }
Baoquan He Sept. 30, 2022, 7:04 a.m. UTC | #12
Hi Mike,

On 09/21/22 at 10:45am, Mike Rapoport wrote:
> On Tue, Sep 06, 2022 at 03:05:57PM +0200, Ard Biesheuvel wrote:
> > 
> > While I appreciate the effort that has gone into solving this problem,
> > I don't think there is any consensus that an elaborate fix is required
> > to ensure that the crash kernel can be unmapped from the linear map at
> > all cost. In fact, I personally think we shouldn't bother, and IIRC,
> > Will made a remark along the same lines back when the Huawei engineers
> > were still driving this effort.
> > 
> > So perhaps we could align on that before doing yet another version of this?
> 
> I suggest to start with disabling crash kernel protection when its memory
> reservation is deferred and then Baoquan and kdump folks can take it from
> here.

Thanks for the attempt, really appreciated. We all tried and all see
everybody's effort on this issue.

If disabling protection is chosen, I would suggest disable it at all.
The system w/o having CONFIG_ZONE_DMA|32 is rarely seen, I don't think
we need do the protection for it specifically to make code inconsistent
and could confuse people. We can revert below commit and its later
change do to that.

commit 98d2e1539b84 ("arm64: kdump: protect crash dump kernel memory")

For RPi4, I tried to find one to test and figure out if it can do crash
dumping with buffer above 1G. However, nobody care about kdump when I
asked people around who have one at hand for testing, hobby, or
developping. Since they are not familiar with kdump setting and not so
eager to get to know, and I don't want to take up too much time, finally
I just give up.

So, if solution in this patchset is not accepted, I would like to see
the protection code is reverted. Other opinion, please?

> 
> From 6430407f784f3571da9b4d79340487f2647a44ab Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@linux.ibm.com>
> Date: Wed, 21 Sep 2022 10:14:46 +0300
> Subject: [PATCH] arm64/mm: don't protect crash kernel memory with
>  CONFIG_ZONE_DMA/DMA32
> 
> Currently, in order to allow protection of crash kernel memory when
> CONFIG_ZONE_DMA/DMA32 is enabled, the block mappings in the linear map are
> disabled and the entire linear map uses base size pages.
> 
> This results in performance degradation because of higher TLB pressure for
> kernel memory accesses, so there is a trade off between performance and
> ability to protect the crash kernel memory.
> 
> Baoquan He said [1]:
> 
> 	In fact, panic is a small probability event, and accidental
> 	corruption on kdump kernel data is a much smaller probability
> 	event.
> 
> With this, it makes sense to only protect crash kernel memory only when it
> can be reserved before creation of the linear map.
> 
> Simplify the logic around crash kernel protection in map_mem() so that it
> will use base pages only if crash kernel memory is already reserved and
> introduce crashkres_protection_possible variable to ensure that
> arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres() won't
> try to modify page table if crash kernel is not mapped with base pages.
> 
> [1] https://lore.kernel.org/all/Yw2C9ahluhX4Mg3G@MiWiFi-R3L-srv
> 
> Suggested-by: Will Deacon <will@kernel.org>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/arm64/include/asm/mmu.h      |  1 +
>  arch/arm64/kernel/machine_kexec.c |  6 ++++
>  arch/arm64/mm/init.c              | 30 +++++++++-----------
>  arch/arm64/mm/mmu.c               | 46 ++++++++-----------------------
>  4 files changed, 32 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 48f8466a4be9..975607843548 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>  extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>  extern void mark_linear_text_alias_ro(void);
>  extern bool kaslr_requires_kpti(void);
> +extern bool crashkres_protection_possible;
>  
>  #define INIT_MM_CONTEXT(name)	\
>  	.pgd = init_pg_dir,
> diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> index 19c2d487cb08..68295403aa40 100644
> --- a/arch/arm64/kernel/machine_kexec.c
> +++ b/arch/arm64/kernel/machine_kexec.c
> @@ -272,6 +272,9 @@ void arch_kexec_protect_crashkres(void)
>  {
>  	int i;
>  
> +	if (!crashkres_protection_possible)
> +		return;
> +
>  	for (i = 0; i < kexec_crash_image->nr_segments; i++)
>  		set_memory_valid(
>  			__phys_to_virt(kexec_crash_image->segment[i].mem),
> @@ -282,6 +285,9 @@ void arch_kexec_unprotect_crashkres(void)
>  {
>  	int i;
>  
> +	if (!crashkres_protection_possible)
> +		return;
> +
>  	for (i = 0; i < kexec_crash_image->nr_segments; i++)
>  		set_memory_valid(
>  			__phys_to_virt(kexec_crash_image->segment[i].mem),
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index b9af30be813e..220d45655918 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -62,27 +62,21 @@ EXPORT_SYMBOL(memstart_addr);
>   * In such case, ZONE_DMA32 covers the rest of the 32-bit addressable memory,
>   * otherwise it is empty.
>   *
> - * Memory reservation for crash kernel either done early or deferred
> - * depending on DMA memory zones configs (ZONE_DMA) --
> + * Memory reservation for crash kernel must know the upper limit of low
> + * memory in order to allow DMA access for devices with kdump kernel. When
> + * ZONE_DMA/DMA32 is enabled, this limit is determined after DT/ACPI is
> + * parsed, and crash kernel reservation happens afterwards. In this case,
> + * the crash kernel memory is reserved after linear map is created, there
> + * is no guarantee that crash kernel memory will be mapped with the base
> + * pages in the linear map, and thus the protection if the crash kernel
> + * memory is disabled.
>   *
>   * In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>   * here instead of max_zone_phys().  This lets early reservation of
>   * crash kernel memory which has a dependency on arm64_dma_phys_limit.
> - * Reserving memory early for crash kernel allows linear creation of block
> - * mappings (greater than page-granularity) for all the memory bank rangs.
> - * In this scheme a comparatively quicker boot is observed.
> - *
> - * If ZONE_DMA configs are defined, crash kernel memory reservation
> - * is delayed until DMA zone memory range size initialization performed in
> - * zone_sizes_init().  The defer is necessary to steer clear of DMA zone
> - * memory range to avoid overlap allocation.  So crash kernel memory boundaries
> - * are not known when mapping all bank memory ranges, which otherwise means
> - * not possible to exclude crash kernel range from creating block mappings
> - * so page-granularity mappings are created for the entire memory range.
> - * Hence a slightly slower boot is observed.
> - *
> - * Note: Page-granularity mappings are necessary for crash kernel memory
> - * range for shrinking its size via /sys/kernel/kexec_crash_size interface.
> + * Reserving crash kernel memory early allows mapping it with base pages in
> + * the linear map so that it can be protected, without preventing usage of
> + * block mappings for creation of the linear map.
>   */
>  #if IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)
>  phys_addr_t __ro_after_init arm64_dma_phys_limit;
> @@ -90,6 +84,8 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
>  phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>  #endif
>  
> +bool __ro_after_init crashkres_protection_possible;
> +
>  /* Current arm64 boot protocol requires 2MB alignment */
>  #define CRASH_ALIGN			SZ_2M
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index c5065abec55a..7b40f38dd3ee 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -502,21 +502,6 @@ void __init mark_linear_text_alias_ro(void)
>  			    PAGE_KERNEL_RO);
>  }
>  
> -static bool crash_mem_map __initdata;
> -
> -static int __init enable_crash_mem_map(char *arg)
> -{
> -	/*
> -	 * Proper parameter parsing is done by reserve_crashkernel(). We only
> -	 * need to know if the linear map has to avoid block mappings so that
> -	 * the crashkernel reservations can be unmapped later.
> -	 */
> -	crash_mem_map = true;
> -
> -	return 0;
> -}
> -early_param("crashkernel", enable_crash_mem_map);
> -
>  static void __init map_mem(pgd_t *pgdp)
>  {
>  	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
> @@ -547,13 +532,9 @@ static void __init map_mem(pgd_t *pgdp)
>  	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
>  
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map) {
> -		if (defer_reserve_crashkernel())
> -			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> -		else if (crashk_res.end)
> -			memblock_mark_nomap(crashk_res.start,
> -			    resource_size(&crashk_res));
> -	}
> +	if (crashk_res.end)
> +		memblock_mark_nomap(crashk_res.start,
> +				    resource_size(&crashk_res));
>  #endif
>  
>  	/* map all the memory banks */
> @@ -584,20 +565,17 @@ static void __init map_mem(pgd_t *pgdp)
>  	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
>  
>  	/*
> -	 * Use page-level mappings here so that we can shrink the region
> -	 * in page granularity and put back unused memory to buddy system
> -	 * through /sys/kernel/kexec_crash_size interface.
> +	 * Use page-level mappings here so that we can protect crash kernel
> +	 * memory to allow post-mortem analysis despite memory errors in
> +	 * the main kernel.
>  	 */
>  #ifdef CONFIG_KEXEC_CORE
> -	if (crash_mem_map && !defer_reserve_crashkernel()) {
> -		if (crashk_res.end) {
> -			__map_memblock(pgdp, crashk_res.start,
> -				       crashk_res.end + 1,
> -				       PAGE_KERNEL,
> -				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> -			memblock_clear_nomap(crashk_res.start,
> -					     resource_size(&crashk_res));
> -		}
> +	if (crashk_res.end) {
> +		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> +			       PAGE_KERNEL, NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> +		memblock_clear_nomap(crashk_res.start,
> +				     resource_size(&crashk_res));
> +		crashkres_protection_possible = true;
>  	}
>  #endif
>  }
> -- 
> 2.35.3
> 
> 
> -- 
> Sincerely yours,
> Mike.
>
Baoquan He Sept. 30, 2022, 9:24 a.m. UTC | #13
On 09/30/22 at 03:04pm, Baoquan He wrote:
> Hi Mike,
> 
> On 09/21/22 at 10:45am, Mike Rapoport wrote:
> > On Tue, Sep 06, 2022 at 03:05:57PM +0200, Ard Biesheuvel wrote:
> > > 
> > > While I appreciate the effort that has gone into solving this problem,
> > > I don't think there is any consensus that an elaborate fix is required
> > > to ensure that the crash kernel can be unmapped from the linear map at
> > > all cost. In fact, I personally think we shouldn't bother, and IIRC,
> > > Will made a remark along the same lines back when the Huawei engineers
> > > were still driving this effort.
> > > 
> > > So perhaps we could align on that before doing yet another version of this?
> > 
> > I suggest to start with disabling crash kernel protection when its memory
> > reservation is deferred and then Baoquan and kdump folks can take it from
> > here.
> 
> Thanks for the attempt, really appreciated. We all tried and all see
> everybody's effort on this issue.
> 
> If disabling protection is chosen, I would suggest disable it at all.
> The system w/o having CONFIG_ZONE_DMA|32 is rarely seen, I don't think
> we need do the protection for it specifically to make code inconsistent
> and could confuse people. We can revert below commit and its later
> change do to that.
> 
> commit 98d2e1539b84 ("arm64: kdump: protect crash dump kernel memory")
> 
> For RPi4, I tried to find one to test and figure out if it can do crash
> dumping with buffer above 1G. However, nobody care about kdump when I
> asked people around who have one at hand for testing, hobby, or
> developping. Since they are not familiar with kdump setting and not so
> eager to get to know, and I don't want to take up too much time, finally
> I just give up.

Finally, I got help from Chao who is team lead of our kernel QE. Chao
has a RPi4 and he tested different cases of kdump, now we can confirm
that on RPi4 the TF card and usb disk are all 30bit DMA addressing
limited. Just FYI. And thanks again to Chao for his help and patience.
diff mbox series

Patch

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index b9af30be813e..8ae55afdd11c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -90,10 +90,22 @@  phys_addr_t __ro_after_init arm64_dma_phys_limit;
 phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
 #endif
 
+static phys_addr_t __init crash_addr_low_max(void)
+{
+	phys_addr_t low_mem_mask = U32_MAX;
+	phys_addr_t phys_start = memblock_start_of_DRAM();
+
+	if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
+	     (phys_start > U32_MAX))
+		low_mem_mask = PHYS_ADDR_MAX;
+
+	return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;
+}
+
 /* Current arm64 boot protocol requires 2MB alignment */
 #define CRASH_ALIGN			SZ_2M
 
-#define CRASH_ADDR_LOW_MAX		arm64_dma_phys_limit
+#define CRASH_ADDR_LOW_MAX		crash_addr_low_max()
 #define CRASH_ADDR_HIGH_MAX		(PHYS_MASK + 1)
 
 static int __init reserve_crashkernel_low(unsigned long long low_size)
@@ -389,8 +401,7 @@  void __init arm64_memblock_init(void)
 
 	early_init_fdt_scan_reserved_mem();
 
-	if (!defer_reserve_crashkernel())
-		reserve_crashkernel();
+	reserve_crashkernel();
 
 	high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
 }
@@ -434,13 +445,6 @@  void __init bootmem_init(void)
 	 */
 	dma_contiguous_reserve(arm64_dma_phys_limit);
 
-	/*
-	 * request_standard_resources() depends on crashkernel's memory being
-	 * reserved, so do it here.
-	 */
-	if (defer_reserve_crashkernel())
-		reserve_crashkernel();
-
 	memblock_dump_all();
 }
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e7ad44585f40..cdd338fa2115 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -547,13 +547,12 @@  static void __init map_mem(pgd_t *pgdp)
 	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
 
 #ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map) {
-		if (defer_reserve_crashkernel())
-			flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
-		else if (crashk_res.end)
-			memblock_mark_nomap(crashk_res.start,
-			    resource_size(&crashk_res));
-	}
+	if (crashk_res.end)
+		memblock_mark_nomap(crashk_res.start,
+				    resource_size(&crashk_res));
+	if (crashk_low_res.end)
+		memblock_mark_nomap(crashk_low_res.start,
+				    resource_size(&crashk_low_res));
 #endif
 
 	/* map all the memory banks */
@@ -589,16 +588,23 @@  static void __init map_mem(pgd_t *pgdp)
 	 * through /sys/kernel/kexec_crash_size interface.
 	 */
 #ifdef CONFIG_KEXEC_CORE
-	if (crash_mem_map && !defer_reserve_crashkernel()) {
-		if (crashk_res.end) {
-			__map_memblock(pgdp, crashk_res.start,
-				       crashk_res.end + 1,
-				       PAGE_KERNEL,
-				       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-			memblock_clear_nomap(crashk_res.start,
-					     resource_size(&crashk_res));
-		}
+	if (crashk_res.end) {
+		__map_memblock(pgdp, crashk_res.start,
+			       crashk_res.end + 1,
+			       PAGE_KERNEL,
+			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
+		memblock_clear_nomap(crashk_res.start,
+				     resource_size(&crashk_res));
 	}
+	if (crashk_low_res.end) {
+		__map_memblock(pgdp, crashk_low_res.start,
+			       crashk_low_res.end + 1,
+			       PAGE_KERNEL,
+			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
+		memblock_clear_nomap(crashk_low_res.start,
+				     resource_size(&crashk_low_res));
+	}
+
 #endif
 }