diff mbox series

[2/4] arm64: kdump: support reserving crashkernel above 4G

Message ID 20190507035058.63992-3-chenzhou10@huawei.com (mailing list archive)
State New, archived
Headers show
Series support reserving crashkernel above 4G on arm64 kdump | expand

Commit Message

chenzhou May 7, 2019, 3:50 a.m. UTC
When crashkernel is reserved above 4G in memory, kernel should
reserve some amount of low memory for swiotlb and some DMA buffers.

Meanwhile, support crashkernel=X,[high,low] in arm64. When use
crashkernel=X parameter, try low memory first and fall back to high
memory unless "crashkernel=X,high" is specified.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
---
 arch/arm64/include/asm/kexec.h |  3 +++
 arch/arm64/kernel/setup.c      |  3 +++
 arch/arm64/mm/init.c           | 34 ++++++++++++++++++++++++++++------
 3 files changed, 34 insertions(+), 6 deletions(-)

Comments

James Morse June 5, 2019, 4:29 p.m. UTC | #1
Hello,

On 07/05/2019 04:50, Chen Zhou wrote:
> When crashkernel is reserved above 4G in memory, kernel should
> reserve some amount of low memory for swiotlb and some DMA buffers.

> Meanwhile, support crashkernel=X,[high,low] in arm64. When use
> crashkernel=X parameter, try low memory first and fall back to high
> memory unless "crashkernel=X,high" is specified.

What is the 'unless crashkernel=...,high' for? I think it would be simpler to relax the
ARCH_LOW_ADDRESS_LIMIT if reserve_crashkernel_low() allocated something.

This way "crashkernel=1G" tries to allocate 1G below 4G, but fails if there isn't enough
memory. "crashkernel=1G crashkernel=16M,low" allocates 16M below 4G, which is more likely
to succeed, if it does it can then place the 1G block anywhere.


> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 413d566..82cd9a0 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -243,6 +243,9 @@ static void __init request_standard_resources(void)
>  			request_resource(res, &kernel_data);
>  #ifdef CONFIG_KEXEC_CORE
>  		/* Userspace will find "Crash kernel" region in /proc/iomem. */
> +		if (crashk_low_res.end && crashk_low_res.start >= res->start &&
> +		    crashk_low_res.end <= res->end)
> +			request_resource(res, &crashk_low_res);
>  		if (crashk_res.end && crashk_res.start >= res->start &&
>  		    crashk_res.end <= res->end)
>  			request_resource(res, &crashk_res);

With both crashk_low_res and crashk_res, we end up with two entries in /proc/iomem called
"Crash kernel". Because its sorted by address, and kexec-tools stops searching when it
find "Crash kernel", you are always going to get the kernel placed in the lower portion.

I suspect this isn't what you want, can we rename crashk_low_res for arm64 so that
existing kexec-tools doesn't use it?


> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index d2adffb..3fcd739 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -74,20 +74,37 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  static void __init reserve_crashkernel(void)
>  {
>  	unsigned long long crash_base, crash_size;
> +	bool high = false;
>  	int ret;
>  
>  	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>  				&crash_size, &crash_base);
>  	/* no crashkernel= or invalid value specified */
> -	if (ret || !crash_size)
> -		return;
> +	if (ret || !crash_size) {
> +		/* crashkernel=X,high */
> +		ret = parse_crashkernel_high(boot_command_line,
> +				memblock_phys_mem_size(),
> +				&crash_size, &crash_base);
> +		if (ret || !crash_size)
> +			return;
> +		high = true;
> +	}
>  
>  	crash_size = PAGE_ALIGN(crash_size);
>  
>  	if (crash_base == 0) {
> -		/* Current arm64 boot protocol requires 2MB alignment */
> -		crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
> -				crash_size, SZ_2M);
> +		/*
> +		 * Try low memory first and fall back to high memory
> +		 * unless "crashkernel=size[KMG],high" is specified.
> +		 */
> +		if (!high)
> +			crash_base = memblock_find_in_range(0,
> +					ARCH_LOW_ADDRESS_LIMIT,
> +					crash_size, CRASH_ALIGN);
> +		if (!crash_base)
> +			crash_base = memblock_find_in_range(0,
> +					memblock_end_of_DRAM(),
> +					crash_size, CRASH_ALIGN);
>  		if (crash_base == 0) {
>  			pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
>  				crash_size);
> @@ -105,13 +122,18 @@ static void __init reserve_crashkernel(void)
>  			return;
>  		}
>  
> -		if (!IS_ALIGNED(crash_base, SZ_2M)) {
> +		if (!IS_ALIGNED(crash_base, CRASH_ALIGN)) {
>  			pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n");
>  			return;
>  		}
>  	}
>  	memblock_reserve(crash_base, crash_size);
>  
> +	if (crash_base >= SZ_4G && reserve_crashkernel_low()) {
> +		memblock_free(crash_base, crash_size);
> +		return;

This is going to be annoying on platforms that don't have, and don't need memory below 4G.
A "crashkernel=...,low" on these system will break crashdump. I don't think we should
expect users to know the memory layout. (I'm assuming distro's are going to add a low
reservation everywhere, just in case)

I think the 'low' region should be a small optional/best-effort extra, that kexec-tools
can't touch.


I'm afraid you've missed the ugly bit of the crashkernel reservation...

arch/arm64/mm/mmu.c::map_mem() marks the crashkernel as 'nomap' during the first pass of
page-table generation. This means it isn't mapped in the linear map. It then maps it with
page-size mappings, and removes the nomap flag.

This is done so that arch_kexec_protect_crashkres() and
arch_kexec_unprotect_crashkres() can remove the valid bits of the crashkernel mapping.
This way the old-kernel can't accidentally overwrite the crashkernel. It also saves us if
the old-kernel and the crashkernel use different memory attributes for the mapping.

As your low-memory reservation is intended to be used for devices, having it mapped by the
old-kernel as cacheable memory is going to cause problems if those CPUs aren't taken
offline and go corrupting this memory. (we did crash for a reason after all)


I think the simplest thing to do is mark the low region as 'nomap' in
reserve_crashkernel() and always leave it unmapped. We can then describe it via a
different string in /proc/iomem, something like "Crash kernel (low)". Older kexec-tools
shouldn't use it, (I assume its not using strncmp() in a way that would do this by
accident), and newer kexec-tools can know to describe it in the DT, but it can't write to it.


Thanks,

James
chenzhou June 13, 2019, 11:27 a.m. UTC | #2
Hi James,

On 2019/6/6 0:29, James Morse wrote:
> Hello,
> 
> On 07/05/2019 04:50, Chen Zhou wrote:
>> When crashkernel is reserved above 4G in memory, kernel should
>> reserve some amount of low memory for swiotlb and some DMA buffers.
> 
>> Meanwhile, support crashkernel=X,[high,low] in arm64. When use
>> crashkernel=X parameter, try low memory first and fall back to high
>> memory unless "crashkernel=X,high" is specified.
> 
> What is the 'unless crashkernel=...,high' for? I think it would be simpler to relax the
> ARCH_LOW_ADDRESS_LIMIT if reserve_crashkernel_low() allocated something.
> 
> This way "crashkernel=1G" tries to allocate 1G below 4G, but fails if there isn't enough
> memory. "crashkernel=1G crashkernel=16M,low" allocates 16M below 4G, which is more likely
> to succeed, if it does it can then place the 1G block anywhere.
> 
Yeah, this is much simpler.

> 
>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>> index 413d566..82cd9a0 100644
>> --- a/arch/arm64/kernel/setup.c
>> +++ b/arch/arm64/kernel/setup.c
>> @@ -243,6 +243,9 @@ static void __init request_standard_resources(void)
>>  			request_resource(res, &kernel_data);
>>  #ifdef CONFIG_KEXEC_CORE
>>  		/* Userspace will find "Crash kernel" region in /proc/iomem. */
>> +		if (crashk_low_res.end && crashk_low_res.start >= res->start &&
>> +		    crashk_low_res.end <= res->end)
>> +			request_resource(res, &crashk_low_res);
>>  		if (crashk_res.end && crashk_res.start >= res->start &&
>>  		    crashk_res.end <= res->end)
>>  			request_resource(res, &crashk_res);
> 
> With both crashk_low_res and crashk_res, we end up with two entries in /proc/iomem called
> "Crash kernel". Because its sorted by address, and kexec-tools stops searching when it
> find "Crash kernel", you are always going to get the kernel placed in the lower portion.
> 
> I suspect this isn't what you want, can we rename crashk_low_res for arm64 so that
> existing kexec-tools doesn't use it?
>

In my patchset, in addition to the kernel patches, i also modify the kexec-tools.
  arm64: support more than one crash kernel regions(http://lists.infradead.org/pipermail/kexec/2019-April/022792.html).
In kexec-tools patch, we read all the "Crash kernel" entry and load crash kernel high.

> 
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index d2adffb..3fcd739 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -74,20 +74,37 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
>>  static void __init reserve_crashkernel(void)
>>  {
>>  	unsigned long long crash_base, crash_size;
>> +	bool high = false;
>>  	int ret;
>>  
>>  	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>>  				&crash_size, &crash_base);
>>  	/* no crashkernel= or invalid value specified */
>> -	if (ret || !crash_size)
>> -		return;
>> +	if (ret || !crash_size) {
>> +		/* crashkernel=X,high */
>> +		ret = parse_crashkernel_high(boot_command_line,
>> +				memblock_phys_mem_size(),
>> +				&crash_size, &crash_base);
>> +		if (ret || !crash_size)
>> +			return;
>> +		high = true;
>> +	}
>>  
>>  	crash_size = PAGE_ALIGN(crash_size);
>>  
>>  	if (crash_base == 0) {
>> -		/* Current arm64 boot protocol requires 2MB alignment */
>> -		crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
>> -				crash_size, SZ_2M);
>> +		/*
>> +		 * Try low memory first and fall back to high memory
>> +		 * unless "crashkernel=size[KMG],high" is specified.
>> +		 */
>> +		if (!high)
>> +			crash_base = memblock_find_in_range(0,
>> +					ARCH_LOW_ADDRESS_LIMIT,
>> +					crash_size, CRASH_ALIGN);
>> +		if (!crash_base)
>> +			crash_base = memblock_find_in_range(0,
>> +					memblock_end_of_DRAM(),
>> +					crash_size, CRASH_ALIGN);
>>  		if (crash_base == 0) {
>>  			pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
>>  				crash_size);
>> @@ -105,13 +122,18 @@ static void __init reserve_crashkernel(void)
>>  			return;
>>  		}
>>  
>> -		if (!IS_ALIGNED(crash_base, SZ_2M)) {
>> +		if (!IS_ALIGNED(crash_base, CRASH_ALIGN)) {
>>  			pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n");
>>  			return;
>>  		}
>>  	}
>>  	memblock_reserve(crash_base, crash_size);
>>  
>> +	if (crash_base >= SZ_4G && reserve_crashkernel_low()) {
>> +		memblock_free(crash_base, crash_size);
>> +		return;
> 
> This is going to be annoying on platforms that don't have, and don't need memory below 4G.
> A "crashkernel=...,low" on these system will break crashdump. I don't think we should
> expect users to know the memory layout. (I'm assuming distro's are going to add a low
> reservation everywhere, just in case)
> 
> I think the 'low' region should be a small optional/best-effort extra, that kexec-tools
> can't touch.
> 
> 
> I'm afraid you've missed the ugly bit of the crashkernel reservation...
> 
> arch/arm64/mm/mmu.c::map_mem() marks the crashkernel as 'nomap' during the first pass of
> page-table generation. This means it isn't mapped in the linear map. It then maps it with
> page-size mappings, and removes the nomap flag.
> 
> This is done so that arch_kexec_protect_crashkres() and
> arch_kexec_unprotect_crashkres() can remove the valid bits of the crashkernel mapping.
> This way the old-kernel can't accidentally overwrite the crashkernel. It also saves us if
> the old-kernel and the crashkernel use different memory attributes for the mapping.
> 
> As your low-memory reservation is intended to be used for devices, having it mapped by the
> old-kernel as cacheable memory is going to cause problems if those CPUs aren't taken
> offline and go corrupting this memory. (we did crash for a reason after all)
> 
> 
> I think the simplest thing to do is mark the low region as 'nomap' in
> reserve_crashkernel() and always leave it unmapped. We can then describe it via a
> different string in /proc/iomem, something like "Crash kernel (low)". Older kexec-tools
> shouldn't use it, (I assume its not using strncmp() in a way that would do this by
> accident), and newer kexec-tools can know to describe it in the DT, but it can't write to it.
> 

I did miss the bit of the crashkernel reservation. I will fix this in next version.

> 
> Thanks,
> 
> James
> 
> .
> 

Thanks,
Chen Zhou
James Morse June 13, 2019, 12:44 p.m. UTC | #3
Hi Chen Zhou,

On 13/06/2019 12:27, Chen Zhou wrote:
> On 2019/6/6 0:29, James Morse wrote:
>> On 07/05/2019 04:50, Chen Zhou wrote:
>>> When crashkernel is reserved above 4G in memory, kernel should
>>> reserve some amount of low memory for swiotlb and some DMA buffers.
>>
>>> Meanwhile, support crashkernel=X,[high,low] in arm64. When use
>>> crashkernel=X parameter, try low memory first and fall back to high
>>> memory unless "crashkernel=X,high" is specified.
>>
>> What is the 'unless crashkernel=...,high' for? I think it would be simpler to relax the
>> ARCH_LOW_ADDRESS_LIMIT if reserve_crashkernel_low() allocated something.
>>
>> This way "crashkernel=1G" tries to allocate 1G below 4G, but fails if there isn't enough
>> memory. "crashkernel=1G crashkernel=16M,low" allocates 16M below 4G, which is more likely
>> to succeed, if it does it can then place the 1G block anywhere.
>>
> Yeah, this is much simpler.

>>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>>> index 413d566..82cd9a0 100644
>>> --- a/arch/arm64/kernel/setup.c
>>> +++ b/arch/arm64/kernel/setup.c
>>> @@ -243,6 +243,9 @@ static void __init request_standard_resources(void)
>>>  			request_resource(res, &kernel_data);
>>>  #ifdef CONFIG_KEXEC_CORE
>>>  		/* Userspace will find "Crash kernel" region in /proc/iomem. */
>>> +		if (crashk_low_res.end && crashk_low_res.start >= res->start &&
>>> +		    crashk_low_res.end <= res->end)
>>> +			request_resource(res, &crashk_low_res);
>>>  		if (crashk_res.end && crashk_res.start >= res->start &&
>>>  		    crashk_res.end <= res->end)
>>>  			request_resource(res, &crashk_res);
>>
>> With both crashk_low_res and crashk_res, we end up with two entries in /proc/iomem called
>> "Crash kernel". Because its sorted by address, and kexec-tools stops searching when it
>> find "Crash kernel", you are always going to get the kernel placed in the lower portion.
>>
>> I suspect this isn't what you want, can we rename crashk_low_res for arm64 so that
>> existing kexec-tools doesn't use it?

> In my patchset, in addition to the kernel patches, i also modify the kexec-tools.
>   arm64: support more than one crash kernel regions(http://lists.infradead.org/pipermail/kexec/2019-April/022792.html).
> In kexec-tools patch, we read all the "Crash kernel" entry and load crash kernel high.

But we can't rely on people updating user-space when they update the kernel!

[...]


>> I'm afraid you've missed the ugly bit of the crashkernel reservation...
>>
>> arch/arm64/mm/mmu.c::map_mem() marks the crashkernel as 'nomap' during the first pass of
>> page-table generation. This means it isn't mapped in the linear map. It then maps it with
>> page-size mappings, and removes the nomap flag.
>>
>> This is done so that arch_kexec_protect_crashkres() and
>> arch_kexec_unprotect_crashkres() can remove the valid bits of the crashkernel mapping.
>> This way the old-kernel can't accidentally overwrite the crashkernel. It also saves us if
>> the old-kernel and the crashkernel use different memory attributes for the mapping.
>>
>> As your low-memory reservation is intended to be used for devices, having it mapped by the
>> old-kernel as cacheable memory is going to cause problems if those CPUs aren't taken
>> offline and go corrupting this memory. (we did crash for a reason after all)
>>
>>
>> I think the simplest thing to do is mark the low region as 'nomap' in
>> reserve_crashkernel() and always leave it unmapped. We can then describe it via a
>> different string in /proc/iomem, something like "Crash kernel (low)". Older kexec-tools
>> shouldn't use it, (I assume its not using strncmp() in a way that would do this by
>> accident), and newer kexec-tools can know to describe it in the DT, but it can't write to it.

> I did miss the bit of the crashkernel reservation.
> I will fix this in next version.

I think all that is needed is to make the low-region nmap, and describe it via /proc/iomem
with a name where nothing will try and use it by accident.


Thanks,

James
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 67e4cb7..32949bf 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -28,6 +28,9 @@ 
 
 #define KEXEC_ARCH KEXEC_ARCH_AARCH64
 
+/* 2M alignment for crash kernel regions */
+#define CRASH_ALIGN	SZ_2M
+
 #ifndef __ASSEMBLY__
 
 /**
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 413d566..82cd9a0 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -243,6 +243,9 @@  static void __init request_standard_resources(void)
 			request_resource(res, &kernel_data);
 #ifdef CONFIG_KEXEC_CORE
 		/* Userspace will find "Crash kernel" region in /proc/iomem. */
+		if (crashk_low_res.end && crashk_low_res.start >= res->start &&
+		    crashk_low_res.end <= res->end)
+			request_resource(res, &crashk_low_res);
 		if (crashk_res.end && crashk_res.start >= res->start &&
 		    crashk_res.end <= res->end)
 			request_resource(res, &crashk_res);
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index d2adffb..3fcd739 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -74,20 +74,37 @@  phys_addr_t arm64_dma_phys_limit __ro_after_init;
 static void __init reserve_crashkernel(void)
 {
 	unsigned long long crash_base, crash_size;
+	bool high = false;
 	int ret;
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base);
 	/* no crashkernel= or invalid value specified */
-	if (ret || !crash_size)
-		return;
+	if (ret || !crash_size) {
+		/* crashkernel=X,high */
+		ret = parse_crashkernel_high(boot_command_line,
+				memblock_phys_mem_size(),
+				&crash_size, &crash_base);
+		if (ret || !crash_size)
+			return;
+		high = true;
+	}
 
 	crash_size = PAGE_ALIGN(crash_size);
 
 	if (crash_base == 0) {
-		/* Current arm64 boot protocol requires 2MB alignment */
-		crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
-				crash_size, SZ_2M);
+		/*
+		 * Try low memory first and fall back to high memory
+		 * unless "crashkernel=size[KMG],high" is specified.
+		 */
+		if (!high)
+			crash_base = memblock_find_in_range(0,
+					ARCH_LOW_ADDRESS_LIMIT,
+					crash_size, CRASH_ALIGN);
+		if (!crash_base)
+			crash_base = memblock_find_in_range(0,
+					memblock_end_of_DRAM(),
+					crash_size, CRASH_ALIGN);
 		if (crash_base == 0) {
 			pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
 				crash_size);
@@ -105,13 +122,18 @@  static void __init reserve_crashkernel(void)
 			return;
 		}
 
-		if (!IS_ALIGNED(crash_base, SZ_2M)) {
+		if (!IS_ALIGNED(crash_base, CRASH_ALIGN)) {
 			pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n");
 			return;
 		}
 	}
 	memblock_reserve(crash_base, crash_size);
 
+	if (crash_base >= SZ_4G && reserve_crashkernel_low()) {
+		memblock_free(crash_base, crash_size);
+		return;
+	}
+
 	pr_info("crashkernel reserved: 0x%016llx - 0x%016llx (%lld MB)\n",
 		crash_base, crash_base + crash_size, crash_size >> 20);