Message ID: 20220613080932.663-6-thunder.leizhen@huawei.com (mailing list archive)
State: New, archived
Series: arm64: kdump: Function supplement and performance optimization
Hi, On 06/13/22 at 04:09pm, Zhen Lei wrote: > If the crashkernel has both high memory above DMA zones and low memory > in DMA zones, kexec always loads the content such as Image and dtb to the > high memory instead of the low memory. This means that only high memory > requires write protection based on page-level mapping. The allocation of > high memory does not depend on the DMA boundary. So we can reserve the > high memory first even if the crashkernel reservation is deferred. > > This means that the block mapping can still be performed on other kernel > linear address spaces, the TLB miss rate can be reduced and the system > performance will be improved. Ugh, this looks a little ugly, honestly. If that's for sure arm64 can't split large page mapping of linear region, this patch is one way to optimize linear mapping. Given kdump setting is necessary on arm64 server, the booting speed is truly impacted heavily. However, I would suggest letting it as is with below reasons: 1) The code will complicate the crashkernel reservatoin code which is already difficult to understand. 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32 disabled, the other is crashkernel=,high is specified. While both two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32 enabled, and most of systems have crashkernel=xM which is enough. Having them optimized won't bring benefit to most of systems. 3) Besides, the crashkernel=,high can be handled earlier because arm64 alwasys have memblock.bottom_up == false currently, thus we don't need worry arbout the lower limit of crashkernel,high reservation for now. If memblock.bottom_up is set true in the future, this patch doesn't work any more. ... crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, crash_base, crash_max); So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as is caused by crashkernel reserving, since no regression is brought. And meantime, turning to check if there's any way to make the contiguous linear mapping and later splitting work. The patch 4, 5 in this patchset doesn't make much sense to me, frankly speaking. Thanks Baoquan > > Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> > --- > arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 65 insertions(+), 6 deletions(-) > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state) > unsigned long long crash_max = CRASH_ADDR_LOW_MAX; > char *cmdline = boot_command_line; > int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32); > - int ret; > + int ret, skip_res = 0, skip_low_res = 0; > bool fixed_base; > > if (!IS_ENABLED(CONFIG_KEXEC_CORE)) > return; > > - if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) || > - (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN))) > - return; > + /* > + * In the following table: > + * X,high means crashkernel=X,high > + * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN > + * known means dma_state = DMA_PHYS_LIMIT_KNOWN > + * > + * The first two columns indicate the status, and the last two > + * columns indicate the phase in which crash high or low memory > + * needs to be reserved. 
> + * --------------------------------------------------- > + * | DMA enabled | X,high used | unknown | known | > + * --------------------------------------------------- > + * | N N | low | NOP | > + * | Y N | NOP | low | > + * | N Y | high/low | NOP | > + * | Y Y | high | low | > + * --------------------------------------------------- > + * > + * But in this function, the crash high memory allocation of > + * crashkernel=Y,high and the crash low memory allocation of > + * crashkernel=X[@offset] for crashk_res are mixed at one place. > + * So the table above need to be adjusted as below: > + * --------------------------------------------------- > + * | DMA enabled | X,high used | unknown | known | > + * --------------------------------------------------- > + * | N N | res | NOP | > + * | Y N | NOP | res | > + * | N Y |res/low_res| NOP | > + * | Y Y | res | low_res | > + * --------------------------------------------------- > + * > + */ > > /* crashkernel=X[@offset] */ > ret = parse_crashkernel(cmdline, memblock_phys_mem_size(), > @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state) > else if (ret) > return; > > + /* See the third row of the second table above, NOP */ > + if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) > + return; > + > + /* See the fourth row of the second table above */ > + if (dma_enabled) { > + if (dma_state == DMA_PHYS_LIMIT_UNKNOWN) > + skip_low_res = 1; > + else > + skip_res = 1; > + } > + > crash_max = CRASH_ADDR_HIGH_MAX; > } else if (ret || !crash_size) { > /* The specified value is invalid */ > return; > + } else { > + /* See the 1-2 rows of the second table above, NOP */ > + if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) || > + (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN))) > + return; > + } > + > + if (skip_res) { > + crash_base = crashk_res.start; > + crash_size = crashk_res.end - crashk_res.start + 1; > + goto check_low; > } > > fixed_base = !!crash_base; > @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state) > return; > } > > + crashk_res.start = crash_base; > + crashk_res.end = crash_base + crash_size - 1; > + > +check_low: > + if (skip_low_res) > + return; > + > if ((crash_base >= CRASH_ADDR_LOW_MAX) && > crash_low_size && reserve_crashkernel_low(crash_low_size)) { > memblock_phys_free(crash_base, crash_size); > + crashk_res.start = 0; > + crashk_res.end = 0; > return; > } > > @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state) > if (crashk_low_res.end) > kmemleak_ignore_phys(crashk_low_res.start); > > - crashk_res.start = crash_base; > - crashk_res.end = crash_base + crash_size - 1; > insert_resource(&iomem_resource, &crashk_res); > } > > -- > 2.25.1 >
On 2022/6/21 13:33, Baoquan He wrote: > Hi, > > On 06/13/22 at 04:09pm, Zhen Lei wrote: >> If the crashkernel has both high memory above DMA zones and low memory >> in DMA zones, kexec always loads the content such as Image and dtb to the >> high memory instead of the low memory. This means that only high memory >> requires write protection based on page-level mapping. The allocation of >> high memory does not depend on the DMA boundary. So we can reserve the >> high memory first even if the crashkernel reservation is deferred. >> >> This means that the block mapping can still be performed on other kernel >> linear address spaces, the TLB miss rate can be reduced and the system >> performance will be improved. > Ugh, this looks a little ugly, honestly. > > If that's for sure arm64 can't split large page mapping of linear > region, this patch is one way to optimize linear mapping. Given kdump > setting is necessary on arm64 server, the booting speed is truly > impacted heavily. Is there some conclusion or discussion that arm64 can't split large page mappings? Could the crashkernel reservation (and the KFENCE pool) be split dynamically? I found Mark's reply to "arm64: remove page granularity limitation from KFENCE"[1]: "We also avoid live changes from block<->table mappings, since the architecture gives us very weak guarantees there and generally requires a Break-Before-Make sequence (though IIRC this was tightened up somewhat, so maybe going one way is supposed to work). Unless it's really necessary, I'd rather not split these block mappings while they're live." Hi Mark and Catalin, could you give some comments? Many thanks. [1] https://lore.kernel.org/lkml/20210920101938.GA13863@C02TD0UTHF1T.local/T/#m1a7f974593f5545cbcfc0d21560df4e7926b1381 > > However, I would suggest letting it as is with below reasons: > > 1) The code will complicate the crashkernel reservatoin code which > is already difficult to understand. > 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32 > disabled, the other is crashkernel=,high is specified. While both > two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32 > enabled, and most of systems have crashkernel=xM which is enough. > Having them optimized won't bring benefit to most of systems. > 3) Besides, the crashkernel=,high can be handled earlier because > arm64 alwasys have memblock.bottom_up == false currently, thus we > don't need worry arbout the lower limit of crashkernel,high > reservation for now. If memblock.bottom_up is set true in the future, > this patch doesn't work any more. > > > ... > crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, > crash_base, crash_max); > > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as > is caused by crashkernel reserving, since no regression is brought. > And meantime, turning to check if there's any way to make the contiguous > linear mapping and later splitting work. The patch 4, 5 in this patchset > doesn't make much sense to me, frankly speaking. > > Thanks > Baoquan
On 2022/6/21 13:33, Baoquan He wrote: > Hi, > > On 06/13/22 at 04:09pm, Zhen Lei wrote: >> If the crashkernel has both high memory above DMA zones and low memory >> in DMA zones, kexec always loads the content such as Image and dtb to the >> high memory instead of the low memory. This means that only high memory >> requires write protection based on page-level mapping. The allocation of >> high memory does not depend on the DMA boundary. So we can reserve the >> high memory first even if the crashkernel reservation is deferred. >> >> This means that the block mapping can still be performed on other kernel >> linear address spaces, the TLB miss rate can be reduced and the system >> performance will be improved. > > Ugh, this looks a little ugly, honestly. > > If that's for sure arm64 can't split large page mapping of linear > region, this patch is one way to optimize linear mapping. Given kdump > setting is necessary on arm64 server, the booting speed is truly > impacted heavily. There is also a performance impact when running. > > However, I would suggest letting it as is with below reasons: > > 1) The code will complicate the crashkernel reservatoin code which > is already difficult to understand. Yeah, I feel it, too. > 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32 > disabled, the other is crashkernel=,high is specified. While both > two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32 > enabled, and most of systems have crashkernel=xM which is enough. > Having them optimized won't bring benefit to most of systems. The case of CONFIG_ZONE_DMA|DMA32 disabled have been resolved by commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones"). Currently the performance problem to be optimized is that DMA is enabled. > 3) Besides, the crashkernel=,high can be handled earlier because > arm64 alwasys have memblock.bottom_up == false currently, thus we > don't need worry arbout the lower limit of crashkernel,high > reservation for now. If memblock.bottom_up is set true in the future, > this patch doesn't work any more. > > > ... > crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, > crash_base, crash_max); > > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as > is caused by crashkernel reserving, since no regression is brought. > And meantime, turning to check if there's any way to make the contiguous > linear mapping and later splitting work. The patch 4, 5 in this patchset > doesn't make much sense to me, frankly speaking. OK. As discussed earlier, I can rethink if there is a better way to patch 4-5, and this time focus on patch 1-2. In this way, all the functions are complete, and only optimization is left. 
> > Thanks > Baoquan > >> >> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> >> --- >> arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++---- >> 1 file changed, 65 insertions(+), 6 deletions(-) >> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c >> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644 >> --- a/arch/arm64/mm/init.c >> +++ b/arch/arm64/mm/init.c >> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state) >> unsigned long long crash_max = CRASH_ADDR_LOW_MAX; >> char *cmdline = boot_command_line; >> int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32); >> - int ret; >> + int ret, skip_res = 0, skip_low_res = 0; >> bool fixed_base; >> >> if (!IS_ENABLED(CONFIG_KEXEC_CORE)) >> return; >> >> - if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) || >> - (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN))) >> - return; >> + /* >> + * In the following table: >> + * X,high means crashkernel=X,high >> + * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN >> + * known means dma_state = DMA_PHYS_LIMIT_KNOWN >> + * >> + * The first two columns indicate the status, and the last two >> + * columns indicate the phase in which crash high or low memory >> + * needs to be reserved. >> + * --------------------------------------------------- >> + * | DMA enabled | X,high used | unknown | known | >> + * --------------------------------------------------- >> + * | N N | low | NOP | >> + * | Y N | NOP | low | >> + * | N Y | high/low | NOP | >> + * | Y Y | high | low | >> + * --------------------------------------------------- >> + * >> + * But in this function, the crash high memory allocation of >> + * crashkernel=Y,high and the crash low memory allocation of >> + * crashkernel=X[@offset] for crashk_res are mixed at one place. 
>> + * So the table above need to be adjusted as below: >> + * --------------------------------------------------- >> + * | DMA enabled | X,high used | unknown | known | >> + * --------------------------------------------------- >> + * | N N | res | NOP | >> + * | Y N | NOP | res | >> + * | N Y |res/low_res| NOP | >> + * | Y Y | res | low_res | >> + * --------------------------------------------------- >> + * >> + */ >> >> /* crashkernel=X[@offset] */ >> ret = parse_crashkernel(cmdline, memblock_phys_mem_size(), >> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state) >> else if (ret) >> return; >> >> + /* See the third row of the second table above, NOP */ >> + if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) >> + return; >> + >> + /* See the fourth row of the second table above */ >> + if (dma_enabled) { >> + if (dma_state == DMA_PHYS_LIMIT_UNKNOWN) >> + skip_low_res = 1; >> + else >> + skip_res = 1; >> + } >> + >> crash_max = CRASH_ADDR_HIGH_MAX; >> } else if (ret || !crash_size) { >> /* The specified value is invalid */ >> return; >> + } else { >> + /* See the 1-2 rows of the second table above, NOP */ >> + if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) || >> + (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN))) >> + return; >> + } >> + >> + if (skip_res) { >> + crash_base = crashk_res.start; >> + crash_size = crashk_res.end - crashk_res.start + 1; >> + goto check_low; >> } >> >> fixed_base = !!crash_base; >> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state) >> return; >> } >> >> + crashk_res.start = crash_base; >> + crashk_res.end = crash_base + crash_size - 1; >> + >> +check_low: >> + if (skip_low_res) >> + return; >> + >> if ((crash_base >= CRASH_ADDR_LOW_MAX) && >> crash_low_size && reserve_crashkernel_low(crash_low_size)) { >> memblock_phys_free(crash_base, crash_size); >> + crashk_res.start = 0; >> + crashk_res.end = 0; >> return; >> } >> >> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state) >> if (crashk_low_res.end) >> kmemleak_ignore_phys(crashk_low_res.start); >> >> - crashk_res.start = crash_base; >> - crashk_res.end = crash_base + crash_size - 1; >> insert_resource(&iomem_resource, &crashk_res); >> } >> >> -- >> 2.25.1 >> > > . >
On 06/21/22 at 02:24pm, Kefeng Wang wrote: > > On 2022/6/21 13:33, Baoquan He wrote: > > Hi, > > > > On 06/13/22 at 04:09pm, Zhen Lei wrote: > > > If the crashkernel has both high memory above DMA zones and low memory > > > in DMA zones, kexec always loads the content such as Image and dtb to the > > > high memory instead of the low memory. This means that only high memory > > > requires write protection based on page-level mapping. The allocation of > > > high memory does not depend on the DMA boundary. So we can reserve the > > > high memory first even if the crashkernel reservation is deferred. > > > > > > This means that the block mapping can still be performed on other kernel > > > linear address spaces, the TLB miss rate can be reduced and the system > > > performance will be improved. > > Ugh, this looks a little ugly, honestly. > > > > If that's for sure arm64 can't split large page mapping of linear > > region, this patch is one way to optimize linear mapping. Given kdump > > setting is necessary on arm64 server, the booting speed is truly > > impacted heavily. > > Is there some conclusion or discussion that arm64 can't split large page > mapping? Yes, please see below commit log. commit d27cfa1fc823 ("arm64: mm: set the contiguous bit for kernel mappings where appropriate") > > Could the crashkernel reservation (and Kfence pool) be splited dynamically? For crashkernel region, we have arch_kexec_protect_crashkres() to secure the region, and crash_shrink_memory() could be called to shrink it. While crahshkernel region could be crossig part of a block mapping or section mapping and the mapping need be splitted, that will cause TLB conflicts. > > I found Mark replay "arm64: remove page granularity limitation from > KFENCE"[1], > > "We also avoid live changes from block<->table mappings, since the > archtitecture gives us very weak guarantees there and generally requires > a Break-Before-Make sequence (though IIRC this was tightened up > somewhat, so maybe going one way is supposed to work). Unless it's > really necessary, I'd rather not split these block mappings while > they're live." > > Hi Mark and Catalin, could you give some comment, many thanks. > > [1] https://lore.kernel.org/lkml/20210920101938.GA13863@C02TD0UTHF1T.local/T/#m1a7f974593f5545cbcfc0d21560df4e7926b1381 > > > > > > However, I would suggest letting it as is with below reasons: > > > > 1) The code will complicate the crashkernel reservatoin code which > > is already difficult to understand. > > 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32 > > disabled, the other is crashkernel=,high is specified. While both > > two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32 > > enabled, and most of systems have crashkernel=xM which is enough. > > Having them optimized won't bring benefit to most of systems. > > 3) Besides, the crashkernel=,high can be handled earlier because > > arm64 alwasys have memblock.bottom_up == false currently, thus we > > don't need worry arbout the lower limit of crashkernel,high > > reservation for now. If memblock.bottom_up is set true in the future, > > this patch doesn't work any more. > > > > > > ... > > crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, > > crash_base, crash_max); > > > > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as > > is caused by crashkernel reserving, since no regression is brought. 
> > And meantime, turning to check if there's any way to make the contiguous > > linear mapping and later splitting work. The patch 4, 5 in this patchset > > doesn't make much sense to me, frankly speaking. > > > > Thanks > > Baoquan >
On 06/21/22 at 03:56pm, Leizhen (ThunderTown) wrote: > > > On 2022/6/21 13:33, Baoquan He wrote: > > Hi, > > > > On 06/13/22 at 04:09pm, Zhen Lei wrote: > >> If the crashkernel has both high memory above DMA zones and low memory > >> in DMA zones, kexec always loads the content such as Image and dtb to the > >> high memory instead of the low memory. This means that only high memory > >> requires write protection based on page-level mapping. The allocation of > >> high memory does not depend on the DMA boundary. So we can reserve the > >> high memory first even if the crashkernel reservation is deferred. > >> > >> This means that the block mapping can still be performed on other kernel > >> linear address spaces, the TLB miss rate can be reduced and the system > >> performance will be improved. > > > > Ugh, this looks a little ugly, honestly. > > > > If that's for sure arm64 can't split large page mapping of linear > > region, this patch is one way to optimize linear mapping. Given kdump > > setting is necessary on arm64 server, the booting speed is truly > > impacted heavily. > > There is also a performance impact when running. Yes, indeed, the TLB flush will happen more often. > > > > > However, I would suggest letting it as is with below reasons: > > > > 1) The code will complicate the crashkernel reservatoin code which > > is already difficult to understand. > > Yeah, I feel it, too. > > > 2) It can only optimize the two cases, first is CONFIG_ZONE_DMA|DMA32 > > disabled, the other is crashkernel=,high is specified. While both > > two cases are corner case, most of systems have CONFIG_ZONE_DMA|DMA32 > > enabled, and most of systems have crashkernel=xM which is enough. > > Having them optimized won't bring benefit to most of systems. > > The case of CONFIG_ZONE_DMA|DMA32 disabled have been resolved by > commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones"). > Currently the performance problem to be optimized is that DMA is enabled. Yes, the disabled CONFIG_ZONE_DMA|DMA32 case has avoided the problem since its boundary is decided already at that time. Crashkenrel=,high can slso avoid this benefitting from the top done memblock allocating. However, the crashkerne=xM which now gets the fallback support is the main syntax we will use, that still has the problem. > > > > 3) Besides, the crashkernel=,high can be handled earlier because > > arm64 alwasys have memblock.bottom_up == false currently, thus we > > don't need worry arbout the lower limit of crashkernel,high > > reservation for now. If memblock.bottom_up is set true in the future, > > this patch doesn't work any more. > > > > > > ... > > crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, > > crash_base, crash_max); > > > > So, in my opinion, we can leave the current NON_BLOCK|SECT mapping as > > is caused by crashkernel reserving, since no regression is brought. > > And meantime, turning to check if there's any way to make the contiguous > > linear mapping and later splitting work. The patch 4, 5 in this patchset > > doesn't make much sense to me, frankly speaking. > > OK. As discussed earlier, I can rethink if there is a better way to patch 4-5, > and this time focus on patch 1-2. In this way, all the functions are complete, > and only optimization is left. Sounds nice, thx. 
> > > >> > >> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> > >> --- > >> arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++---- > >> 1 file changed, 65 insertions(+), 6 deletions(-) > >> > >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > >> index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644 > >> --- a/arch/arm64/mm/init.c > >> +++ b/arch/arm64/mm/init.c > >> @@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state) > >> unsigned long long crash_max = CRASH_ADDR_LOW_MAX; > >> char *cmdline = boot_command_line; > >> int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32); > >> - int ret; > >> + int ret, skip_res = 0, skip_low_res = 0; > >> bool fixed_base; > >> > >> if (!IS_ENABLED(CONFIG_KEXEC_CORE)) > >> return; > >> > >> - if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) || > >> - (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN))) > >> - return; > >> + /* > >> + * In the following table: > >> + * X,high means crashkernel=X,high > >> + * unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN > >> + * known means dma_state = DMA_PHYS_LIMIT_KNOWN > >> + * > >> + * The first two columns indicate the status, and the last two > >> + * columns indicate the phase in which crash high or low memory > >> + * needs to be reserved. > >> + * --------------------------------------------------- > >> + * | DMA enabled | X,high used | unknown | known | > >> + * --------------------------------------------------- > >> + * | N N | low | NOP | > >> + * | Y N | NOP | low | > >> + * | N Y | high/low | NOP | > >> + * | Y Y | high | low | > >> + * --------------------------------------------------- > >> + * > >> + * But in this function, the crash high memory allocation of > >> + * crashkernel=Y,high and the crash low memory allocation of > >> + * crashkernel=X[@offset] for crashk_res are mixed at one place. 
> >> + * So the table above need to be adjusted as below: > >> + * --------------------------------------------------- > >> + * | DMA enabled | X,high used | unknown | known | > >> + * --------------------------------------------------- > >> + * | N N | res | NOP | > >> + * | Y N | NOP | res | > >> + * | N Y |res/low_res| NOP | > >> + * | Y Y | res | low_res | > >> + * --------------------------------------------------- > >> + * > >> + */ > >> > >> /* crashkernel=X[@offset] */ > >> ret = parse_crashkernel(cmdline, memblock_phys_mem_size(), > >> @@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state) > >> else if (ret) > >> return; > >> > >> + /* See the third row of the second table above, NOP */ > >> + if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) > >> + return; > >> + > >> + /* See the fourth row of the second table above */ > >> + if (dma_enabled) { > >> + if (dma_state == DMA_PHYS_LIMIT_UNKNOWN) > >> + skip_low_res = 1; > >> + else > >> + skip_res = 1; > >> + } > >> + > >> crash_max = CRASH_ADDR_HIGH_MAX; > >> } else if (ret || !crash_size) { > >> /* The specified value is invalid */ > >> return; > >> + } else { > >> + /* See the 1-2 rows of the second table above, NOP */ > >> + if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) || > >> + (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN))) > >> + return; > >> + } > >> + > >> + if (skip_res) { > >> + crash_base = crashk_res.start; > >> + crash_size = crashk_res.end - crashk_res.start + 1; > >> + goto check_low; > >> } > >> > >> fixed_base = !!crash_base; > >> @@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state) > >> return; > >> } > >> > >> + crashk_res.start = crash_base; > >> + crashk_res.end = crash_base + crash_size - 1; > >> + > >> +check_low: > >> + if (skip_low_res) > >> + return; > >> + > >> if ((crash_base >= CRASH_ADDR_LOW_MAX) && > >> crash_low_size && reserve_crashkernel_low(crash_low_size)) { > >> memblock_phys_free(crash_base, crash_size); > >> + crashk_res.start = 0; > >> + crashk_res.end = 0; > >> return; > >> } > >> > >> @@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state) > >> if (crashk_low_res.end) > >> kmemleak_ignore_phys(crashk_low_res.start); > >> > >> - crashk_res.start = crash_base; > >> - crashk_res.end = crash_base + crash_size - 1; > >> insert_resource(&iomem_resource, &crashk_res); > >> } > >> > >> -- > >> 2.25.1 > >> > > > > . > > > > -- > Regards, > Zhen Lei >
On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote: > On 2022/6/21 13:33, Baoquan He wrote: > > On 06/13/22 at 04:09pm, Zhen Lei wrote: > > > If the crashkernel has both high memory above DMA zones and low memory > > > in DMA zones, kexec always loads the content such as Image and dtb to the > > > high memory instead of the low memory. This means that only high memory > > > requires write protection based on page-level mapping. The allocation of > > > high memory does not depend on the DMA boundary. So we can reserve the > > > high memory first even if the crashkernel reservation is deferred. > > > > > > This means that the block mapping can still be performed on other kernel > > > linear address spaces, the TLB miss rate can be reduced and the system > > > performance will be improved. > > > > Ugh, this looks a little ugly, honestly. > > > > If that's for sure arm64 can't split large page mapping of linear > > region, this patch is one way to optimize linear mapping. Given kdump > > setting is necessary on arm64 server, the booting speed is truly > > impacted heavily. > > Is there some conclusion or discussion that arm64 can't split large page > mapping? > > Could the crashkernel reservation (and Kfence pool) be splited dynamically? > > I found Mark replay "arm64: remove page granularity limitation from > KFENCE"[1], > > "We also avoid live changes from block<->table mappings, since the > archtitecture gives us very weak guarantees there and generally requires > a Break-Before-Make sequence (though IIRC this was tightened up > somewhat, so maybe going one way is supposed to work). Unless it's > really necessary, I'd rather not split these block mappings while > they're live." The problem with splitting is that you can end up with two entries in the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict abort (but can be worse like loss of coherency). Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at all, the software would have to unmap the range, TLBI, remap. With FEAT_BBM (level 2), we can do this without tearing the mapping down but we still need to handle the potential TLB conflict abort. The handler only needs a TLBI but if it touches the memory range being changed it risks faulting again. With vmap stacks and the kernel image mapped in the vmalloc space, we have a small window where this could be handled but we probably can't go into the C part of the exception handling (tracing etc. may access a kmalloc'ed object for example). Another option is to do a stop_machine() (if multi-processor at that point), disable the MMUs, modify the page tables, re-enable the MMU but it's also complicated.
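For reference, the pre-FEAT_BBM sequence described above ("unmap the range, TLBI, remap") corresponds roughly to the sketch below, which splits one 2MB block of the linear map into base pages. This is only an illustration under stated assumptions, not mainline code: the helper name is made up, PAGE_KERNEL is assumed for the attributes, and nothing may touch the affected range (including the stack and text of the code doing the split) during the window in which the mapping is gone, which is exactly the difficulty discussed here.

#include <linux/mm.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/* Illustrative only: split one PMD block of the linear map into pages. */
static void split_linear_map_pmd(pmd_t *pmdp, unsigned long addr)
{
	unsigned long pfn = pmd_pfn(*pmdp);
	pte_t *ptep = pte_alloc_one_kernel(&init_mm);
	int i;

	if (!ptep)
		return;

	/* Break: remove the block entry and invalidate cached translations. */
	pmd_clear(pmdp);
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);

	/*
	 * [addr, addr + PMD_SIZE) is now unmapped; any access to it here
	 * (data, stack or text) would fault, hence the reluctance to do
	 * this on live linear-map regions without FEAT_BBM.
	 */

	/* Make: fill the new PTE level, then install the table entry. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		set_pte(ptep + i, pfn_pte(pfn + i, PAGE_KERNEL));
	__pmd_populate(pmdp, __pa(ptep), PMD_TYPE_TABLE);
}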
Hi Catalin, On 06/21/22 at 07:04pm, Catalin Marinas wrote: > On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote: > > On 2022/6/21 13:33, Baoquan He wrote: > > > On 06/13/22 at 04:09pm, Zhen Lei wrote: > > > > If the crashkernel has both high memory above DMA zones and low memory > > > > in DMA zones, kexec always loads the content such as Image and dtb to the > > > > high memory instead of the low memory. This means that only high memory > > > > requires write protection based on page-level mapping. The allocation of > > > > high memory does not depend on the DMA boundary. So we can reserve the > > > > high memory first even if the crashkernel reservation is deferred. > > > > > > > > This means that the block mapping can still be performed on other kernel > > > > linear address spaces, the TLB miss rate can be reduced and the system > > > > performance will be improved. > > > > > > Ugh, this looks a little ugly, honestly. > > > > > > If that's for sure arm64 can't split large page mapping of linear > > > region, this patch is one way to optimize linear mapping. Given kdump > > > setting is necessary on arm64 server, the booting speed is truly > > > impacted heavily. > > > > Is there some conclusion or discussion that arm64 can't split large page > > mapping? > > > > Could the crashkernel reservation (and Kfence pool) be splited dynamically? > > > > I found Mark replay "arm64: remove page granularity limitation from > > KFENCE"[1], > > > > "We also avoid live changes from block<->table mappings, since the > > archtitecture gives us very weak guarantees there and generally requires > > a Break-Before-Make sequence (though IIRC this was tightened up > > somewhat, so maybe going one way is supposed to work). Unless it's > > really necessary, I'd rather not split these block mappings while > > they're live." > > The problem with splitting is that you can end up with two entries in > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict > abort (but can be worse like loss of coherency). Thanks for this explanation. Is this a drawback of arm64 design? X86 code do the same thing w/o issue, is there way to overcome this on arm64 from hardware or software side? I ever got a arm64 server with huge memory, w or w/o crashkernel setting have different bootup time. And the more often TLB miss and flush will cause performance cost. It is really a pity if we have very powerful arm64 cpu and system capacity, but bottlenecked by this drawback. > > Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at > all, the software would have to unmap the range, TLBI, remap. With > FEAT_BBM (level 2), we can do this without tearing the mapping down but > we still need to handle the potential TLB conflict abort. The handler > only needs a TLBI but if it touches the memory range being changed it > risks faulting again. With vmap stacks and the kernel image mapped in > the vmalloc space, we have a small window where this could be handled > but we probably can't go into the C part of the exception handling > (tracing etc. may access a kmalloc'ed object for example). > > Another option is to do a stop_machine() (if multi-processor at that > point), disable the MMUs, modify the page tables, re-enable the MMU but > it's also complicated. > > -- > Catalin >
On 2022/6/22 2:04, Catalin Marinas wrote: > On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote: >> On 2022/6/21 13:33, Baoquan He wrote: >>> On 06/13/22 at 04:09pm, Zhen Lei wrote: >>>> If the crashkernel has both high memory above DMA zones and low memory >>>> in DMA zones, kexec always loads the content such as Image and dtb to the >>>> high memory instead of the low memory. This means that only high memory >>>> requires write protection based on page-level mapping. The allocation of >>>> high memory does not depend on the DMA boundary. So we can reserve the >>>> high memory first even if the crashkernel reservation is deferred. >>>> >>>> This means that the block mapping can still be performed on other kernel >>>> linear address spaces, the TLB miss rate can be reduced and the system >>>> performance will be improved. >>> Ugh, this looks a little ugly, honestly. >>> >>> If that's for sure arm64 can't split large page mapping of linear >>> region, this patch is one way to optimize linear mapping. Given kdump >>> setting is necessary on arm64 server, the booting speed is truly >>> impacted heavily. >> Is there some conclusion or discussion that arm64 can't split large page >> mapping? >> >> Could the crashkernel reservation (and Kfence pool) be splited dynamically? >> >> I found Mark replay "arm64: remove page granularity limitation from >> KFENCE"[1], >> >> "We also avoid live changes from block<->table mappings, since the >> archtitecture gives us very weak guarantees there and generally requires >> a Break-Before-Make sequence (though IIRC this was tightened up >> somewhat, so maybe going one way is supposed to work). Unless it's >> really necessary, I'd rather not split these block mappings while >> they're live." > The problem with splitting is that you can end up with two entries in > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict > abort (but can be worse like loss of coherency). Thanks for your explanation, > Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at > all, the software would have to unmap the range, TLBI, remap. With > FEAT_BBM (level 2), we can do this without tearing the mapping down but > we still need to handle the potential TLB conflict abort. The handler > only needs a TLBI but if it touches the memory range being changed it > risks faulting again. With vmap stacks and the kernel image mapped in > the vmalloc space, we have a small window where this could be handled > but we probably can't go into the C part of the exception handling > (tracing etc. may access a kmalloc'ed object for example). So if without FEAT_BBM,we can only guarantee BBM sequence via "unmap the range, TLBI, remap" or the following option, and with FEAT_BBM (level 2), we could have easy way to avoid TLB conflict for some vmalloc space, but still hard to deal with other scence? > > Another option is to do a stop_machine() (if multi-processor at that > point), disable the MMUs, modify the page tables, re-enable the MMU but > it's also complicated. >
On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote: > On 2022/6/22 2:04, Catalin Marinas wrote: > > On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote: > > > On 2022/6/21 13:33, Baoquan He wrote: > > > > On 06/13/22 at 04:09pm, Zhen Lei wrote: > > > > > If the crashkernel has both high memory above DMA zones and low memory > > > > > in DMA zones, kexec always loads the content such as Image and dtb to the > > > > > high memory instead of the low memory. This means that only high memory > > > > > requires write protection based on page-level mapping. The allocation of > > > > > high memory does not depend on the DMA boundary. So we can reserve the > > > > > high memory first even if the crashkernel reservation is deferred. > > > > > > > > > > This means that the block mapping can still be performed on other kernel > > > > > linear address spaces, the TLB miss rate can be reduced and the system > > > > > performance will be improved. > > > > Ugh, this looks a little ugly, honestly. > > > > > > > > If that's for sure arm64 can't split large page mapping of linear > > > > region, this patch is one way to optimize linear mapping. Given kdump > > > > setting is necessary on arm64 server, the booting speed is truly > > > > impacted heavily. > > > Is there some conclusion or discussion that arm64 can't split large page > > > mapping? > > > > > > Could the crashkernel reservation (and Kfence pool) be splited dynamically? > > > > > > I found Mark replay "arm64: remove page granularity limitation from > > > KFENCE"[1], > > > > > > "We also avoid live changes from block<->table mappings, since the > > > archtitecture gives us very weak guarantees there and generally requires > > > a Break-Before-Make sequence (though IIRC this was tightened up > > > somewhat, so maybe going one way is supposed to work). Unless it's > > > really necessary, I'd rather not split these block mappings while > > > they're live." > > The problem with splitting is that you can end up with two entries in > > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another > > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict > > abort (but can be worse like loss of coherency). > Thanks for your explanation, > > Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at > > all, the software would have to unmap the range, TLBI, remap. With > > FEAT_BBM (level 2), we can do this without tearing the mapping down but > > we still need to handle the potential TLB conflict abort. The handler > > only needs a TLBI but if it touches the memory range being changed it > > risks faulting again. With vmap stacks and the kernel image mapped in > > the vmalloc space, we have a small window where this could be handled > > but we probably can't go into the C part of the exception handling > > (tracing etc. may access a kmalloc'ed object for example). > > So if without FEAT_BBM,we can only guarantee BBM sequence via > "unmap the range, TLBI, remap" or the following option, Yes, that's the break-before-make sequence. > and with FEAT_BBM (level 2), we could have easy way to avoid TLB > conflict for some vmalloc space, but still hard to deal with other > scence? It's not too hard in theory. Basically there's a small risk of getting a TLB conflict abort for the mappings you change without a BBM sequence (I think it's nearly non-existed when going from large block to smaller pages, though the architecture states that it's still possible). 
Since we only want to do this for the linear map, and the kernel and stack are in the vmalloc space, we can handle such a trap as a safety measure (it just needs a TLBI). It may help to tweak a model to force it to generate such conflict aborts, otherwise we'd not be able to test the code. It's possible that such a trap is raised at EL2 if a guest caused the conflict abort (the architecture left this as IMP DEF). The hypervisors may need to be taught to do a TLBI VMALLS12E1 instead of killing the guest. I haven't checked what KVM does.
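The "handler only needs a TLBI" idea could look roughly like the sketch below. It is hypothetical: arm64 has no such fixup today, the function name and its wiring into the fault handling tables are invented for illustration, and, as noted above, the handler must not itself touch the range whose mapping is being changed.

#include <asm/ptrace.h>
#include <asm/tlbflush.h>

/* Hypothetical fixup for a TLB conflict abort taken on the linear map. */
static int do_tlb_conflict_abort(unsigned long far, unsigned long esr,
				 struct pt_regs *regs)
{
	/*
	 * Drop the conflicting translations cached on this CPU; the retried
	 * access then walks the now-consistent page tables again.
	 */
	local_flush_tlb_all();
	return 0;	/* handled, re-execute the faulting instruction */
}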
On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote: > On 06/21/22 at 07:04pm, Catalin Marinas wrote: > > The problem with splitting is that you can end up with two entries in > > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another > > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict > > abort (but can be worse like loss of coherency). > > Thanks for this explanation. Is this a drawback of arm64 design? X86 > code do the same thing w/o issue, is there way to overcome this on > arm64 from hardware or software side? It is a drawback of the arm64 implementations. Having multiple TLB entries for the same VA would need additional logic in hardware to detect, so the microarchitects have pushed back. In ARMv8.4, some balance was reached with FEAT_BBM so that the only visible side-effect is a potential TLB conflict abort that could be resolved by software. > I ever got a arm64 server with huge memory, w or w/o crashkernel setting > have different bootup time. And the more often TLB miss and flush will > cause performance cost. It is really a pity if we have very powerful > arm64 cpu and system capacity, but bottlenecked by this drawback. Is only the boot time affected, or the runtime performance as well?
On 2022/6/23 18:27, Catalin Marinas wrote: > On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote: >> On 2022/6/22 2:04, Catalin Marinas wrote: >>> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote: >>>> On 2022/6/21 13:33, Baoquan He wrote: >>>>> On 06/13/22 at 04:09pm, Zhen Lei wrote: >>>>>> If the crashkernel has both high memory above DMA zones and low memory >>>>>> in DMA zones, kexec always loads the content such as Image and dtb to the >>>>>> high memory instead of the low memory. This means that only high memory >>>>>> requires write protection based on page-level mapping. The allocation of >>>>>> high memory does not depend on the DMA boundary. So we can reserve the >>>>>> high memory first even if the crashkernel reservation is deferred. >>>>>> >>>>>> This means that the block mapping can still be performed on other kernel >>>>>> linear address spaces, the TLB miss rate can be reduced and the system >>>>>> performance will be improved. >>>>> Ugh, this looks a little ugly, honestly. >>>>> >>>>> If that's for sure arm64 can't split large page mapping of linear >>>>> region, this patch is one way to optimize linear mapping. Given kdump >>>>> setting is necessary on arm64 server, the booting speed is truly >>>>> impacted heavily. >>>> Is there some conclusion or discussion that arm64 can't split large page >>>> mapping? >>>> >>>> Could the crashkernel reservation (and Kfence pool) be splited dynamically? >>>> >>>> I found Mark replay "arm64: remove page granularity limitation from >>>> KFENCE"[1], >>>> >>>> "We also avoid live changes from block<->table mappings, since the >>>> archtitecture gives us very weak guarantees there and generally requires >>>> a Break-Before-Make sequence (though IIRC this was tightened up >>>> somewhat, so maybe going one way is supposed to work). Unless it's >>>> really necessary, I'd rather not split these block mappings while >>>> they're live." >>> The problem with splitting is that you can end up with two entries in >>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another >>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict >>> abort (but can be worse like loss of coherency). >> Thanks for your explanation, >>> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at >>> all, the software would have to unmap the range, TLBI, remap. With >>> FEAT_BBM (level 2), we can do this without tearing the mapping down but >>> we still need to handle the potential TLB conflict abort. The handler >>> only needs a TLBI but if it touches the memory range being changed it >>> risks faulting again. With vmap stacks and the kernel image mapped in >>> the vmalloc space, we have a small window where this could be handled >>> but we probably can't go into the C part of the exception handling >>> (tracing etc. may access a kmalloc'ed object for example). >> So if without FEAT_BBM,we can only guarantee BBM sequence via >> "unmap the range, TLBI, remap" or the following option, > Yes, that's the break-before-make sequence. > >> and with FEAT_BBM (level 2), we could have easy way to avoid TLB >> conflict for some vmalloc space, but still hard to deal with other >> scence? > It's not too hard in theory. Basically there's a small risk of getting a > TLB conflict abort for the mappings you change without a BBM sequence (I > think it's nearly non-existed when going from large block to smaller > pages, though the architecture states that it's still possible). 
Since > we only want to do this for the linear map and the kernel and stack are > in the vmalloc space, we can handle such trap as an safety measure (it > just needs a TLBI). It may help to tweak a model to force it to generate > such conflict aborts, otherwise we'd not be able to test the code. > > It's possible that such trap is raised at EL2 if a guest caused the > conflict abort (the architecture left this as IMP DEF). The hypervisors > may need to be taught to do a TLBI VMALLS12E1 instead of killing the > guest. I haven't checked what KVM does. Got it, many thanks.
On 06/23/22 at 03:07pm, Catalin Marinas wrote: > On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote: > > On 06/21/22 at 07:04pm, Catalin Marinas wrote: > > > The problem with splitting is that you can end up with two entries in > > > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another > > > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict > > > abort (but can be worse like loss of coherency). > > > > Thanks for this explanation. Is this a drawback of arm64 design? X86 > > code do the same thing w/o issue, is there way to overcome this on > > arm64 from hardware or software side? > > It is a drawback of the arm64 implementations. Having multiple TLB > entries for the same VA would need additional logic in hardware to > detect, so the microarchitects have pushed back. In ARMv8.4, some > balanced was reached with FEAT_BBM so that the only visible side-effect > is a potential TLB conflict abort that could be resolved by software. I see, thx. > > > I ever got a arm64 server with huge memory, w or w/o crashkernel setting > > have different bootup time. And the more often TLB miss and flush will > > cause performance cost. It is really a pity if we have very powerful > > arm64 cpu and system capacity, but bottlenecked by this drawback. > > Is it only the boot time affected or the runtime performance as well? Sorry for the late reply. What I observed is serious boot-time latency with huge memory. Since the timestamp is not available at that time, we can't tell the number. I didn't check the runtime performance.
On 2022/6/27 10:52, Baoquan He wrote: > On 06/23/22 at 03:07pm, Catalin Marinas wrote: >> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote: >>> On 06/21/22 at 07:04pm, Catalin Marinas wrote: >>>> The problem with splitting is that you can end up with two entries in >>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another >>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict >>>> abort (but can be worse like loss of coherency). >>> >>> Thanks for this explanation. Is this a drawback of arm64 design? X86 >>> code do the same thing w/o issue, is there way to overcome this on >>> arm64 from hardware or software side? >> >> It is a drawback of the arm64 implementations. Having multiple TLB >> entries for the same VA would need additional logic in hardware to >> detect, so the microarchitects have pushed back. In ARMv8.4, some >> balanced was reached with FEAT_BBM so that the only visible side-effect >> is a potential TLB conflict abort that could be resolved by software. > > I see, thx. > >> >>> I ever got a arm64 server with huge memory, w or w/o crashkernel setting >>> have different bootup time. And the more often TLB miss and flush will >>> cause performance cost. It is really a pity if we have very powerful >>> arm64 cpu and system capacity, but bottlenecked by this drawback. >> >> Is it only the boot time affected or the runtime performance as well? > > Sorry for late reply. What I observerd is the boot time serious latecy > with huge memory. Since the timestamp is not available at that time, > we can't tell the number. I didn't notice the runtime performance. There's some data here, and I see you're not on the cc list. https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/ > > . >
On 06/27/22 at 05:17pm, Leizhen (ThunderTown) wrote: > > > On 2022/6/27 10:52, Baoquan He wrote: > > On 06/23/22 at 03:07pm, Catalin Marinas wrote: > >> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote: > >>> On 06/21/22 at 07:04pm, Catalin Marinas wrote: > >>>> The problem with splitting is that you can end up with two entries in > >>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another > >>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict > >>>> abort (but can be worse like loss of coherency). > >>> > >>> Thanks for this explanation. Is this a drawback of arm64 design? X86 > >>> code do the same thing w/o issue, is there way to overcome this on > >>> arm64 from hardware or software side? > >> > >> It is a drawback of the arm64 implementations. Having multiple TLB > >> entries for the same VA would need additional logic in hardware to > >> detect, so the microarchitects have pushed back. In ARMv8.4, some > >> balanced was reached with FEAT_BBM so that the only visible side-effect > >> is a potential TLB conflict abort that could be resolved by software. > > > > I see, thx. > > > >> > >>> I ever got a arm64 server with huge memory, w or w/o crashkernel setting > >>> have different bootup time. And the more often TLB miss and flush will > >>> cause performance cost. It is really a pity if we have very powerful > >>> arm64 cpu and system capacity, but bottlenecked by this drawback. > >> > >> Is it only the boot time affected or the runtime performance as well? > > > > Sorry for late reply. What I observerd is the boot time serious latecy > > with huge memory. Since the timestamp is not available at that time, > > we can't tell the number. I didn't notice the runtime performance. > > There's some data here, and I see you're not on the cc list. > > https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/ Thanks, Zhen Lei. I also saw the patch. That seems to be a good way, since there's only one process running at that time. Not sure if there's still risk of multiple TLB entries for the same VA existing.
On 2022/6/27 18:17, Baoquan He wrote: > On 06/27/22 at 05:17pm, Leizhen (ThunderTown) wrote: >> >> >> On 2022/6/27 10:52, Baoquan He wrote: >>> On 06/23/22 at 03:07pm, Catalin Marinas wrote: >>>> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote: >>>>> On 06/21/22 at 07:04pm, Catalin Marinas wrote: >>>>>> The problem with splitting is that you can end up with two entries in >>>>>> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another >>>>>> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict >>>>>> abort (but can be worse like loss of coherency). >>>>> >>>>> Thanks for this explanation. Is this a drawback of arm64 design? X86 >>>>> code do the same thing w/o issue, is there way to overcome this on >>>>> arm64 from hardware or software side? >>>> >>>> It is a drawback of the arm64 implementations. Having multiple TLB >>>> entries for the same VA would need additional logic in hardware to >>>> detect, so the microarchitects have pushed back. In ARMv8.4, some >>>> balanced was reached with FEAT_BBM so that the only visible side-effect >>>> is a potential TLB conflict abort that could be resolved by software. >>> >>> I see, thx. >>> >>>> >>>>> I ever got a arm64 server with huge memory, w or w/o crashkernel setting >>>>> have different bootup time. And the more often TLB miss and flush will >>>>> cause performance cost. It is really a pity if we have very powerful >>>>> arm64 cpu and system capacity, but bottlenecked by this drawback. >>>> >>>> Is it only the boot time affected or the runtime performance as well? >>> >>> Sorry for late reply. What I observerd is the boot time serious latecy >>> with huge memory. Since the timestamp is not available at that time, >>> we can't tell the number. I didn't notice the runtime performance. >> >> There's some data here, and I see you're not on the cc list. >> >> https://lore.kernel.org/linux-mm/1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com/T/ > > Thanks, Zhen Lei. I also saw the patch. That seems to be a good way, Yes. > since there's only one process running at that time. Not sure if there's > still risk of multiple TLB entries for the same VA existing. > > . >
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fb24efbc46f5ef4..ae0bae2cafe6ab0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -141,15 +141,44 @@ static void __init reserve_crashkernel(int dma_state)
 	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
 	char *cmdline = boot_command_line;
 	int dma_enabled = IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32);
-	int ret;
+	int ret, skip_res = 0, skip_low_res = 0;
 	bool fixed_base;
 
 	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
 		return;
 
-	if ((!dma_enabled && (dma_state != DMA_PHYS_LIMIT_UNKNOWN)) ||
-	    (dma_enabled && (dma_state != DMA_PHYS_LIMIT_KNOWN)))
-		return;
+	/*
+	 * In the following table:
+	 *   X,high  means crashkernel=X,high
+	 *   unknown means dma_state = DMA_PHYS_LIMIT_UNKNOWN
+	 *   known   means dma_state = DMA_PHYS_LIMIT_KNOWN
+	 *
+	 * The first two columns indicate the status, and the last two
+	 * columns indicate the phase in which crash high or low memory
+	 * needs to be reserved.
+	 *  ---------------------------------------------------
+	 *  | DMA enabled | X,high used |  unknown  |  known  |
+	 *  ---------------------------------------------------
+	 *  |      N             N      |    low    |   NOP   |
+	 *  |      Y             N      |    NOP    |   low   |
+	 *  |      N             Y      | high/low  |   NOP   |
+	 *  |      Y             Y      |   high    |   low   |
+	 *  ---------------------------------------------------
+	 *
+	 * But in this function, the crash high memory allocation of
+	 * crashkernel=Y,high and the crash low memory allocation of
+	 * crashkernel=X[@offset] for crashk_res are mixed at one place.
+	 * So the table above need to be adjusted as below:
+	 *  ---------------------------------------------------
+	 *  | DMA enabled | X,high used |  unknown  |  known  |
+	 *  ---------------------------------------------------
+	 *  |      N             N      |    res    |   NOP   |
+	 *  |      Y             N      |    NOP    |   res   |
+	 *  |      N             Y      |res/low_res|   NOP   |
+	 *  |      Y             Y      |    res    | low_res |
+	 *  ---------------------------------------------------
+	 *
+	 */
 
 	/* crashkernel=X[@offset] */
 	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
@@ -169,10 +198,33 @@ static void __init reserve_crashkernel(int dma_state)
 		else if (ret)
 			return;
 
+		/* See the third row of the second table above, NOP */
+		if (!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN))
+			return;
+
+		/* See the fourth row of the second table above */
+		if (dma_enabled) {
+			if (dma_state == DMA_PHYS_LIMIT_UNKNOWN)
+				skip_low_res = 1;
+			else
+				skip_res = 1;
+		}
+
 		crash_max = CRASH_ADDR_HIGH_MAX;
 	} else if (ret || !crash_size) {
 		/* The specified value is invalid */
 		return;
+	} else {
+		/* See the 1-2 rows of the second table above, NOP */
+		if ((!dma_enabled && (dma_state == DMA_PHYS_LIMIT_KNOWN)) ||
+		    (dma_enabled && (dma_state == DMA_PHYS_LIMIT_UNKNOWN)))
+			return;
+	}
+
+	if (skip_res) {
+		crash_base = crashk_res.start;
+		crash_size = crashk_res.end - crashk_res.start + 1;
+		goto check_low;
 	}
 
 	fixed_base = !!crash_base;
@@ -202,9 +254,18 @@ static void __init reserve_crashkernel(int dma_state)
 		return;
 	}
 
+	crashk_res.start = crash_base;
+	crashk_res.end = crash_base + crash_size - 1;
+
+check_low:
+	if (skip_low_res)
+		return;
+
 	if ((crash_base >= CRASH_ADDR_LOW_MAX) &&
 	     crash_low_size && reserve_crashkernel_low(crash_low_size)) {
 		memblock_phys_free(crash_base, crash_size);
+		crashk_res.start = 0;
+		crashk_res.end = 0;
 		return;
 	}
 
@@ -219,8 +280,6 @@ static void __init reserve_crashkernel(int dma_state)
 	if (crashk_low_res.end)
 		kmemleak_ignore_phys(crashk_low_res.start);
 
-	crashk_res.start = crash_base;
-	crashk_res.end = crash_base + crash_size - 1;
 	insert_resource(&iomem_resource, &crashk_res);
 }
If the crashkernel has both high memory above DMA zones and low memory
in DMA zones, kexec always loads the content such as Image and dtb to the
high memory instead of the low memory. This means that only high memory
requires write protection based on page-level mapping. The allocation of
high memory does not depend on the DMA boundary. So we can reserve the
high memory first even if the crashkernel reservation is deferred.

This means that the block mapping can still be performed on other kernel
linear address spaces, the TLB miss rate can be reduced and the system
performance will be improved.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/arm64/mm/init.c | 71 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 6 deletions(-)
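As a usage note for the tables in the comment above, the two command-line forms they distinguish are the documented crashkernel= syntaxes (see Documentation/admin-guide/kernel-parameters.txt); the sizes below are arbitrary examples:

    crashkernel=512M                            # crashkernel=X[@offset] form ("X,high used" = N rows)
    crashkernel=2G,high crashkernel=256M,low    # crashkernel=X,high form ("X,high used" = Y rows)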