Message ID | 20230925234837.86786-3-mike.kravetz@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Batch hugetlb vmemmap modification operations | expand |
On 26.09.2023 01:48, Mike Kravetz wrote: > Allocation of a hugetlb page for the hugetlb pool is done by the routine > alloc_pool_huge_page. This routine will allocate contiguous pages from > a low level allocator, prep the pages for usage as a hugetlb page and > then add the resulting hugetlb page to the pool. > > In the 'prep' stage, optional vmemmap optimization is done. For > performance reasons we want to perform vmemmap optimization on multiple > hugetlb pages at once. To do this, restructure the hugetlb pool > allocation code such that vmemmap optimization can be isolated and later > batched. > > The code to allocate hugetlb pages from bootmem was also modified to > allow batching. > > No functional changes, only code restructure. > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > Reviewed-by: Muchun Song <songmuchun@bytedance.com> > --- Hi, looks like this patch prevents today's next from booting on at least one Qualcomm ARM64 platform. Reverting it makes the device boot again. Konrad
On 09/27/23 13:26, Konrad Dybcio wrote: > > > On 26.09.2023 01:48, Mike Kravetz wrote: > > Allocation of a hugetlb page for the hugetlb pool is done by the routine > > alloc_pool_huge_page. This routine will allocate contiguous pages from > > a low level allocator, prep the pages for usage as a hugetlb page and > > then add the resulting hugetlb page to the pool. > > > > In the 'prep' stage, optional vmemmap optimization is done. For > > performance reasons we want to perform vmemmap optimization on multiple > > hugetlb pages at once. To do this, restructure the hugetlb pool > > allocation code such that vmemmap optimization can be isolated and later > > batched. > > > > The code to allocate hugetlb pages from bootmem was also modified to > > allow batching. > > > > No functional changes, only code restructure. > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > Reviewed-by: Muchun Song <songmuchun@bytedance.com> > > --- > Hi, looks like this patch prevents today's next from booting > on at least one Qualcomm ARM64 platform. Reverting it makes > the device boot again. Can you share the config used and any other specific information such as kernel command line. I can not reproduce on the arm64 platforms I have. Been trying various config combinations without success. Although, there are lots of possibilities. Also, taking a closer look at the changes. So far, nothing is obvious.
On 9/29/23 22:57, Mike Kravetz wrote: > On 09/27/23 13:26, Konrad Dybcio wrote: >> >> >> On 26.09.2023 01:48, Mike Kravetz wrote: >>> Allocation of a hugetlb page for the hugetlb pool is done by the routine >>> alloc_pool_huge_page. This routine will allocate contiguous pages from >>> a low level allocator, prep the pages for usage as a hugetlb page and >>> then add the resulting hugetlb page to the pool. >>> >>> In the 'prep' stage, optional vmemmap optimization is done. For >>> performance reasons we want to perform vmemmap optimization on multiple >>> hugetlb pages at once. To do this, restructure the hugetlb pool >>> allocation code such that vmemmap optimization can be isolated and later >>> batched. >>> >>> The code to allocate hugetlb pages from bootmem was also modified to >>> allow batching. >>> >>> No functional changes, only code restructure. >>> >>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> >>> Reviewed-by: Muchun Song <songmuchun@bytedance.com> >>> --- >> Hi, looks like this patch prevents today's next from booting >> on at least one Qualcomm ARM64 platform. Reverting it makes >> the device boot again. > > Can you share the config used and any other specific information such as > kernel command line. Later this week. Konrad
On 10/02/23 11:57, Konrad Dybcio wrote: > > > On 9/29/23 22:57, Mike Kravetz wrote: > > On 09/27/23 13:26, Konrad Dybcio wrote: > > > > > > > > > On 26.09.2023 01:48, Mike Kravetz wrote: > > > > Allocation of a hugetlb page for the hugetlb pool is done by the routine > > > > alloc_pool_huge_page. This routine will allocate contiguous pages from > > > > a low level allocator, prep the pages for usage as a hugetlb page and > > > > then add the resulting hugetlb page to the pool. > > > > > > > > In the 'prep' stage, optional vmemmap optimization is done. For > > > > performance reasons we want to perform vmemmap optimization on multiple > > > > hugetlb pages at once. To do this, restructure the hugetlb pool > > > > allocation code such that vmemmap optimization can be isolated and later > > > > batched. > > > > > > > > The code to allocate hugetlb pages from bootmem was also modified to > > > > allow batching. > > > > > > > > No functional changes, only code restructure. > > > > > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > > > Reviewed-by: Muchun Song <songmuchun@bytedance.com> > > > > --- > > > Hi, looks like this patch prevents today's next from booting > > > on at least one Qualcomm ARM64 platform. Reverting it makes > > > the device boot again. > > > > Can you share the config used and any other specific information such as > > kernel command line. > Later this week. As mentioned, I have been unable to reproduce on arm64 platforms I can access. I have tried various config and boot options. While doing so, I came across one issue impacting kernels compiled without CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined. This is not something that would prevent booting. I will send out an updated version series in the hope that any other issues may be discovered.
On 6.10.2023 05:08, Mike Kravetz wrote: > On 10/02/23 11:57, Konrad Dybcio wrote: >> >> >> On 9/29/23 22:57, Mike Kravetz wrote: >>> On 09/27/23 13:26, Konrad Dybcio wrote: >>>> >>>> >>>> On 26.09.2023 01:48, Mike Kravetz wrote: >>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine >>>>> alloc_pool_huge_page. This routine will allocate contiguous pages from >>>>> a low level allocator, prep the pages for usage as a hugetlb page and >>>>> then add the resulting hugetlb page to the pool. >>>>> >>>>> In the 'prep' stage, optional vmemmap optimization is done. For >>>>> performance reasons we want to perform vmemmap optimization on multiple >>>>> hugetlb pages at once. To do this, restructure the hugetlb pool >>>>> allocation code such that vmemmap optimization can be isolated and later >>>>> batched. >>>>> >>>>> The code to allocate hugetlb pages from bootmem was also modified to >>>>> allow batching. >>>>> >>>>> No functional changes, only code restructure. >>>>> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> >>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com> >>>>> --- >>>> Hi, looks like this patch prevents today's next from booting >>>> on at least one Qualcomm ARM64 platform. Reverting it makes >>>> the device boot again. >>> >>> Can you share the config used and any other specific information such as >>> kernel command line. >> Later this week. > > As mentioned, I have been unable to reproduce on arm64 platforms I can > access. I have tried various config and boot options. While doing so, > I came across one issue impacting kernels compiled without > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined. This is not something > that would prevent booting. > > I will send out an updated version series in the hope that any other > issues may be discovered. I'm pushing the "later this week" by answering near end of calendar day, Friday, but it seems like this patch in v7 still prevents the device from booting.. You can find my defconfig at the link below. https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e Konrad
On 10/06/23 23:39, Konrad Dybcio wrote: > On 6.10.2023 05:08, Mike Kravetz wrote: > > On 10/02/23 11:57, Konrad Dybcio wrote: > >> > >> > >> On 9/29/23 22:57, Mike Kravetz wrote: > >>> On 09/27/23 13:26, Konrad Dybcio wrote: > >>>> > >>>> > >>>> On 26.09.2023 01:48, Mike Kravetz wrote: > >>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine > >>>>> alloc_pool_huge_page. This routine will allocate contiguous pages from > >>>>> a low level allocator, prep the pages for usage as a hugetlb page and > >>>>> then add the resulting hugetlb page to the pool. > >>>>> > >>>>> In the 'prep' stage, optional vmemmap optimization is done. For > >>>>> performance reasons we want to perform vmemmap optimization on multiple > >>>>> hugetlb pages at once. To do this, restructure the hugetlb pool > >>>>> allocation code such that vmemmap optimization can be isolated and later > >>>>> batched. > >>>>> > >>>>> The code to allocate hugetlb pages from bootmem was also modified to > >>>>> allow batching. > >>>>> > >>>>> No functional changes, only code restructure. > >>>>> > >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > >>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com> > >>>>> --- > >>>> Hi, looks like this patch prevents today's next from booting > >>>> on at least one Qualcomm ARM64 platform. Reverting it makes > >>>> the device boot again. > >>> > >>> Can you share the config used and any other specific information such as > >>> kernel command line. > >> Later this week. > > > > As mentioned, I have been unable to reproduce on arm64 platforms I can > > access. I have tried various config and boot options. While doing so, > > I came across one issue impacting kernels compiled without > > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined. This is not something > > that would prevent booting. > > > > I will send out an updated version series in the hope that any other > > issues may be discovered. > I'm pushing the "later this week" by answering near end of calendar > day, Friday, but it seems like this patch in v7 still prevents the > device from booting.. > > You can find my defconfig at the link below. > > https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e > Thanks! I assume there is no further information such as any console output? Did any of you other arm64 platforms have this issue? Just trying to get as much information as possible to get to root cause.
Hi, Konrad, Just wondering, is your arm64 system a VM instance, or a bare metal? thanks! -jane On 10/6/2023 2:39 PM, Konrad Dybcio wrote: > On 6.10.2023 05:08, Mike Kravetz wrote: >> On 10/02/23 11:57, Konrad Dybcio wrote: >>> >>> >>> On 9/29/23 22:57, Mike Kravetz wrote: >>>> On 09/27/23 13:26, Konrad Dybcio wrote: >>>> [..] >> I will send out an updated version series in the hope that any other >> issues may be discovered. > I'm pushing the "later this week" by answering near end of calendar > day, Friday, but it seems like this patch in v7 still prevents the > device from booting.. > > You can find my defconfig at the link below. > > https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e > > Konrad >
On 10/06/23 15:35, Mike Kravetz wrote: > On 10/06/23 23:39, Konrad Dybcio wrote: > > On 6.10.2023 05:08, Mike Kravetz wrote: > > > On 10/02/23 11:57, Konrad Dybcio wrote: > > >> > > >> > > >> On 9/29/23 22:57, Mike Kravetz wrote: > > >>> On 09/27/23 13:26, Konrad Dybcio wrote: > > >>>> > > >>>> > > >>>> On 26.09.2023 01:48, Mike Kravetz wrote: > > >>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine > > >>>>> alloc_pool_huge_page. This routine will allocate contiguous pages from > > >>>>> a low level allocator, prep the pages for usage as a hugetlb page and > > >>>>> then add the resulting hugetlb page to the pool. > > >>>>> > > >>>>> In the 'prep' stage, optional vmemmap optimization is done. For > > >>>>> performance reasons we want to perform vmemmap optimization on multiple > > >>>>> hugetlb pages at once. To do this, restructure the hugetlb pool > > >>>>> allocation code such that vmemmap optimization can be isolated and later > > >>>>> batched. > > >>>>> > > >>>>> The code to allocate hugetlb pages from bootmem was also modified to > > >>>>> allow batching. > > >>>>> > > >>>>> No functional changes, only code restructure. > > >>>>> > > >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > >>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com> > > >>>>> --- > > >>>> Hi, looks like this patch prevents today's next from booting > > >>>> on at least one Qualcomm ARM64 platform. Reverting it makes > > >>>> the device boot again. > > >>> > > >>> Can you share the config used and any other specific information such as > > >>> kernel command line. > > >> Later this week. > > > > > > As mentioned, I have been unable to reproduce on arm64 platforms I can > > > access. I have tried various config and boot options. While doing so, > > > I came across one issue impacting kernels compiled without > > > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined. This is not something > > > that would prevent booting. > > > > > > I will send out an updated version series in the hope that any other > > > issues may be discovered. > > I'm pushing the "later this week" by answering near end of calendar > > day, Friday, but it seems like this patch in v7 still prevents the > > device from booting.. > > > > You can find my defconfig at the link below. > > > > https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e > > > > Thanks! > > I assume there is no further information such as any console output? > Did any of you other arm64 platforms have this issue? > > Just trying to get as much information as possible to get to root cause. I have not had success isolating the issue with your config file. Since the only code changes in this patch deal with allocating hugetlb pages, I assume this is what you are doing? Can you let me know how you are performing the allocations? I assume it is on the kernel command line as these would be processed earliest in boot. If you are not allocating hugetlb pages, then I need to think of what else may be happening. Anshuman, any chance you (or someone else with access to arm64 platforms) could throw this on any platforms you have access to for a quick test?
On 9.10.2023 05:29, Mike Kravetz wrote: > On 10/06/23 15:35, Mike Kravetz wrote: >> On 10/06/23 23:39, Konrad Dybcio wrote: >>> On 6.10.2023 05:08, Mike Kravetz wrote: >>>> On 10/02/23 11:57, Konrad Dybcio wrote: >>>>> >>>>> >>>>> On 9/29/23 22:57, Mike Kravetz wrote: >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote: >>>>>>> >>>>>>> >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote: >>>>>>>> Allocation of a hugetlb page for the hugetlb pool is done by the routine >>>>>>>> alloc_pool_huge_page. This routine will allocate contiguous pages from >>>>>>>> a low level allocator, prep the pages for usage as a hugetlb page and >>>>>>>> then add the resulting hugetlb page to the pool. >>>>>>>> >>>>>>>> In the 'prep' stage, optional vmemmap optimization is done. For >>>>>>>> performance reasons we want to perform vmemmap optimization on multiple >>>>>>>> hugetlb pages at once. To do this, restructure the hugetlb pool >>>>>>>> allocation code such that vmemmap optimization can be isolated and later >>>>>>>> batched. >>>>>>>> >>>>>>>> The code to allocate hugetlb pages from bootmem was also modified to >>>>>>>> allow batching. >>>>>>>> >>>>>>>> No functional changes, only code restructure. >>>>>>>> >>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> >>>>>>>> Reviewed-by: Muchun Song <songmuchun@bytedance.com> >>>>>>>> --- >>>>>>> Hi, looks like this patch prevents today's next from booting >>>>>>> on at least one Qualcomm ARM64 platform. Reverting it makes >>>>>>> the device boot again. >>>>>> >>>>>> Can you share the config used and any other specific information such as >>>>>> kernel command line. >>>>> Later this week. >>>> >>>> As mentioned, I have been unable to reproduce on arm64 platforms I can >>>> access. I have tried various config and boot options. While doing so, >>>> I came across one issue impacting kernels compiled without >>>> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined. This is not something >>>> that would prevent booting. >>>> >>>> I will send out an updated version series in the hope that any other >>>> issues may be discovered. >>> I'm pushing the "later this week" by answering near end of calendar >>> day, Friday, but it seems like this patch in v7 still prevents the >>> device from booting.. >>> >>> You can find my defconfig at the link below. >>> >>> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e >>> >> >> Thanks! >> >> I assume there is no further information such as any console output? >> Did any of you other arm64 platforms have this issue? >> >> Just trying to get as much information as possible to get to root cause. > > I have not had success isolating the issue with your config file. > > Since the only code changes in this patch deal with allocating hugetlb > pages, I assume this is what you are doing? Can you let me know how you > are performing the allocations? I assume it is on the kernel command > line as these would be processed earliest in boot. > > If you are not allocating hugetlb pages, then I need to think of what > else may be happening. > > Anshuman, any chance you (or someone else with access to arm64 platforms) > could throw this on any platforms you have access to for a quick test? I managed to get a boot log: https://pastebin.com/GwurpCw9 This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference Konrad
On 7.10.2023 03:51, Jane Chu wrote: > Hi, Konrad, > > Just wondering, is your arm64 system a VM instance, or a bare metal? That's a tricky question :) Qualcomm platforms expose much of the hardware in a manner that's similar to a VM, there's an extensive irreplaceable hypervisor in place and the user can only boot Linux at EL1.. So, I guess the answer here is "somewhat bare metal" :/ Konrad > > thanks! > -jane > > > On 10/6/2023 2:39 PM, Konrad Dybcio wrote: >> On 6.10.2023 05:08, Mike Kravetz wrote: >>> On 10/02/23 11:57, Konrad Dybcio wrote: >>>> >>>> >>>> On 9/29/23 22:57, Mike Kravetz wrote: >>>>> On 09/27/23 13:26, Konrad Dybcio wrote: >>>>> > [..] >>> I will send out an updated version series in the hope that any other >>> issues may be discovered. >> I'm pushing the "later this week" by answering near end of calendar >> day, Friday, but it seems like this patch in v7 still prevents the >> device from booting.. >> >> You can find my defconfig at the link below. >> >> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e >> >> Konrad >>
On 10/09/23 12:11, Konrad Dybcio wrote: > On 9.10.2023 05:29, Mike Kravetz wrote: > > On 10/06/23 15:35, Mike Kravetz wrote: > >> On 10/06/23 23:39, Konrad Dybcio wrote: > >>> On 6.10.2023 05:08, Mike Kravetz wrote: > >>>> On 10/02/23 11:57, Konrad Dybcio wrote: > >>>>> On 9/29/23 22:57, Mike Kravetz wrote: > >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote: > >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote: > > I managed to get a boot log: > > https://pastebin.com/GwurpCw9 > > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference > Early on in boot log before the panic, I see this in the log: [ 0.000000] efi: UEFI not found. [ 0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader! Isn't that misalignment pretty serious? Or, is is possible to run with that? There are no hugetlb pages allocated at boot time: [ 0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a noop. The first thing it does is: list_for_each_entry(m, &huge_boot_pages, list) { ... } However, huge_boot_pages should be empty as initialized here: __initdata LIST_HEAD(huge_boot_pages); At the end of the routine, we call prep_and_add_bootmem_folios to process the local list created withing that above loop: LIST_HEAD(folio_list); This should also be empty and a noop. Is it possible that the misaligned kernel image could make these lists appear as non-empty?
On 10/09/23 08:04, Mike Kravetz wrote: > On 10/09/23 12:11, Konrad Dybcio wrote: > > On 9.10.2023 05:29, Mike Kravetz wrote: > > > On 10/06/23 15:35, Mike Kravetz wrote: > > >> On 10/06/23 23:39, Konrad Dybcio wrote: > > >>> On 6.10.2023 05:08, Mike Kravetz wrote: > > >>>> On 10/02/23 11:57, Konrad Dybcio wrote: > > >>>>> On 9/29/23 22:57, Mike Kravetz wrote: > > >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote: > > >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote: > > > > I managed to get a boot log: > > > > https://pastebin.com/GwurpCw9 > > > > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference > > > > Early on in boot log before the panic, I see this in the log: > > [ 0.000000] efi: UEFI not found. > [ 0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader! > > Isn't that misalignment pretty serious? Or, is is possible to run with that? > > There are no hugetlb pages allocated at boot time: > > [ 0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm > > So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a > noop. The first thing it does is: > list_for_each_entry(m, &huge_boot_pages, list) { > ... > } > > However, huge_boot_pages should be empty as initialized here: > __initdata LIST_HEAD(huge_boot_pages); > > At the end of the routine, we call prep_and_add_bootmem_folios to > process the local list created withing that above loop: > > LIST_HEAD(folio_list); > > This should also be empty and a noop. > > Is it possible that the misaligned kernel image could make these lists > appear as non-empty? Actually, just saw this: https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ Will take a look, although as mentioned above prep_and_add_bootmem_folios on an empty list should be a noop.
On 10/9/23 17:04, Mike Kravetz wrote: > On 10/09/23 12:11, Konrad Dybcio wrote: >> On 9.10.2023 05:29, Mike Kravetz wrote: >>> On 10/06/23 15:35, Mike Kravetz wrote: >>>> On 10/06/23 23:39, Konrad Dybcio wrote: >>>>> On 6.10.2023 05:08, Mike Kravetz wrote: >>>>>> On 10/02/23 11:57, Konrad Dybcio wrote: >>>>>>> On 9/29/23 22:57, Mike Kravetz wrote: >>>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote: >>>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote: >> >> I managed to get a boot log: >> >> https://pastebin.com/GwurpCw9 >> >> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference >> > > Early on in boot log before the panic, I see this in the log: > > [ 0.000000] efi: UEFI not found. > [ 0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader! > > Isn't that misalignment pretty serious? Or, is is possible to run with that? That has never caused any issues and sadly I can't do anything about it. > > There are no hugetlb pages allocated at boot time: > > [ 0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm > > So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a > noop. The first thing it does is: > list_for_each_entry(m, &huge_boot_pages, list) { > ... > } > > However, huge_boot_pages should be empty as initialized here: > __initdata LIST_HEAD(huge_boot_pages); > > At the end of the routine, we call prep_and_add_bootmem_folios to > process the local list created withing that above loop: > > LIST_HEAD(folio_list); > > This should also be empty and a noop. > > Is it possible that the misaligned kernel image could make these lists > appear as non-empty? I don't think I have an answer for this Konrad
On 10/9/23 17:15, Mike Kravetz wrote: > On 10/09/23 08:04, Mike Kravetz wrote: >> On 10/09/23 12:11, Konrad Dybcio wrote: >>> On 9.10.2023 05:29, Mike Kravetz wrote: >>>> On 10/06/23 15:35, Mike Kravetz wrote: >>>>> On 10/06/23 23:39, Konrad Dybcio wrote: >>>>>> On 6.10.2023 05:08, Mike Kravetz wrote: >>>>>>> On 10/02/23 11:57, Konrad Dybcio wrote: >>>>>>>> On 9/29/23 22:57, Mike Kravetz wrote: >>>>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote: >>>>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote: >>> >>> I managed to get a boot log: >>> >>> https://pastebin.com/GwurpCw9 >>> >>> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference >>> >> >> Early on in boot log before the panic, I see this in the log: >> >> [ 0.000000] efi: UEFI not found. >> [ 0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader! >> >> Isn't that misalignment pretty serious? Or, is is possible to run with that? >> >> There are no hugetlb pages allocated at boot time: >> >> [ 0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm >> >> So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a >> noop. The first thing it does is: >> list_for_each_entry(m, &huge_boot_pages, list) { >> ... >> } >> >> However, huge_boot_pages should be empty as initialized here: >> __initdata LIST_HEAD(huge_boot_pages); >> >> At the end of the routine, we call prep_and_add_bootmem_folios to >> process the local list created withing that above loop: >> >> LIST_HEAD(folio_list); >> >> This should also be empty and a noop. >> >> Is it possible that the misaligned kernel image could make these lists >> appear as non-empty? > > Actually, just saw this: > > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ > > Will take a look, although as mentioned above prep_and_add_bootmem_folios on > an empty list should be a noop. I'll try it out atop the series tomorrow or so. Konrad
On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > > This should also be empty and a noop. > > > > Is it possible that the misaligned kernel image could make these lists > > appear as non-empty? > > Actually, just saw this: > > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ > > Will take a look, although as mentioned above prep_and_add_bootmem_folios on > an empty list should be a noop. Konrad, are you able to test Usama's patch? Thanks. From: Usama Arif <usama.arif@bytedance.com> Subject: mm: hugetlb: only prep and add allocated folios for non-gigantic pages Date: Mon, 9 Oct 2023 15:56:05 +0100 Calling prep_and_add_allocated_folios when allocating gigantic pages at boot time causes the kernel to crash as folio_list is empty and iterating it causes a NULL pointer dereference. Call this only for non-gigantic pages when folio_list has entries. Link: https://lkml.kernel.org/r/20231009145605.2150897-1-usama.arif@bytedance.com Fixes: bfb41d6b2fe148 ("hugetlb: restructure pool allocations") Signed-off-by: Usama Arif <usama.arif@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/hugetlb.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- a/mm/hugetlb.c~hugetlb-restructure-pool-allocations-fix +++ a/mm/hugetlb.c @@ -3307,7 +3307,8 @@ static void __init hugetlb_hstate_alloc_ } /* list will be empty if hstate_is_gigantic */ - prep_and_add_allocated_folios(h, &folio_list); + if (!hstate_is_gigantic(h)) + prep_and_add_allocated_folios(h, &folio_list); if (i < h->max_huge_pages) { char buf[32];
On 10/09/23 23:09, Konrad Dybcio wrote: > > > On 10/9/23 17:15, Mike Kravetz wrote: > > On 10/09/23 08:04, Mike Kravetz wrote: > > > On 10/09/23 12:11, Konrad Dybcio wrote: > > > > On 9.10.2023 05:29, Mike Kravetz wrote: > > > > > On 10/06/23 15:35, Mike Kravetz wrote: > > > > > > On 10/06/23 23:39, Konrad Dybcio wrote: > > > > > > > On 6.10.2023 05:08, Mike Kravetz wrote: > > > > > > > > On 10/02/23 11:57, Konrad Dybcio wrote: > > > > > > > > > On 9/29/23 22:57, Mike Kravetz wrote: > > > > > > > > > > On 09/27/23 13:26, Konrad Dybcio wrote: > > > > > > > > > > > On 26.09.2023 01:48, Mike Kravetz wrote: > > > > > > > > I managed to get a boot log: > > > > > > > > https://pastebin.com/GwurpCw9 > > > > > > > > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference > > > > > > > > > > Early on in boot log before the panic, I see this in the log: > > > > > > [ 0.000000] efi: UEFI not found. > > > [ 0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader! > > > > > > Isn't that misalignment pretty serious? Or, is is possible to run with that? > > > > > > There are no hugetlb pages allocated at boot time: > > > > > > [ 0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm > > > > > > So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a > > > noop. The first thing it does is: > > > list_for_each_entry(m, &huge_boot_pages, list) { > > > ... > > > } > > > > > > However, huge_boot_pages should be empty as initialized here: > > > __initdata LIST_HEAD(huge_boot_pages); > > > > > > At the end of the routine, we call prep_and_add_bootmem_folios to > > > process the local list created withing that above loop: > > > > > > LIST_HEAD(folio_list); > > > > > > This should also be empty and a noop. > > > > > > Is it possible that the misaligned kernel image could make these lists > > > appear as non-empty? > > > > Actually, just saw this: > > > > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ > > > > Will take a look, although as mentioned above prep_and_add_bootmem_folios on > > an empty list should be a noop. > I'll try it out atop the series tomorrow or so. I just replied to Usama's patch. This may have more to do with IRQ enablement.
On 10/10/23 02:07, Andrew Morton wrote: > On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > >>> This should also be empty and a noop. >>> >>> Is it possible that the misaligned kernel image could make these lists >>> appear as non-empty? >> >> Actually, just saw this: >> >> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ >> >> Will take a look, although as mentioned above prep_and_add_bootmem_folios on >> an empty list should be a noop. > > Konrad, are you able to test Usama's patch? Thanks. I legitimately spent a sad amount of time trying to regain access to the remote board farm. Previously I could hit the bug on SM8550, but now I can't do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I have no idea when I'll be able to get access to SM8550 again. I did test it on the QCM6490 Fairphone 5 that I initially reported this on, and neither booting next-20231010 (with your patchset applied) nor adding the below patch on top of it seems to work. I can't get serial output from this device though to find out what it's unhappy about :/ Konrad
On 10/10/23 23:30, Konrad Dybcio wrote: > > > On 10/10/23 02:07, Andrew Morton wrote: > > On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > This should also be empty and a noop. > > > > > > > > Is it possible that the misaligned kernel image could make these lists > > > > appear as non-empty? > > > > > > Actually, just saw this: > > > > > > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ > > > > > > Will take a look, although as mentioned above prep_and_add_bootmem_folios on > > > an empty list should be a noop. > > > > Konrad, are you able to test Usama's patch? Thanks. > I legitimately spent a sad amount of time trying to regain access to the > remote board farm. Previously I could hit the bug on SM8550, but now I can't > do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I > have no idea when I'll be able to get access to SM8550 again. > > I did test it on the QCM6490 Fairphone 5 that I initially reported this on, > and neither booting next-20231010 (with your patchset applied) nor adding > the below patch on top of it seems to work. I can't get serial output from > this device though to find out what it's unhappy about :/ Sorry for causing you to spend so much time on this. As mentioned in the reply to Usama's patch, the root cause seems to be the locking. So, the real change to test is the locking changes in that thread; not Usama's patch.
On 10/10/23 23:45, Mike Kravetz wrote: > On 10/10/23 23:30, Konrad Dybcio wrote: >> >> >> On 10/10/23 02:07, Andrew Morton wrote: >>> On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote: >>> >>>>> This should also be empty and a noop. >>>>> >>>>> Is it possible that the misaligned kernel image could make these lists >>>>> appear as non-empty? >>>> >>>> Actually, just saw this: >>>> >>>> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/ >>>> >>>> Will take a look, although as mentioned above prep_and_add_bootmem_folios on >>>> an empty list should be a noop. >>> >>> Konrad, are you able to test Usama's patch? Thanks. >> I legitimately spent a sad amount of time trying to regain access to the >> remote board farm. Previously I could hit the bug on SM8550, but now I can't >> do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I >> have no idea when I'll be able to get access to SM8550 again. >> >> I did test it on the QCM6490 Fairphone 5 that I initially reported this on, >> and neither booting next-20231010 (with your patchset applied) nor adding >> the below patch on top of it seems to work. I can't get serial output from >> this device though to find out what it's unhappy about :/ > > Sorry for causing you to spend so much time on this. No worries, that was my explanation for why it took me so long to respond again.. > > As mentioned in the reply to Usama's patch, the root cause seems to be > the locking. So, the real change to test is the locking changes in > that thread; not Usama's patch. Ack Konrad
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 47159b9de633..64f50f3844fc 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1970,16 +1970,21 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid) h->nr_huge_pages_node[nid]++; } -static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio) +static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio) { folio_set_hugetlb(folio); - hugetlb_vmemmap_optimize(h, &folio->page); INIT_LIST_HEAD(&folio->lru); hugetlb_set_folio_subpool(folio, NULL); set_hugetlb_cgroup(folio, NULL); set_hugetlb_cgroup_rsvd(folio, NULL); } +static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio) +{ + init_new_hugetlb_folio(h, folio); + hugetlb_vmemmap_optimize(h, &folio->page); +} + static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int nid) { __prep_new_hugetlb_folio(h, folio); @@ -2190,16 +2195,9 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, return page_folio(page); } -/* - * Common helper to allocate a fresh hugetlb page. All specific allocators - * should use this function to get new hugetlb pages - * - * Note that returned page is 'frozen': ref count of head page and all tail - * pages is zero. - */ -static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h, - gfp_t gfp_mask, int nid, nodemask_t *nmask, - nodemask_t *node_alloc_noretry) +static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h, + gfp_t gfp_mask, int nid, nodemask_t *nmask, + nodemask_t *node_alloc_noretry) { struct folio *folio; bool retry = false; @@ -2212,6 +2210,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h, nid, nmask, node_alloc_noretry); if (!folio) return NULL; + if (hstate_is_gigantic(h)) { if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) { /* @@ -2226,32 +2225,80 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h, return NULL; } } - prep_new_hugetlb_folio(h, folio, folio_nid(folio)); return folio; } +static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h, + gfp_t gfp_mask, int nid, nodemask_t *nmask, + nodemask_t *node_alloc_noretry) +{ + struct folio *folio; + + folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, + node_alloc_noretry); + if (folio) + init_new_hugetlb_folio(h, folio); + return folio; +} + /* - * Allocates a fresh page to the hugetlb allocator pool in the node interleaved - * manner. + * Common helper to allocate a fresh hugetlb page. All specific allocators + * should use this function to get new hugetlb pages + * + * Note that returned page is 'frozen': ref count of head page and all tail + * pages is zero. */ -static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, - nodemask_t *node_alloc_noretry) +static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h, + gfp_t gfp_mask, int nid, nodemask_t *nmask, + nodemask_t *node_alloc_noretry) { struct folio *folio; - int nr_nodes, node; + + folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, + node_alloc_noretry); + if (!folio) + return NULL; + + prep_new_hugetlb_folio(h, folio, folio_nid(folio)); + return folio; +} + +static void prep_and_add_allocated_folios(struct hstate *h, + struct list_head *folio_list) +{ + struct folio *folio, *tmp_f; + + /* Add all new pool pages to free lists in one lock cycle */ + spin_lock_irq(&hugetlb_lock); + list_for_each_entry_safe(folio, tmp_f, folio_list, lru) { + __prep_account_new_huge_page(h, folio_nid(folio)); + enqueue_hugetlb_folio(h, folio); + } + spin_unlock_irq(&hugetlb_lock); +} + +/* + * Allocates a fresh hugetlb page in a node interleaved manner. The page + * will later be added to the appropriate hugetlb pool. + */ +static struct folio *alloc_pool_huge_folio(struct hstate *h, + nodemask_t *nodes_allowed, + nodemask_t *node_alloc_noretry) +{ gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; + int nr_nodes, node; for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { - folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node, + struct folio *folio; + + folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node, nodes_allowed, node_alloc_noretry); - if (folio) { - free_huge_folio(folio); /* free it into the hugepage allocator */ - return 1; - } + if (folio) + return folio; } - return 0; + return NULL; } /* @@ -3264,25 +3311,35 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, */ static void __init gather_bootmem_prealloc(void) { + LIST_HEAD(folio_list); struct huge_bootmem_page *m; + struct hstate *h, *prev_h = NULL; list_for_each_entry(m, &huge_boot_pages, list) { struct page *page = virt_to_page(m); struct folio *folio = (void *)page; - struct hstate *h = m->hstate; + + h = m->hstate; + /* + * It is possible to have multiple huge page sizes (hstates) + * in this list. If so, process each size separately. + */ + if (h != prev_h && prev_h != NULL) + prep_and_add_allocated_folios(prev_h, &folio_list); + prev_h = h; VM_BUG_ON(!hstate_is_gigantic(h)); WARN_ON(folio_ref_count(folio) != 1); hugetlb_folio_init_vmemmap(folio, h, HUGETLB_VMEMMAP_RESERVE_PAGES); - prep_new_hugetlb_folio(h, folio, folio_nid(folio)); + __prep_new_hugetlb_folio(h, folio); /* If HVO fails, initialize all tail struct pages */ if (!HPageVmemmapOptimized(&folio->page)) hugetlb_folio_init_tail_vmemmap(folio, HUGETLB_VMEMMAP_RESERVE_PAGES, pages_per_huge_page(h)); - free_huge_folio(folio); /* add to the hugepage allocator */ + list_add(&folio->lru, &folio_list); /* * We need to restore the 'stolen' pages to totalram_pages @@ -3292,6 +3349,8 @@ static void __init gather_bootmem_prealloc(void) adjust_managed_page_count(page, pages_per_huge_page(h)); cond_resched(); } + + prep_and_add_allocated_folios(h, &folio_list); } static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) @@ -3325,9 +3384,22 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) h->max_huge_pages_node[nid] = i; } +/* + * NOTE: this routine is called in different contexts for gigantic and + * non-gigantic pages. + * - For gigantic pages, this is called early in the boot process and + * pages are allocated from memblock allocated or something similar. + * Gigantic pages are actually added to pools later with the routine + * gather_bootmem_prealloc. + * - For non-gigantic pages, this is called later in the boot process after + * all of mm is up and functional. Pages are allocated from buddy and + * then added to hugetlb pools. + */ static void __init hugetlb_hstate_alloc_pages(struct hstate *h) { unsigned long i; + struct folio *folio; + LIST_HEAD(folio_list); nodemask_t *node_alloc_noretry; bool node_specific_alloc = false; @@ -3369,14 +3441,25 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h) for (i = 0; i < h->max_huge_pages; ++i) { if (hstate_is_gigantic(h)) { + /* + * gigantic pages not added to list as they are not + * added to pools now. + */ if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE)) break; - } else if (!alloc_pool_huge_page(h, - &node_states[N_MEMORY], - node_alloc_noretry)) - break; + } else { + folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY], + node_alloc_noretry); + if (!folio) + break; + list_add(&folio->lru, &folio_list); + } cond_resched(); } + + /* list will be empty if hstate_is_gigantic */ + prep_and_add_allocated_folios(h, &folio_list); + if (i < h->max_huge_pages) { char buf[32]; @@ -3510,7 +3593,9 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed, static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, nodemask_t *nodes_allowed) { - unsigned long min_count, ret; + unsigned long min_count; + unsigned long allocated; + struct folio *folio; LIST_HEAD(page_list); NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL); @@ -3587,7 +3672,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, break; } - while (count > persistent_huge_pages(h)) { + allocated = 0; + while (count > (persistent_huge_pages(h) + allocated)) { /* * If this allocation races such that we no longer need the * page, free_huge_folio will handle it by freeing the page @@ -3598,15 +3684,32 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, /* yield cpu to avoid soft lockup */ cond_resched(); - ret = alloc_pool_huge_page(h, nodes_allowed, + folio = alloc_pool_huge_folio(h, nodes_allowed, node_alloc_noretry); - spin_lock_irq(&hugetlb_lock); - if (!ret) + if (!folio) { + prep_and_add_allocated_folios(h, &page_list); + spin_lock_irq(&hugetlb_lock); goto out; + } + + list_add(&folio->lru, &page_list); + allocated++; /* Bail for signals. Probably ctrl-c from user */ - if (signal_pending(current)) + if (signal_pending(current)) { + prep_and_add_allocated_folios(h, &page_list); + spin_lock_irq(&hugetlb_lock); goto out; + } + + spin_lock_irq(&hugetlb_lock); + } + + /* Add allocated pages to the pool */ + if (!list_empty(&page_list)) { + spin_unlock_irq(&hugetlb_lock); + prep_and_add_allocated_folios(h, &page_list); + spin_lock_irq(&hugetlb_lock); } /* @@ -3632,8 +3735,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, * Collect pages to be removed on list without dropping lock */ while (min_count < persistent_huge_pages(h)) { - struct folio *folio; - folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0); if (!folio) break;