Message ID | 487751e1ccec8fcd32e25a06ce000617e96d7ae1.1613595269.git.andreyknvl@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm, kasan: don't poison boot memory | expand |
On 17.02.21 21:56, Andrey Konovalov wrote: > During boot, all non-reserved memblock memory is exposed to the buddy > allocator. Poisoning all that memory with KASAN lengthens boot time, > especially on systems with large amount of RAM. This patch makes > page_alloc to not call kasan_free_pages() on all new memory. > > __free_pages_core() is used when exposing fresh memory during system > boot and when onlining memory during hotplug. This patch adds a new > FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through > free_pages_prepare() from __free_pages_core(). > > This has little impact on KASAN memory tracking. > > Assuming that there are no references to newly exposed pages before they > are ever allocated, there won't be any intended (but buggy) accesses to > that memory that KASAN would normally detect. > > However, with this patch, KASAN stops detecting wild and large > out-of-bounds accesses that happen to land on a fresh memory page that > was never allocated. This is taken as an acceptable trade-off. > > All memory allocated normally when the boot is over keeps getting > poisoned as usual. > > Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d Not sure this is the right thing to do, see https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com Reversing the order in which memory gets allocated + used during boot (in a patch by me) might have revealed an invalid memory access during boot. I suspect that that issue would no longer get detected with your patch, as the invalid memory access would simply not get detected. Now, I cannot prove that :)
On Thu, Feb 18, 2021 at 9:55 AM David Hildenbrand <david@redhat.com> wrote: > > On 17.02.21 21:56, Andrey Konovalov wrote: > > During boot, all non-reserved memblock memory is exposed to the buddy > > allocator. Poisoning all that memory with KASAN lengthens boot time, > > especially on systems with large amount of RAM. This patch makes > > page_alloc to not call kasan_free_pages() on all new memory. > > > > __free_pages_core() is used when exposing fresh memory during system > > boot and when onlining memory during hotplug. This patch adds a new > > FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through > > free_pages_prepare() from __free_pages_core(). > > > > This has little impact on KASAN memory tracking. > > > > Assuming that there are no references to newly exposed pages before they > > are ever allocated, there won't be any intended (but buggy) accesses to > > that memory that KASAN would normally detect. > > > > However, with this patch, KASAN stops detecting wild and large > > out-of-bounds accesses that happen to land on a fresh memory page that > > was never allocated. This is taken as an acceptable trade-off. > > > > All memory allocated normally when the boot is over keeps getting > > poisoned as usual. > > > > Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > > Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d > > Not sure this is the right thing to do, see > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com > > Reversing the order in which memory gets allocated + used during boot > (in a patch by me) might have revealed an invalid memory access during boot. > > I suspect that that issue would no longer get detected with your patch, > as the invalid memory access would simply not get detected. Now, I > cannot prove that :) This looks like a good example. Ok, what we can do is: 1. For KASAN_GENERIC: leave everything as is to be able to detect these boot-time bugs. 2. For KASAN_SW_TAGS: remove boot-time poisoning via kasan_free_pages(), but use the "invalid" tag as the default shadow value. The end result should be the same: bad accesses will be detected. For unallocated memory as it has the default "invalid" tag, and for allocated memory as it's poisoned properly when allocated/freed. 3. For KASAN_HW_TAGS: just remove boot-time poisoning via kasan_free_pages(). As the memory tags have a random unspecified value, we'll still have a 15/16 chance to detect a memory corruption. This also makes sense from the performance perspective: KASAN_GENERIC isn't meant to be running in production, so having a larger perf impact is acceptable. The other two modes will be faster.
On 18.02.21 20:40, Andrey Konovalov wrote: > On Thu, Feb 18, 2021 at 9:55 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 17.02.21 21:56, Andrey Konovalov wrote: >>> During boot, all non-reserved memblock memory is exposed to the buddy >>> allocator. Poisoning all that memory with KASAN lengthens boot time, >>> especially on systems with large amount of RAM. This patch makes >>> page_alloc to not call kasan_free_pages() on all new memory. >>> >>> __free_pages_core() is used when exposing fresh memory during system >>> boot and when onlining memory during hotplug. This patch adds a new >>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through >>> free_pages_prepare() from __free_pages_core(). >>> >>> This has little impact on KASAN memory tracking. >>> >>> Assuming that there are no references to newly exposed pages before they >>> are ever allocated, there won't be any intended (but buggy) accesses to >>> that memory that KASAN would normally detect. >>> >>> However, with this patch, KASAN stops detecting wild and large >>> out-of-bounds accesses that happen to land on a fresh memory page that >>> was never allocated. This is taken as an acceptable trade-off. >>> >>> All memory allocated normally when the boot is over keeps getting >>> poisoned as usual. >>> >>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >> >> Not sure this is the right thing to do, see >> >> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >> >> Reversing the order in which memory gets allocated + used during boot >> (in a patch by me) might have revealed an invalid memory access during boot. >> >> I suspect that that issue would no longer get detected with your patch, >> as the invalid memory access would simply not get detected. Now, I >> cannot prove that :) > > This looks like a good example. > > Ok, what we can do is: > > 1. For KASAN_GENERIC: leave everything as is to be able to detect > these boot-time bugs. > > 2. For KASAN_SW_TAGS: remove boot-time poisoning via > kasan_free_pages(), but use the "invalid" tag as the default shadow > value. The end result should be the same: bad accesses will be > detected. For unallocated memory as it has the default "invalid" tag, > and for allocated memory as it's poisoned properly when > allocated/freed. > > 3. For KASAN_HW_TAGS: just remove boot-time poisoning via > kasan_free_pages(). As the memory tags have a random unspecified > value, we'll still have a 15/16 chance to detect a memory corruption. > > This also makes sense from the performance perspective: KASAN_GENERIC > isn't meant to be running in production, so having a larger perf > impact is acceptable. The other two modes will be faster. Sounds in principle sane to me. Side note: I am not sure if anybody runs KASAN in production. Memory is expensive. Feel free to prove me wrong, I'd be very interest in actual users.
On Thu, Feb 18, 2021 at 8:46 PM David Hildenbrand <david@redhat.com> wrote: > > > 1. For KASAN_GENERIC: leave everything as is to be able to detect > > these boot-time bugs. > > > > 2. For KASAN_SW_TAGS: remove boot-time poisoning via > > kasan_free_pages(), but use the "invalid" tag as the default shadow > > value. The end result should be the same: bad accesses will be > > detected. For unallocated memory as it has the default "invalid" tag, > > and for allocated memory as it's poisoned properly when > > allocated/freed. > > > > 3. For KASAN_HW_TAGS: just remove boot-time poisoning via > > kasan_free_pages(). As the memory tags have a random unspecified > > value, we'll still have a 15/16 chance to detect a memory corruption. > > > > This also makes sense from the performance perspective: KASAN_GENERIC > > isn't meant to be running in production, so having a larger perf > > impact is acceptable. The other two modes will be faster. > > Sounds in principle sane to me. I'll post a v2 soon, thanks! > Side note: I am not sure if anybody runs KASAN in production. Memory is > expensive. Feel free to prove me wrong, I'd be very interest in actual > users. We run KASAN_SW_TAGS on some dogfood testing devices, and KASAN_HW_TAGS is being developed with the goal to be running in production.
On 2/18/2021 3:55 AM, David Hildenbrand wrote: > On 17.02.21 21:56, Andrey Konovalov wrote: >> During boot, all non-reserved memblock memory is exposed to the buddy >> allocator. Poisoning all that memory with KASAN lengthens boot time, >> especially on systems with large amount of RAM. This patch makes >> page_alloc to not call kasan_free_pages() on all new memory. >> >> __free_pages_core() is used when exposing fresh memory during system >> boot and when onlining memory during hotplug. This patch adds a new >> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through >> free_pages_prepare() from __free_pages_core(). >> >> This has little impact on KASAN memory tracking. >> >> Assuming that there are no references to newly exposed pages before they >> are ever allocated, there won't be any intended (but buggy) accesses to >> that memory that KASAN would normally detect. >> >> However, with this patch, KASAN stops detecting wild and large >> out-of-bounds accesses that happen to land on a fresh memory page that >> was never allocated. This is taken as an acceptable trade-off. >> >> All memory allocated normally when the boot is over keeps getting >> poisoned as usual. >> >> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d > > Not sure this is the right thing to do, see > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com > > Reversing the order in which memory gets allocated + used during boot > (in a patch by me) might have revealed an invalid memory access during > boot. > > I suspect that that issue would no longer get detected with your > patch, as the invalid memory access would simply not get detected. > Now, I cannot prove that :) Since David's patch we're having trouble with the iBFT ACPI table, which is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN detects that it is being used after free when ibft_init() accesses the iBFT table, but as of yet we can't find where it get's freed (we've instrumented calls to kunmap()). Thank you, George
On Fri, Feb 19, 2021 at 1:06 AM George Kennedy <george.kennedy@oracle.com> wrote: > > > > On 2/18/2021 3:55 AM, David Hildenbrand wrote: > > On 17.02.21 21:56, Andrey Konovalov wrote: > >> During boot, all non-reserved memblock memory is exposed to the buddy > >> allocator. Poisoning all that memory with KASAN lengthens boot time, > >> especially on systems with large amount of RAM. This patch makes > >> page_alloc to not call kasan_free_pages() on all new memory. > >> > >> __free_pages_core() is used when exposing fresh memory during system > >> boot and when onlining memory during hotplug. This patch adds a new > >> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through > >> free_pages_prepare() from __free_pages_core(). > >> > >> This has little impact on KASAN memory tracking. > >> > >> Assuming that there are no references to newly exposed pages before they > >> are ever allocated, there won't be any intended (but buggy) accesses to > >> that memory that KASAN would normally detect. > >> > >> However, with this patch, KASAN stops detecting wild and large > >> out-of-bounds accesses that happen to land on a fresh memory page that > >> was never allocated. This is taken as an acceptable trade-off. > >> > >> All memory allocated normally when the boot is over keeps getting > >> poisoned as usual. > >> > >> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > >> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d > > > > Not sure this is the right thing to do, see > > > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com > > > > Reversing the order in which memory gets allocated + used during boot > > (in a patch by me) might have revealed an invalid memory access during > > boot. > > > > I suspect that that issue would no longer get detected with your > > patch, as the invalid memory access would simply not get detected. > > Now, I cannot prove that :) > > Since David's patch we're having trouble with the iBFT ACPI table, which > is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN > detects that it is being used after free when ibft_init() accesses the > iBFT table, but as of yet we can't find where it get's freed (we've > instrumented calls to kunmap()). Maybe it doesn't get freed, but what you see is a wild or a large out-of-bounds access. Since KASAN marks all memory as freed during the memblock->page_alloc transition, such bugs can manifest as use-after-frees.
On 2/18/2021 7:09 PM, Andrey Konovalov wrote: > On Fri, Feb 19, 2021 at 1:06 AM George Kennedy > <george.kennedy@oracle.com> wrote: >> >> >> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>> During boot, all non-reserved memblock memory is exposed to the buddy >>>> allocator. Poisoning all that memory with KASAN lengthens boot time, >>>> especially on systems with large amount of RAM. This patch makes >>>> page_alloc to not call kasan_free_pages() on all new memory. >>>> >>>> __free_pages_core() is used when exposing fresh memory during system >>>> boot and when onlining memory during hotplug. This patch adds a new >>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through >>>> free_pages_prepare() from __free_pages_core(). >>>> >>>> This has little impact on KASAN memory tracking. >>>> >>>> Assuming that there are no references to newly exposed pages before they >>>> are ever allocated, there won't be any intended (but buggy) accesses to >>>> that memory that KASAN would normally detect. >>>> >>>> However, with this patch, KASAN stops detecting wild and large >>>> out-of-bounds accesses that happen to land on a fresh memory page that >>>> was never allocated. This is taken as an acceptable trade-off. >>>> >>>> All memory allocated normally when the boot is over keeps getting >>>> poisoned as usual. >>>> >>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>> Not sure this is the right thing to do, see >>> >>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>> >>> Reversing the order in which memory gets allocated + used during boot >>> (in a patch by me) might have revealed an invalid memory access during >>> boot. >>> >>> I suspect that that issue would no longer get detected with your >>> patch, as the invalid memory access would simply not get detected. >>> Now, I cannot prove that :) >> Since David's patch we're having trouble with the iBFT ACPI table, which >> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN >> detects that it is being used after free when ibft_init() accesses the >> iBFT table, but as of yet we can't find where it get's freed (we've >> instrumented calls to kunmap()). > Maybe it doesn't get freed, but what you see is a wild or a large > out-of-bounds access. Since KASAN marks all memory as freed during the > memblock->page_alloc transition, such bugs can manifest as > use-after-frees. It gets freed and re-used. By the time the iBFT table is accessed by ibft_init() the page has been over-written. Setting page flags like the following before the call to kmap() prevents the iBFT table page from being freed: diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 0418feb..41c1bbd 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -287,9 +287,14 @@ static void __iomem *acpi_map(acpi_physical_address pg_off, unsigned long pg_sz) pfn = pg_off >> PAGE_SHIFT; if (should_use_kmap(pfn)) { + struct page *page = pfn_to_page(pfn); + if (pg_sz > PAGE_SIZE) return NULL; - return (void __iomem __force *)kmap(pfn_to_page(pfn)); + + page->flags |= ((1UL << PG_unevictable) | (1UL << PG_reserved) | (1UL << PG_locked)); + + return (void __iomem __force *)kmap(page); } else return acpi_os_ioremap(pg_off, pg_sz); } Just not sure of the correct way to set the page flags. George
On 2/19/2021 11:45 AM, George Kennedy wrote: > > > On 2/18/2021 7:09 PM, Andrey Konovalov wrote: >> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy >> <george.kennedy@oracle.com> wrote: >>> >>> >>> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>>> During boot, all non-reserved memblock memory is exposed to the buddy >>>>> allocator. Poisoning all that memory with KASAN lengthens boot time, >>>>> especially on systems with large amount of RAM. This patch makes >>>>> page_alloc to not call kasan_free_pages() on all new memory. >>>>> >>>>> __free_pages_core() is used when exposing fresh memory during system >>>>> boot and when onlining memory during hotplug. This patch adds a new >>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through >>>>> free_pages_prepare() from __free_pages_core(). >>>>> >>>>> This has little impact on KASAN memory tracking. >>>>> >>>>> Assuming that there are no references to newly exposed pages >>>>> before they >>>>> are ever allocated, there won't be any intended (but buggy) >>>>> accesses to >>>>> that memory that KASAN would normally detect. >>>>> >>>>> However, with this patch, KASAN stops detecting wild and large >>>>> out-of-bounds accesses that happen to land on a fresh memory page >>>>> that >>>>> was never allocated. This is taken as an acceptable trade-off. >>>>> >>>>> All memory allocated normally when the boot is over keeps getting >>>>> poisoned as usual. >>>>> >>>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>>> Not sure this is the right thing to do, see >>>> >>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>>> >>>> >>>> Reversing the order in which memory gets allocated + used during boot >>>> (in a patch by me) might have revealed an invalid memory access during >>>> boot. >>>> >>>> I suspect that that issue would no longer get detected with your >>>> patch, as the invalid memory access would simply not get detected. >>>> Now, I cannot prove that :) >>> Since David's patch we're having trouble with the iBFT ACPI table, >>> which >>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN >>> detects that it is being used after free when ibft_init() accesses the >>> iBFT table, but as of yet we can't find where it get's freed (we've >>> instrumented calls to kunmap()). >> Maybe it doesn't get freed, but what you see is a wild or a large >> out-of-bounds access. Since KASAN marks all memory as freed during the >> memblock->page_alloc transition, such bugs can manifest as >> use-after-frees. > > It gets freed and re-used. By the time the iBFT table is accessed by > ibft_init() the page has been over-written. > > Setting page flags like the following before the call to kmap() > prevents the iBFT table page from being freed: Cleaned up version: diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 0418feb..8f0a8e7 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address pg_off, unsigned long pg_sz) pfn = pg_off >> PAGE_SHIFT; if (should_use_kmap(pfn)) { + struct page *page = pfn_to_page(pfn); + if (pg_sz > PAGE_SIZE) return NULL; - return (void __iomem __force *)kmap(pfn_to_page(pfn)); + SetPageReserved(page); + return (void __iomem __force *)kmap(page); } else return acpi_os_ioremap(pg_off, pg_sz); } @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address pg_off, void __iomem *vaddr) unsigned long pfn; pfn = pg_off >> PAGE_SHIFT; - if (should_use_kmap(pfn)) - kunmap(pfn_to_page(pfn)); - else + if (should_use_kmap(pfn)) { + struct page *page = pfn_to_page(pfn); + + ClearPageReserved(page); + kunmap(page); + } else iounmap(vaddr); } David, the above works, but wondering why it is now necessary. kunmap() is not hit. What other ways could a page mapped via kmap() be unmapped? Thank you, George > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > index 0418feb..41c1bbd 100644 > --- a/drivers/acpi/osl.c > +++ b/drivers/acpi/osl.c > @@ -287,9 +287,14 @@ static void __iomem > *acpi_map(acpi_physical_address pg_off, unsigned long pg_sz) > > pfn = pg_off >> PAGE_SHIFT; > if (should_use_kmap(pfn)) { > + struct page *page = pfn_to_page(pfn); > + > if (pg_sz > PAGE_SIZE) > return NULL; > - return (void __iomem __force *)kmap(pfn_to_page(pfn)); > + > + page->flags |= ((1UL << PG_unevictable) | (1UL << > PG_reserved) | (1UL << PG_locked)); > + > + return (void __iomem __force *)kmap(page); > } else > return acpi_os_ioremap(pg_off, pg_sz); > } > > Just not sure of the correct way to set the page flags. > > George >
On 20.02.21 00:04, George Kennedy wrote: > > > On 2/19/2021 11:45 AM, George Kennedy wrote: >> >> >> On 2/18/2021 7:09 PM, Andrey Konovalov wrote: >>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy >>> <george.kennedy@oracle.com> wrote: >>>> >>>> >>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>>>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>>>> During boot, all non-reserved memblock memory is exposed to the buddy >>>>>> allocator. Poisoning all that memory with KASAN lengthens boot time, >>>>>> especially on systems with large amount of RAM. This patch makes >>>>>> page_alloc to not call kasan_free_pages() on all new memory. >>>>>> >>>>>> __free_pages_core() is used when exposing fresh memory during system >>>>>> boot and when onlining memory during hotplug. This patch adds a new >>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through >>>>>> free_pages_prepare() from __free_pages_core(). >>>>>> >>>>>> This has little impact on KASAN memory tracking. >>>>>> >>>>>> Assuming that there are no references to newly exposed pages >>>>>> before they >>>>>> are ever allocated, there won't be any intended (but buggy) >>>>>> accesses to >>>>>> that memory that KASAN would normally detect. >>>>>> >>>>>> However, with this patch, KASAN stops detecting wild and large >>>>>> out-of-bounds accesses that happen to land on a fresh memory page >>>>>> that >>>>>> was never allocated. This is taken as an acceptable trade-off. >>>>>> >>>>>> All memory allocated normally when the boot is over keeps getting >>>>>> poisoned as usual. >>>>>> >>>>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>>>> Not sure this is the right thing to do, see >>>>> >>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>>>> >>>>> >>>>> Reversing the order in which memory gets allocated + used during boot >>>>> (in a patch by me) might have revealed an invalid memory access during >>>>> boot. >>>>> >>>>> I suspect that that issue would no longer get detected with your >>>>> patch, as the invalid memory access would simply not get detected. >>>>> Now, I cannot prove that :) >>>> Since David's patch we're having trouble with the iBFT ACPI table, >>>> which >>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN >>>> detects that it is being used after free when ibft_init() accesses the >>>> iBFT table, but as of yet we can't find where it get's freed (we've >>>> instrumented calls to kunmap()). >>> Maybe it doesn't get freed, but what you see is a wild or a large >>> out-of-bounds access. Since KASAN marks all memory as freed during the >>> memblock->page_alloc transition, such bugs can manifest as >>> use-after-frees. >> >> It gets freed and re-used. By the time the iBFT table is accessed by >> ibft_init() the page has been over-written. >> >> Setting page flags like the following before the call to kmap() >> prevents the iBFT table page from being freed: > > Cleaned up version: > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > index 0418feb..8f0a8e7 100644 > --- a/drivers/acpi/osl.c > +++ b/drivers/acpi/osl.c > @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address > pg_off, unsigned long pg_sz) > > pfn = pg_off >> PAGE_SHIFT; > if (should_use_kmap(pfn)) { > + struct page *page = pfn_to_page(pfn); > + > if (pg_sz > PAGE_SIZE) > return NULL; > - return (void __iomem __force *)kmap(pfn_to_page(pfn)); > + SetPageReserved(page); > + return (void __iomem __force *)kmap(page); > } else > return acpi_os_ioremap(pg_off, pg_sz); > } > @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address > pg_off, void __iomem *vaddr) > unsigned long pfn; > > pfn = pg_off >> PAGE_SHIFT; > - if (should_use_kmap(pfn)) > - kunmap(pfn_to_page(pfn)); > - else > + if (should_use_kmap(pfn)) { > + struct page *page = pfn_to_page(pfn); > + > + ClearPageReserved(page); > + kunmap(page); > + } else > iounmap(vaddr); > } > > David, the above works, but wondering why it is now necessary. kunmap() > is not hit. What other ways could a page mapped via kmap() be unmapped? > Let me look into the code ... I have little experience with ACPI details, so bear with me. I assume that acpi_map()/acpi_unmap() map some firmware blob that is provided via firmware/bios/... to us. should_use_kmap() tells us whether a) we have a "struct page" and should kmap() that one b) we don't have a "struct page" and should ioremap. As it is a blob, the firmware should always reserve that memory region via memblock (e.g., memblock_reserve()), such that we either 1) don't create a memmap ("struct page") at all (-> case b) ) 2) if we have to create e memmap, we mark the page PG_reserved and *never* expose it to the buddy (-> case a) ) Are you telling me that in this case we might have a memmap for the HW blob that is *not* PG_reserved? In that case it most probably got exposed to the buddy where it can happily get allocated/freed. The latent BUG would be that that blob gets exposed to the system like ordinary RAM, and not reserved via memblock early during boot. Assuming that blob has a low physical address, with my patch it will get allocated/used a lot earlier - which would mean we trigger this latent BUG now more easily. There have been similar latent BUGs on ARM boards that my patch discovered where special RAM regions did not get marked as reserved via the device tree properly. Now, this is just a wild guess :) Can you dump the page when mapping (before PageReserved()) and when unmapping, to see what the state of that memmap is?
On 2/22/2021 4:52 AM, David Hildenbrand wrote: > On 20.02.21 00:04, George Kennedy wrote: >> >> >> On 2/19/2021 11:45 AM, George Kennedy wrote: >>> >>> >>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote: >>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy >>>> <george.kennedy@oracle.com> wrote: >>>>> >>>>> >>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>>>>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>>>>> During boot, all non-reserved memblock memory is exposed to the >>>>>>> buddy >>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot >>>>>>> time, >>>>>>> especially on systems with large amount of RAM. This patch makes >>>>>>> page_alloc to not call kasan_free_pages() on all new memory. >>>>>>> >>>>>>> __free_pages_core() is used when exposing fresh memory during >>>>>>> system >>>>>>> boot and when onlining memory during hotplug. This patch adds a new >>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() >>>>>>> through >>>>>>> free_pages_prepare() from __free_pages_core(). >>>>>>> >>>>>>> This has little impact on KASAN memory tracking. >>>>>>> >>>>>>> Assuming that there are no references to newly exposed pages >>>>>>> before they >>>>>>> are ever allocated, there won't be any intended (but buggy) >>>>>>> accesses to >>>>>>> that memory that KASAN would normally detect. >>>>>>> >>>>>>> However, with this patch, KASAN stops detecting wild and large >>>>>>> out-of-bounds accesses that happen to land on a fresh memory page >>>>>>> that >>>>>>> was never allocated. This is taken as an acceptable trade-off. >>>>>>> >>>>>>> All memory allocated normally when the boot is over keeps getting >>>>>>> poisoned as usual. >>>>>>> >>>>>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>>>>> Not sure this is the right thing to do, see >>>>>> >>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>>>>> >>>>>> >>>>>> >>>>>> Reversing the order in which memory gets allocated + used during >>>>>> boot >>>>>> (in a patch by me) might have revealed an invalid memory access >>>>>> during >>>>>> boot. >>>>>> >>>>>> I suspect that that issue would no longer get detected with your >>>>>> patch, as the invalid memory access would simply not get detected. >>>>>> Now, I cannot prove that :) >>>>> Since David's patch we're having trouble with the iBFT ACPI table, >>>>> which >>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". >>>>> KASAN >>>>> detects that it is being used after free when ibft_init() accesses >>>>> the >>>>> iBFT table, but as of yet we can't find where it get's freed (we've >>>>> instrumented calls to kunmap()). >>>> Maybe it doesn't get freed, but what you see is a wild or a large >>>> out-of-bounds access. Since KASAN marks all memory as freed during the >>>> memblock->page_alloc transition, such bugs can manifest as >>>> use-after-frees. >>> >>> It gets freed and re-used. By the time the iBFT table is accessed by >>> ibft_init() the page has been over-written. >>> >>> Setting page flags like the following before the call to kmap() >>> prevents the iBFT table page from being freed: >> >> Cleaned up version: >> >> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c >> index 0418feb..8f0a8e7 100644 >> --- a/drivers/acpi/osl.c >> +++ b/drivers/acpi/osl.c >> @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address >> pg_off, unsigned long pg_sz) >> >> pfn = pg_off >> PAGE_SHIFT; >> if (should_use_kmap(pfn)) { >> + struct page *page = pfn_to_page(pfn); >> + >> if (pg_sz > PAGE_SIZE) >> return NULL; >> - return (void __iomem __force *)kmap(pfn_to_page(pfn)); >> + SetPageReserved(page); >> + return (void __iomem __force *)kmap(page); >> } else >> return acpi_os_ioremap(pg_off, pg_sz); >> } >> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address >> pg_off, void __iomem *vaddr) >> unsigned long pfn; >> >> pfn = pg_off >> PAGE_SHIFT; >> - if (should_use_kmap(pfn)) >> - kunmap(pfn_to_page(pfn)); >> - else >> + if (should_use_kmap(pfn)) { >> + struct page *page = pfn_to_page(pfn); >> + >> + ClearPageReserved(page); >> + kunmap(page); >> + } else >> iounmap(vaddr); >> } >> >> David, the above works, but wondering why it is now necessary. kunmap() >> is not hit. What other ways could a page mapped via kmap() be unmapped? >> > > Let me look into the code ... I have little experience with ACPI > details, so bear with me. > > I assume that acpi_map()/acpi_unmap() map some firmware blob that is > provided via firmware/bios/... to us. > > should_use_kmap() tells us whether > a) we have a "struct page" and should kmap() that one > b) we don't have a "struct page" and should ioremap. > > As it is a blob, the firmware should always reserve that memory region > via memblock (e.g., memblock_reserve()), such that we either > 1) don't create a memmap ("struct page") at all (-> case b) ) > 2) if we have to create e memmap, we mark the page PG_reserved and > *never* expose it to the buddy (-> case a) ) > > > Are you telling me that in this case we might have a memmap for the HW > blob that is *not* PG_reserved? In that case it most probably got > exposed to the buddy where it can happily get allocated/freed. > > The latent BUG would be that that blob gets exposed to the system like > ordinary RAM, and not reserved via memblock early during boot. > Assuming that blob has a low physical address, with my patch it will > get allocated/used a lot earlier - which would mean we trigger this > latent BUG now more easily. > > There have been similar latent BUGs on ARM boards that my patch > discovered where special RAM regions did not get marked as reserved > via the device tree properly. > > Now, this is just a wild guess :) Can you dump the page when mapping > (before PageReserved()) and when unmapping, to see what the state of > that memmap is? Thank you David for the explanation and your help on this, dump_page() before PageReserved and before kmap() in the above patch: [ 1.116480] ACPI: Core revision 20201113 [ 1.117628] XXX acpi_map: about to call kmap()... [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbe453 [ 1.120381] flags: 0xfffffc0000000() [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 0000000000000000 [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 1.124146] page dumped because: acpi_map pre SetPageReserved I also added dump_page() before unmapping, but it is not hit. The following for the same pfn now shows up I believe as a result of setting PageReserved: [ 28.098208] BUG: Bad page state in process modprobe pfn:be453 [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0xbe453 [ 28.098394] flags: 0xfffffc0001000(reserved) [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 0000000000000000 [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set [ 28.098394] page_owner info is not present (never set?) [ 28.098394] Modules linked in: [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66 [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 28.098394] Call Trace: [ 28.098394] dump_stack+0xdb/0x120 [ 28.098394] bad_page.cold.108+0xc6/0xcb [ 28.098394] check_new_page_bad+0x47/0xa0 [ 28.098394] get_page_from_freelist+0x30cd/0x5730 [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 [ 28.098394] ? init_object+0x7e/0x90 [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 [ 28.098394] ? write_comp_data+0x2f/0x90 [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 28.098394] alloc_pages_vma+0xe2/0x560 [ 28.098394] do_fault+0x194/0x12c0 [ 28.098394] ? write_comp_data+0x2f/0x90 [ 28.098394] __handle_mm_fault+0x1650/0x26c0 [ 28.098394] ? copy_page_range+0x1350/0x1350 [ 28.098394] ? write_comp_data+0x2f/0x90 [ 28.098394] ? write_comp_data+0x2f/0x90 [ 28.098394] handle_mm_fault+0x1f9/0x810 [ 28.098394] ? write_comp_data+0x2f/0x90 [ 28.098394] do_user_addr_fault+0x6f7/0xca0 [ 28.098394] exc_page_fault+0xaf/0x1a0 [ 28.098394] asm_exc_page_fault+0x1e/0x30 [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 What would be the correct way to reserve the page so that the above would not be hit? BTW, this is running with Konrad's patch that pairs acpi_get_table & acpi_put_table for the iBFT table which should result in an eventual call to acpi_unmap() and kunmap(), though that does not occur. Could be a possible acpi page refcount issue that will have to be looked into. George
On 22.02.21 16:13, George Kennedy wrote: > > > On 2/22/2021 4:52 AM, David Hildenbrand wrote: >> On 20.02.21 00:04, George Kennedy wrote: >>> >>> >>> On 2/19/2021 11:45 AM, George Kennedy wrote: >>>> >>>> >>>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote: >>>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy >>>>> <george.kennedy@oracle.com> wrote: >>>>>> >>>>>> >>>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>>>>>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>>>>>> During boot, all non-reserved memblock memory is exposed to the >>>>>>>> buddy >>>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot >>>>>>>> time, >>>>>>>> especially on systems with large amount of RAM. This patch makes >>>>>>>> page_alloc to not call kasan_free_pages() on all new memory. >>>>>>>> >>>>>>>> __free_pages_core() is used when exposing fresh memory during >>>>>>>> system >>>>>>>> boot and when onlining memory during hotplug. This patch adds a new >>>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() >>>>>>>> through >>>>>>>> free_pages_prepare() from __free_pages_core(). >>>>>>>> >>>>>>>> This has little impact on KASAN memory tracking. >>>>>>>> >>>>>>>> Assuming that there are no references to newly exposed pages >>>>>>>> before they >>>>>>>> are ever allocated, there won't be any intended (but buggy) >>>>>>>> accesses to >>>>>>>> that memory that KASAN would normally detect. >>>>>>>> >>>>>>>> However, with this patch, KASAN stops detecting wild and large >>>>>>>> out-of-bounds accesses that happen to land on a fresh memory page >>>>>>>> that >>>>>>>> was never allocated. This is taken as an acceptable trade-off. >>>>>>>> >>>>>>>> All memory allocated normally when the boot is over keeps getting >>>>>>>> poisoned as usual. >>>>>>>> >>>>>>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>>>>>> Not sure this is the right thing to do, see >>>>>>> >>>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>>>>>> >>>>>>> >>>>>>> >>>>>>> Reversing the order in which memory gets allocated + used during >>>>>>> boot >>>>>>> (in a patch by me) might have revealed an invalid memory access >>>>>>> during >>>>>>> boot. >>>>>>> >>>>>>> I suspect that that issue would no longer get detected with your >>>>>>> patch, as the invalid memory access would simply not get detected. >>>>>>> Now, I cannot prove that :) >>>>>> Since David's patch we're having trouble with the iBFT ACPI table, >>>>>> which >>>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". >>>>>> KASAN >>>>>> detects that it is being used after free when ibft_init() accesses >>>>>> the >>>>>> iBFT table, but as of yet we can't find where it get's freed (we've >>>>>> instrumented calls to kunmap()). >>>>> Maybe it doesn't get freed, but what you see is a wild or a large >>>>> out-of-bounds access. Since KASAN marks all memory as freed during the >>>>> memblock->page_alloc transition, such bugs can manifest as >>>>> use-after-frees. >>>> >>>> It gets freed and re-used. By the time the iBFT table is accessed by >>>> ibft_init() the page has been over-written. >>>> >>>> Setting page flags like the following before the call to kmap() >>>> prevents the iBFT table page from being freed: >>> >>> Cleaned up version: >>> >>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c >>> index 0418feb..8f0a8e7 100644 >>> --- a/drivers/acpi/osl.c >>> +++ b/drivers/acpi/osl.c >>> @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address >>> pg_off, unsigned long pg_sz) >>> >>> pfn = pg_off >> PAGE_SHIFT; >>> if (should_use_kmap(pfn)) { >>> + struct page *page = pfn_to_page(pfn); >>> + >>> if (pg_sz > PAGE_SIZE) >>> return NULL; >>> - return (void __iomem __force *)kmap(pfn_to_page(pfn)); >>> + SetPageReserved(page); >>> + return (void __iomem __force *)kmap(page); >>> } else >>> return acpi_os_ioremap(pg_off, pg_sz); >>> } >>> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address >>> pg_off, void __iomem *vaddr) >>> unsigned long pfn; >>> >>> pfn = pg_off >> PAGE_SHIFT; >>> - if (should_use_kmap(pfn)) >>> - kunmap(pfn_to_page(pfn)); >>> - else >>> + if (should_use_kmap(pfn)) { >>> + struct page *page = pfn_to_page(pfn); >>> + >>> + ClearPageReserved(page); >>> + kunmap(page); >>> + } else >>> iounmap(vaddr); >>> } >>> >>> David, the above works, but wondering why it is now necessary. kunmap() >>> is not hit. What other ways could a page mapped via kmap() be unmapped? >>> >> >> Let me look into the code ... I have little experience with ACPI >> details, so bear with me. >> >> I assume that acpi_map()/acpi_unmap() map some firmware blob that is >> provided via firmware/bios/... to us. >> >> should_use_kmap() tells us whether >> a) we have a "struct page" and should kmap() that one >> b) we don't have a "struct page" and should ioremap. >> >> As it is a blob, the firmware should always reserve that memory region >> via memblock (e.g., memblock_reserve()), such that we either >> 1) don't create a memmap ("struct page") at all (-> case b) ) >> 2) if we have to create e memmap, we mark the page PG_reserved and >> *never* expose it to the buddy (-> case a) ) >> >> >> Are you telling me that in this case we might have a memmap for the HW >> blob that is *not* PG_reserved? In that case it most probably got >> exposed to the buddy where it can happily get allocated/freed. >> >> The latent BUG would be that that blob gets exposed to the system like >> ordinary RAM, and not reserved via memblock early during boot. >> Assuming that blob has a low physical address, with my patch it will >> get allocated/used a lot earlier - which would mean we trigger this >> latent BUG now more easily. >> >> There have been similar latent BUGs on ARM boards that my patch >> discovered where special RAM regions did not get marked as reserved >> via the device tree properly. >> >> Now, this is just a wild guess :) Can you dump the page when mapping >> (before PageReserved()) and when unmapping, to see what the state of >> that memmap is? > > Thank you David for the explanation and your help on this, > > dump_page() before PageReserved and before kmap() in the above patch: > > [ 1.116480] ACPI: Core revision 20201113 > [ 1.117628] XXX acpi_map: about to call kmap()... > [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 > mapping:0000000000000000 index:0x0 pfn:0xbe453 > [ 1.120381] flags: 0xfffffc0000000() > [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 > 0000000000000000 > [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff > 0000000000000000 > [ 1.124146] page dumped because: acpi_map pre SetPageReserved > > I also added dump_page() before unmapping, but it is not hit. The > following for the same pfn now shows up I believe as a result of setting > PageReserved: > > [ 28.098208] BUG:Bad page state in process mo dprobe pfn:be453 > [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 > mapping:0000000000000000 index:0x1 pfn:0xbe453 > [ 28.098394] flags: 0xfffffc0001000(reserved) > [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 > 0000000000000000 > [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff > 0000000000000000 > [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set > [ 28.098394] page_owner info is not present (never set?) > [ 28.098394] Modules linked in: > [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66 > [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 0.0.0 02/06/2015 > [ 28.098394] Call Trace: > [ 28.098394] dump_stack+0xdb/0x120 > [ 28.098394] bad_page.cold.108+0xc6/0xcb > [ 28.098394] check_new_page_bad+0x47/0xa0 > [ 28.098394] get_page_from_freelist+0x30cd/0x5730 > [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 > [ 28.098394] ? init_object+0x7e/0x90 > [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 > [ 28.098394] ? write_comp_data+0x2f/0x90 > [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 > [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 > [ 28.098394] alloc_pages_vma+0xe2/0x560 > [ 28.098394] do_fault+0x194/0x12c0 > [ 28.098394] ? write_comp_data+0x2f/0x90 > [ 28.098394] __handle_mm_fault+0x1650/0x26c0 > [ 28.098394] ? copy_page_range+0x1350/0x1350 > [ 28.098394] ? write_comp_data+0x2f/0x90 > [ 28.098394] ? write_comp_data+0x2f/0x90 > [ 28.098394] handle_mm_fault+0x1f9/0x810 > [ 28.098394] ? write_comp_data+0x2f/0x90 > [ 28.098394] do_user_addr_fault+0x6f7/0xca0 > [ 28.098394] exc_page_fault+0xaf/0x1a0 > [ 28.098394] asm_exc_page_fault+0x1e/0x30 > [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that someone is trying to allocate that page with the PG_reserved bit set. This means that the page actually was exposed to the buddy. However, when you SetPageReserved(), I don't think that PG_buddy is set and the refcount is 0. That could indicate that the page is on the buddy PCP list. Could be that it is getting reused a couple of times. The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables close to 3 GiB ? No idea. Could it be that you are trying to map a wrong table? Just a guess. > > What would be the correct way to reserve the page so that the above > would not be hit? I would have assumed that if this is a binary blob, that someone (which I think would be acpi code) reserved via memblock_reserve() early during boot. E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve().
On 22.02.21 17:13, David Hildenbrand wrote: > On 22.02.21 16:13, George Kennedy wrote: >> >> >> On 2/22/2021 4:52 AM, David Hildenbrand wrote: >>> On 20.02.21 00:04, George Kennedy wrote: >>>> >>>> >>>> On 2/19/2021 11:45 AM, George Kennedy wrote: >>>>> >>>>> >>>>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote: >>>>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy >>>>>> <george.kennedy@oracle.com> wrote: >>>>>>> >>>>>>> >>>>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>>>>>>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>>>>>>> During boot, all non-reserved memblock memory is exposed to the >>>>>>>>> buddy >>>>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot >>>>>>>>> time, >>>>>>>>> especially on systems with large amount of RAM. This patch makes >>>>>>>>> page_alloc to not call kasan_free_pages() on all new memory. >>>>>>>>> >>>>>>>>> __free_pages_core() is used when exposing fresh memory during >>>>>>>>> system >>>>>>>>> boot and when onlining memory during hotplug. This patch adds a new >>>>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() >>>>>>>>> through >>>>>>>>> free_pages_prepare() from __free_pages_core(). >>>>>>>>> >>>>>>>>> This has little impact on KASAN memory tracking. >>>>>>>>> >>>>>>>>> Assuming that there are no references to newly exposed pages >>>>>>>>> before they >>>>>>>>> are ever allocated, there won't be any intended (but buggy) >>>>>>>>> accesses to >>>>>>>>> that memory that KASAN would normally detect. >>>>>>>>> >>>>>>>>> However, with this patch, KASAN stops detecting wild and large >>>>>>>>> out-of-bounds accesses that happen to land on a fresh memory page >>>>>>>>> that >>>>>>>>> was never allocated. This is taken as an acceptable trade-off. >>>>>>>>> >>>>>>>>> All memory allocated normally when the boot is over keeps getting >>>>>>>>> poisoned as usual. >>>>>>>>> >>>>>>>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>>>>>>> Not sure this is the right thing to do, see >>>>>>>> >>>>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Reversing the order in which memory gets allocated + used during >>>>>>>> boot >>>>>>>> (in a patch by me) might have revealed an invalid memory access >>>>>>>> during >>>>>>>> boot. >>>>>>>> >>>>>>>> I suspect that that issue would no longer get detected with your >>>>>>>> patch, as the invalid memory access would simply not get detected. >>>>>>>> Now, I cannot prove that :) >>>>>>> Since David's patch we're having trouble with the iBFT ACPI table, >>>>>>> which >>>>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". >>>>>>> KASAN >>>>>>> detects that it is being used after free when ibft_init() accesses >>>>>>> the >>>>>>> iBFT table, but as of yet we can't find where it get's freed (we've >>>>>>> instrumented calls to kunmap()). >>>>>> Maybe it doesn't get freed, but what you see is a wild or a large >>>>>> out-of-bounds access. Since KASAN marks all memory as freed during the >>>>>> memblock->page_alloc transition, such bugs can manifest as >>>>>> use-after-frees. >>>>> >>>>> It gets freed and re-used. By the time the iBFT table is accessed by >>>>> ibft_init() the page has been over-written. >>>>> >>>>> Setting page flags like the following before the call to kmap() >>>>> prevents the iBFT table page from being freed: >>>> >>>> Cleaned up version: >>>> >>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c >>>> index 0418feb..8f0a8e7 100644 >>>> --- a/drivers/acpi/osl.c >>>> +++ b/drivers/acpi/osl.c >>>> @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address >>>> pg_off, unsigned long pg_sz) >>>> >>>> pfn = pg_off >> PAGE_SHIFT; >>>> if (should_use_kmap(pfn)) { >>>> + struct page *page = pfn_to_page(pfn); >>>> + >>>> if (pg_sz > PAGE_SIZE) >>>> return NULL; >>>> - return (void __iomem __force *)kmap(pfn_to_page(pfn)); >>>> + SetPageReserved(page); >>>> + return (void __iomem __force *)kmap(page); >>>> } else >>>> return acpi_os_ioremap(pg_off, pg_sz); >>>> } >>>> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address >>>> pg_off, void __iomem *vaddr) >>>> unsigned long pfn; >>>> >>>> pfn = pg_off >> PAGE_SHIFT; >>>> - if (should_use_kmap(pfn)) >>>> - kunmap(pfn_to_page(pfn)); >>>> - else >>>> + if (should_use_kmap(pfn)) { >>>> + struct page *page = pfn_to_page(pfn); >>>> + >>>> + ClearPageReserved(page); >>>> + kunmap(page); >>>> + } else >>>> iounmap(vaddr); >>>> } >>>> >>>> David, the above works, but wondering why it is now necessary. kunmap() >>>> is not hit. What other ways could a page mapped via kmap() be unmapped? >>>> >>> >>> Let me look into the code ... I have little experience with ACPI >>> details, so bear with me. >>> >>> I assume that acpi_map()/acpi_unmap() map some firmware blob that is >>> provided via firmware/bios/... to us. >>> >>> should_use_kmap() tells us whether >>> a) we have a "struct page" and should kmap() that one >>> b) we don't have a "struct page" and should ioremap. >>> >>> As it is a blob, the firmware should always reserve that memory region >>> via memblock (e.g., memblock_reserve()), such that we either >>> 1) don't create a memmap ("struct page") at all (-> case b) ) >>> 2) if we have to create e memmap, we mark the page PG_reserved and >>> *never* expose it to the buddy (-> case a) ) >>> >>> >>> Are you telling me that in this case we might have a memmap for the HW >>> blob that is *not* PG_reserved? In that case it most probably got >>> exposed to the buddy where it can happily get allocated/freed. >>> >>> The latent BUG would be that that blob gets exposed to the system like >>> ordinary RAM, and not reserved via memblock early during boot. >>> Assuming that blob has a low physical address, with my patch it will >>> get allocated/used a lot earlier - which would mean we trigger this >>> latent BUG now more easily. >>> >>> There have been similar latent BUGs on ARM boards that my patch >>> discovered where special RAM regions did not get marked as reserved >>> via the device tree properly. >>> >>> Now, this is just a wild guess :) Can you dump the page when mapping >>> (before PageReserved()) and when unmapping, to see what the state of >>> that memmap is? >> >> Thank you David for the explanation and your help on this, >> >> dump_page() before PageReserved and before kmap() in the above patch: >> >> [ 1.116480] ACPI: Core revision 20201113 >> [ 1.117628] XXX acpi_map: about to call kmap()... >> [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x0 pfn:0xbe453 >> [ 1.120381] flags: 0xfffffc0000000() >> [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 >> 0000000000000000 >> [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff >> 0000000000000000 >> [ 1.124146] page dumped because: acpi_map pre SetPageReserved >> >> I also added dump_page() before unmapping, but it is not hit. The >> following for the same pfn now shows up I believe as a result of setting >> PageReserved: >> >> [ 28.098208] BUG:Bad page state in process mo dprobe pfn:be453 >> [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x1 pfn:0xbe453 >> [ 28.098394] flags: 0xfffffc0001000(reserved) >> [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 >> 0000000000000000 >> [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff >> 0000000000000000 >> [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set >> [ 28.098394] page_owner info is not present (never set?) >> [ 28.098394] Modules linked in: >> [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66 >> [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >> BIOS 0.0.0 02/06/2015 >> [ 28.098394] Call Trace: >> [ 28.098394] dump_stack+0xdb/0x120 >> [ 28.098394] bad_page.cold.108+0xc6/0xcb >> [ 28.098394] check_new_page_bad+0x47/0xa0 >> [ 28.098394] get_page_from_freelist+0x30cd/0x5730 >> [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 >> [ 28.098394] ? init_object+0x7e/0x90 >> [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 >> [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 >> [ 28.098394] alloc_pages_vma+0xe2/0x560 >> [ 28.098394] do_fault+0x194/0x12c0 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] __handle_mm_fault+0x1650/0x26c0 >> [ 28.098394] ? copy_page_range+0x1350/0x1350 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] handle_mm_fault+0x1f9/0x810 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] do_user_addr_fault+0x6f7/0xca0 >> [ 28.098394] exc_page_fault+0xaf/0x1a0 >> [ 28.098394] asm_exc_page_fault+0x1e/0x30 >> [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 > > I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that > someone is trying to allocate that page with the PG_reserved bit set. > This means that the page actually was exposed to the buddy. > > However, when you SetPageReserved(), I don't think that PG_buddy is set > and the refcount is 0. That could indicate that the page is on the buddy > PCP list. Could be that it is getting reused a couple of times. > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong > table? Just a guess. ... but I assume ibft_check_device() would bail out on an invalid checksum. So the question is, why is this page not properly marked as reserved already.
On Mon, Feb 22, 2021 at 05:39:29PM +0100, David Hildenbrand wrote: > On 22.02.21 17:13, David Hildenbrand wrote: > > On 22.02.21 16:13, George Kennedy wrote: > > > > > > > > > On 2/22/2021 4:52 AM, David Hildenbrand wrote: > > > > On 20.02.21 00:04, George Kennedy wrote: > > > > > > > > > > > > > > > On 2/19/2021 11:45 AM, George Kennedy wrote: > > > > > > > > > > > > > > > > > > On 2/18/2021 7:09 PM, Andrey Konovalov wrote: > > > > > > > On Fri, Feb 19, 2021 at 1:06 AM George Kennedy > > > > > > > <george.kennedy@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 2/18/2021 3:55 AM, David Hildenbrand wrote: > > > > > > > > > On 17.02.21 21:56, Andrey Konovalov wrote: > > > > > > > > > > During boot, all non-reserved memblock memory is exposed to the > > > > > > > > > > buddy > > > > > > > > > > allocator. Poisoning all that memory with KASAN lengthens boot > > > > > > > > > > time, > > > > > > > > > > especially on systems with large amount of RAM. This patch makes > > > > > > > > > > page_alloc to not call kasan_free_pages() on all new memory. > > > > > > > > > > > > > > > > > > > > __free_pages_core() is used when exposing fresh memory during > > > > > > > > > > system > > > > > > > > > > boot and when onlining memory during hotplug. This patch adds a new > > > > > > > > > > FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() > > > > > > > > > > through > > > > > > > > > > free_pages_prepare() from __free_pages_core(). > > > > > > > > > > > > > > > > > > > > This has little impact on KASAN memory tracking. > > > > > > > > > > > > > > > > > > > > Assuming that there are no references to newly exposed pages > > > > > > > > > > before they > > > > > > > > > > are ever allocated, there won't be any intended (but buggy) > > > > > > > > > > accesses to > > > > > > > > > > that memory that KASAN would normally detect. > > > > > > > > > > > > > > > > > > > > However, with this patch, KASAN stops detecting wild and large > > > > > > > > > > out-of-bounds accesses that happen to land on a fresh memory page > > > > > > > > > > that > > > > > > > > > > was never allocated. This is taken as an acceptable trade-off. > > > > > > > > > > > > > > > > > > > > All memory allocated normally when the boot is over keeps getting > > > > > > > > > > poisoned as usual. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > > > > > > > > > > Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d > > > > > > > > > Not sure this is the right thing to do, see > > > > > > > > > > > > > > > > > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Reversing the order in which memory gets allocated + used during > > > > > > > > > boot > > > > > > > > > (in a patch by me) might have revealed an invalid memory access > > > > > > > > > during > > > > > > > > > boot. > > > > > > > > > > > > > > > > > > I suspect that that issue would no longer get detected with your > > > > > > > > > patch, as the invalid memory access would simply not get detected. > > > > > > > > > Now, I cannot prove that :) > > > > > > > > Since David's patch we're having trouble with the iBFT ACPI table, > > > > > > > > which > > > > > > > > is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". > > > > > > > > KASAN > > > > > > > > detects that it is being used after free when ibft_init() accesses > > > > > > > > the > > > > > > > > iBFT table, but as of yet we can't find where it get's freed (we've > > > > > > > > instrumented calls to kunmap()). > > > > > > > Maybe it doesn't get freed, but what you see is a wild or a large > > > > > > > out-of-bounds access. Since KASAN marks all memory as freed during the > > > > > > > memblock->page_alloc transition, such bugs can manifest as > > > > > > > use-after-frees. > > > > > > > > > > > > It gets freed and re-used. By the time the iBFT table is accessed by > > > > > > ibft_init() the page has been over-written. > > > > > > > > > > > > Setting page flags like the following before the call to kmap() > > > > > > prevents the iBFT table page from being freed: > > > > > > > > > > Cleaned up version: > > > > > > > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > > > > > index 0418feb..8f0a8e7 100644 > > > > > --- a/drivers/acpi/osl.c > > > > > +++ b/drivers/acpi/osl.c > > > > > @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address > > > > > pg_off, unsigned long pg_sz) > > > > > > > > > > pfn = pg_off >> PAGE_SHIFT; > > > > > if (should_use_kmap(pfn)) { > > > > > + struct page *page = pfn_to_page(pfn); > > > > > + > > > > > if (pg_sz > PAGE_SIZE) > > > > > return NULL; > > > > > - return (void __iomem __force *)kmap(pfn_to_page(pfn)); > > > > > + SetPageReserved(page); > > > > > + return (void __iomem __force *)kmap(page); > > > > > } else > > > > > return acpi_os_ioremap(pg_off, pg_sz); > > > > > } > > > > > @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address > > > > > pg_off, void __iomem *vaddr) > > > > > unsigned long pfn; > > > > > > > > > > pfn = pg_off >> PAGE_SHIFT; > > > > > - if (should_use_kmap(pfn)) > > > > > - kunmap(pfn_to_page(pfn)); > > > > > - else > > > > > + if (should_use_kmap(pfn)) { > > > > > + struct page *page = pfn_to_page(pfn); > > > > > + > > > > > + ClearPageReserved(page); > > > > > + kunmap(page); > > > > > + } else > > > > > iounmap(vaddr); > > > > > } > > > > > > > > > > David, the above works, but wondering why it is now necessary. kunmap() > > > > > is not hit. What other ways could a page mapped via kmap() be unmapped? > > > > > > > > > > > > > Let me look into the code ... I have little experience with ACPI > > > > details, so bear with me. > > > > > > > > I assume that acpi_map()/acpi_unmap() map some firmware blob that is > > > > provided via firmware/bios/... to us. > > > > > > > > should_use_kmap() tells us whether > > > > a) we have a "struct page" and should kmap() that one > > > > b) we don't have a "struct page" and should ioremap. > > > > > > > > As it is a blob, the firmware should always reserve that memory region > > > > via memblock (e.g., memblock_reserve()), such that we either > > > > 1) don't create a memmap ("struct page") at all (-> case b) ) > > > > 2) if we have to create e memmap, we mark the page PG_reserved and > > > > *never* expose it to the buddy (-> case a) ) > > > > > > > > > > > > Are you telling me that in this case we might have a memmap for the HW > > > > blob that is *not* PG_reserved? In that case it most probably got > > > > exposed to the buddy where it can happily get allocated/freed. > > > > > > > > The latent BUG would be that that blob gets exposed to the system like > > > > ordinary RAM, and not reserved via memblock early during boot. > > > > Assuming that blob has a low physical address, with my patch it will > > > > get allocated/used a lot earlier - which would mean we trigger this > > > > latent BUG now more easily. > > > > > > > > There have been similar latent BUGs on ARM boards that my patch > > > > discovered where special RAM regions did not get marked as reserved > > > > via the device tree properly. > > > > > > > > Now, this is just a wild guess :) Can you dump the page when mapping > > > > (before PageReserved()) and when unmapping, to see what the state of > > > > that memmap is? > > > > > > Thank you David for the explanation and your help on this, > > > > > > dump_page() before PageReserved and before kmap() in the above patch: > > > > > > [ 1.116480] ACPI: Core revision 20201113 > > > [ 1.117628] XXX acpi_map: about to call kmap()... > > > [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 > > > mapping:0000000000000000 index:0x0 pfn:0xbe453 > > > [ 1.120381] flags: 0xfffffc0000000() > > > [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 > > > 0000000000000000 > > > [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff > > > 0000000000000000 > > > [ 1.124146] page dumped because: acpi_map pre SetPageReserved > > > > > > I also added dump_page() before unmapping, but it is not hit. The > > > following for the same pfn now shows up I believe as a result of setting > > > PageReserved: > > > > > > [ 28.098208] BUG:Bad page state in process mo dprobe pfn:be453 > > > [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 > > > mapping:0000000000000000 index:0x1 pfn:0xbe453 > > > [ 28.098394] flags: 0xfffffc0001000(reserved) > > > [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 > > > 0000000000000000 > > > [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff > > > 0000000000000000 > > > [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set > > > [ 28.098394] page_owner info is not present (never set?) > > > [ 28.098394] Modules linked in: > > > [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66 > > > [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > BIOS 0.0.0 02/06/2015 > > > [ 28.098394] Call Trace: > > > [ 28.098394] dump_stack+0xdb/0x120 > > > [ 28.098394] bad_page.cold.108+0xc6/0xcb > > > [ 28.098394] check_new_page_bad+0x47/0xa0 > > > [ 28.098394] get_page_from_freelist+0x30cd/0x5730 > > > [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 > > > [ 28.098394] ? init_object+0x7e/0x90 > > > [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 > > > [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 > > > [ 28.098394] alloc_pages_vma+0xe2/0x560 > > > [ 28.098394] do_fault+0x194/0x12c0 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] __handle_mm_fault+0x1650/0x26c0 > > > [ 28.098394] ? copy_page_range+0x1350/0x1350 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] handle_mm_fault+0x1f9/0x810 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] do_user_addr_fault+0x6f7/0xca0 > > > [ 28.098394] exc_page_fault+0xaf/0x1a0 > > > [ 28.098394] asm_exc_page_fault+0x1e/0x30 > > > [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 > > > > I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that > > someone is trying to allocate that page with the PG_reserved bit set. > > This means that the page actually was exposed to the buddy. > > > > However, when you SetPageReserved(), I don't think that PG_buddy is set > > and the refcount is 0. That could indicate that the page is on the buddy > > PCP list. Could be that it is getting reused a couple of times. > > > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables > > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong > > table? Just a guess. Nah, ACPI MADT enumerates the table and that is the proper location of it. > > ... but I assume ibft_check_device() would bail out on an invalid checksum. > So the question is, why is this page not properly marked as reserved > already. The ibft_check_device ends up being called as module way way after the kernel has cleaned the memory. The funny thing about iBFT is that (it is also mentioned in the spec) that the table can resize in memory .. or in the ACPI regions (which have no E820_RAM and are considered "MMIO" regions). Either place is fine, so it can be in either RAM or MMIO :-( > > -- > Thanks, > > David / dhildenb >
On 2/22/2021 11:13 AM, David Hildenbrand wrote: > On 22.02.21 16:13, George Kennedy wrote: >> >> >> On 2/22/2021 4:52 AM, David Hildenbrand wrote: >>> On 20.02.21 00:04, George Kennedy wrote: >>>> >>>> >>>> On 2/19/2021 11:45 AM, George Kennedy wrote: >>>>> >>>>> >>>>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote: >>>>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy >>>>>> <george.kennedy@oracle.com> wrote: >>>>>>> >>>>>>> >>>>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote: >>>>>>>> On 17.02.21 21:56, Andrey Konovalov wrote: >>>>>>>>> During boot, all non-reserved memblock memory is exposed to the >>>>>>>>> buddy >>>>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot >>>>>>>>> time, >>>>>>>>> especially on systems with large amount of RAM. This patch makes >>>>>>>>> page_alloc to not call kasan_free_pages() on all new memory. >>>>>>>>> >>>>>>>>> __free_pages_core() is used when exposing fresh memory during >>>>>>>>> system >>>>>>>>> boot and when onlining memory during hotplug. This patch adds >>>>>>>>> a new >>>>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() >>>>>>>>> through >>>>>>>>> free_pages_prepare() from __free_pages_core(). >>>>>>>>> >>>>>>>>> This has little impact on KASAN memory tracking. >>>>>>>>> >>>>>>>>> Assuming that there are no references to newly exposed pages >>>>>>>>> before they >>>>>>>>> are ever allocated, there won't be any intended (but buggy) >>>>>>>>> accesses to >>>>>>>>> that memory that KASAN would normally detect. >>>>>>>>> >>>>>>>>> However, with this patch, KASAN stops detecting wild and large >>>>>>>>> out-of-bounds accesses that happen to land on a fresh memory page >>>>>>>>> that >>>>>>>>> was never allocated. This is taken as an acceptable trade-off. >>>>>>>>> >>>>>>>>> All memory allocated normally when the boot is over keeps getting >>>>>>>>> poisoned as usual. >>>>>>>>> >>>>>>>>> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> >>>>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d >>>>>>>> Not sure this is the right thing to do, see >>>>>>>> >>>>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Reversing the order in which memory gets allocated + used during >>>>>>>> boot >>>>>>>> (in a patch by me) might have revealed an invalid memory access >>>>>>>> during >>>>>>>> boot. >>>>>>>> >>>>>>>> I suspect that that issue would no longer get detected with your >>>>>>>> patch, as the invalid memory access would simply not get detected. >>>>>>>> Now, I cannot prove that :) >>>>>>> Since David's patch we're having trouble with the iBFT ACPI table, >>>>>>> which >>>>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". >>>>>>> KASAN >>>>>>> detects that it is being used after free when ibft_init() accesses >>>>>>> the >>>>>>> iBFT table, but as of yet we can't find where it get's freed (we've >>>>>>> instrumented calls to kunmap()). >>>>>> Maybe it doesn't get freed, but what you see is a wild or a large >>>>>> out-of-bounds access. Since KASAN marks all memory as freed >>>>>> during the >>>>>> memblock->page_alloc transition, such bugs can manifest as >>>>>> use-after-frees. >>>>> >>>>> It gets freed and re-used. By the time the iBFT table is accessed by >>>>> ibft_init() the page has been over-written. >>>>> >>>>> Setting page flags like the following before the call to kmap() >>>>> prevents the iBFT table page from being freed: >>>> >>>> Cleaned up version: >>>> >>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c >>>> index 0418feb..8f0a8e7 100644 >>>> --- a/drivers/acpi/osl.c >>>> +++ b/drivers/acpi/osl.c >>>> @@ -287,9 +287,12 @@ static void __iomem >>>> *acpi_map(acpi_physical_address >>>> pg_off, unsigned long pg_sz) >>>> >>>> pfn = pg_off >> PAGE_SHIFT; >>>> if (should_use_kmap(pfn)) { >>>> + struct page *page = pfn_to_page(pfn); >>>> + >>>> if (pg_sz > PAGE_SIZE) >>>> return NULL; >>>> - return (void __iomem __force *)kmap(pfn_to_page(pfn)); >>>> + SetPageReserved(page); >>>> + return (void __iomem __force *)kmap(page); >>>> } else >>>> return acpi_os_ioremap(pg_off, pg_sz); >>>> } >>>> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address >>>> pg_off, void __iomem *vaddr) >>>> unsigned long pfn; >>>> >>>> pfn = pg_off >> PAGE_SHIFT; >>>> - if (should_use_kmap(pfn)) >>>> - kunmap(pfn_to_page(pfn)); >>>> - else >>>> + if (should_use_kmap(pfn)) { >>>> + struct page *page = pfn_to_page(pfn); >>>> + >>>> + ClearPageReserved(page); >>>> + kunmap(page); >>>> + } else >>>> iounmap(vaddr); >>>> } >>>> >>>> David, the above works, but wondering why it is now necessary. >>>> kunmap() >>>> is not hit. What other ways could a page mapped via kmap() be >>>> unmapped? >>>> >>> >>> Let me look into the code ... I have little experience with ACPI >>> details, so bear with me. >>> >>> I assume that acpi_map()/acpi_unmap() map some firmware blob that is >>> provided via firmware/bios/... to us. >>> >>> should_use_kmap() tells us whether >>> a) we have a "struct page" and should kmap() that one >>> b) we don't have a "struct page" and should ioremap. >>> >>> As it is a blob, the firmware should always reserve that memory region >>> via memblock (e.g., memblock_reserve()), such that we either >>> 1) don't create a memmap ("struct page") at all (-> case b) ) >>> 2) if we have to create e memmap, we mark the page PG_reserved and >>> *never* expose it to the buddy (-> case a) ) >>> >>> >>> Are you telling me that in this case we might have a memmap for the HW >>> blob that is *not* PG_reserved? In that case it most probably got >>> exposed to the buddy where it can happily get allocated/freed. >>> >>> The latent BUG would be that that blob gets exposed to the system like >>> ordinary RAM, and not reserved via memblock early during boot. >>> Assuming that blob has a low physical address, with my patch it will >>> get allocated/used a lot earlier - which would mean we trigger this >>> latent BUG now more easily. >>> >>> There have been similar latent BUGs on ARM boards that my patch >>> discovered where special RAM regions did not get marked as reserved >>> via the device tree properly. >>> >>> Now, this is just a wild guess :) Can you dump the page when mapping >>> (before PageReserved()) and when unmapping, to see what the state of >>> that memmap is? >> >> Thank you David for the explanation and your help on this, >> >> dump_page() before PageReserved and before kmap() in the above patch: >> >> [ 1.116480] ACPI: Core revision 20201113 >> [ 1.117628] XXX acpi_map: about to call kmap()... >> [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x0 pfn:0xbe453 >> [ 1.120381] flags: 0xfffffc0000000() >> [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 >> 0000000000000000 >> [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff >> 0000000000000000 >> [ 1.124146] page dumped because: acpi_map pre SetPageReserved >> >> I also added dump_page() before unmapping, but it is not hit. The >> following for the same pfn now shows up I believe as a result of setting >> PageReserved: >> >> [ 28.098208] BUG:Bad page state in process mo dprobe pfn:be453 >> [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x1 pfn:0xbe453 >> [ 28.098394] flags: 0xfffffc0001000(reserved) >> [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 >> 0000000000000000 >> [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff >> 0000000000000000 >> [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set >> [ 28.098394] page_owner info is not present (never set?) >> [ 28.098394] Modules linked in: >> [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted >> 5.11.0-3dbd5e3 #66 >> [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >> BIOS 0.0.0 02/06/2015 >> [ 28.098394] Call Trace: >> [ 28.098394] dump_stack+0xdb/0x120 >> [ 28.098394] bad_page.cold.108+0xc6/0xcb >> [ 28.098394] check_new_page_bad+0x47/0xa0 >> [ 28.098394] get_page_from_freelist+0x30cd/0x5730 >> [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 >> [ 28.098394] ? init_object+0x7e/0x90 >> [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 >> [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 >> [ 28.098394] alloc_pages_vma+0xe2/0x560 >> [ 28.098394] do_fault+0x194/0x12c0 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] __handle_mm_fault+0x1650/0x26c0 >> [ 28.098394] ? copy_page_range+0x1350/0x1350 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] handle_mm_fault+0x1f9/0x810 >> [ 28.098394] ? write_comp_data+0x2f/0x90 >> [ 28.098394] do_user_addr_fault+0x6f7/0xca0 >> [ 28.098394] exc_page_fault+0xaf/0x1a0 >> [ 28.098394] asm_exc_page_fault+0x1e/0x30 >> [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 > > I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that > someone is trying to allocate that page with the PG_reserved bit set. > This means that the page actually was exposed to the buddy. > > However, when you SetPageReserved(), I don't think that PG_buddy is > set and the refcount is 0. That could indicate that the page is on the > buddy PCP list. Could be that it is getting reused a couple of times. > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI > tables close to 3 GiB ? No idea. Could it be that you are trying to > map a wrong table? Just a guess. > >> >> What would be the correct way to reserve the page so that the above >> would not be hit? > > I would have assumed that if this is a binary blob, that someone > (which I think would be acpi code) reserved via memblock_reserve() > early during boot. > > E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve(). acpi_table_upgrade() gets called, but bails out before memblock_reserve() is called. Thus, it appears no pages are getting reserved. 503 void __init acpi_table_upgrade(void) 504 { 505 void *data; 506 size_t size; 507 int sig, no, table_nr = 0, total_offset = 0; 508 long offset = 0; 509 struct acpi_table_header *table; 510 char cpio_path[32] = "kernel/firmware/acpi/"; 511 struct cpio_data file; 512 513 if (IS_ENABLED(CONFIG_ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD)) { 514 data = __initramfs_start; 515 size = __initramfs_size; 516 } else { 517 data = (void *)initrd_start; 518 size = initrd_end - initrd_start; 519 } 520 521 if (data == NULL || size == 0) 522 return; 523 524 for (no = 0; no < NR_ACPI_INITRD_TABLES; no++) { 525 file = find_cpio_data(cpio_path, data, size, &offset); 526 if (!file.data) 527 break; ... 563 all_tables_size += table->length; 564 acpi_initrd_files[table_nr].data = file.data; 565 acpi_initrd_files[table_nr].size = file.size; 566 table_nr++; 567 } 568 if (table_nr == 0) 569 return; <-- bails out here "drivers/acpi/tables.c" George
On Mon, Feb 22, 2021 at 12:40:36PM -0500, Konrad Rzeszutek Wilk wrote: > On Mon, Feb 22, 2021 at 05:39:29PM +0100, David Hildenbrand wrote: > > On 22.02.21 17:13, David Hildenbrand wrote: > > > On 22.02.21 16:13, George Kennedy wrote: > > > > > > > > > > > > On 2/22/2021 4:52 AM, David Hildenbrand wrote: > > > > > On 20.02.21 00:04, George Kennedy wrote: > > > > > > > > > > > > > > > > > > On 2/19/2021 11:45 AM, George Kennedy wrote: > > > > > > > > > > > > > > > > > > > > > On 2/18/2021 7:09 PM, Andrey Konovalov wrote: > > > > > > > > On Fri, Feb 19, 2021 at 1:06 AM George Kennedy > > > > > > > > <george.kennedy@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > On 2/18/2021 3:55 AM, David Hildenbrand wrote: > > > > > > > > > > On 17.02.21 21:56, Andrey Konovalov wrote: > > > > > > > > > > > During boot, all non-reserved memblock memory is exposed to the > > > > > > > > > > > buddy > > > > > > > > > > > allocator. Poisoning all that memory with KASAN lengthens boot > > > > > > > > > > > time, > > > > > > > > > > > especially on systems with large amount of RAM. This patch makes > > > > > > > > > > > page_alloc to not call kasan_free_pages() on all new memory. > > > > > > > > > > > > > > > > > > > > > > __free_pages_core() is used when exposing fresh memory during > > > > > > > > > > > system > > > > > > > > > > > boot and when onlining memory during hotplug. This patch adds a new > > > > > > > > > > > FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() > > > > > > > > > > > through > > > > > > > > > > > free_pages_prepare() from __free_pages_core(). > > > > > > > > > > > > > > > > > > > > > > This has little impact on KASAN memory tracking. > > > > > > > > > > > > > > > > > > > > > > Assuming that there are no references to newly exposed pages > > > > > > > > > > > before they > > > > > > > > > > > are ever allocated, there won't be any intended (but buggy) > > > > > > > > > > > accesses to > > > > > > > > > > > that memory that KASAN would normally detect. > > > > > > > > > > > > > > > > > > > > > > However, with this patch, KASAN stops detecting wild and large > > > > > > > > > > > out-of-bounds accesses that happen to land on a fresh memory page > > > > > > > > > > > that > > > > > > > > > > > was never allocated. This is taken as an acceptable trade-off. > > > > > > > > > > > > > > > > > > > > > > All memory allocated normally when the boot is over keeps getting > > > > > > > > > > > poisoned as usual. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Andrey Konovalov <andreyknvl@google.com> > > > > > > > > > > > Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d > > > > > > > > > > Not sure this is the right thing to do, see > > > > > > > > > > > > > > > > > > > > https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Reversing the order in which memory gets allocated + used during > > > > > > > > > > boot > > > > > > > > > > (in a patch by me) might have revealed an invalid memory access > > > > > > > > > > during > > > > > > > > > > boot. > > > > > > > > > > > > > > > > > > > > I suspect that that issue would no longer get detected with your > > > > > > > > > > patch, as the invalid memory access would simply not get detected. > > > > > > > > > > Now, I cannot prove that :) > > > > > > > > > Since David's patch we're having trouble with the iBFT ACPI table, > > > > > > > > > which > > > > > > > > > is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". > > > > > > > > > KASAN > > > > > > > > > detects that it is being used after free when ibft_init() accesses > > > > > > > > > the > > > > > > > > > iBFT table, but as of yet we can't find where it get's freed (we've > > > > > > > > > instrumented calls to kunmap()). > > > > > > > > Maybe it doesn't get freed, but what you see is a wild or a large > > > > > > > > out-of-bounds access. Since KASAN marks all memory as freed during the > > > > > > > > memblock->page_alloc transition, such bugs can manifest as > > > > > > > > use-after-frees. > > > > > > > > > > > > > > It gets freed and re-used. By the time the iBFT table is accessed by > > > > > > > ibft_init() the page has been over-written. > > > > > > > > > > > > > > Setting page flags like the following before the call to kmap() > > > > > > > prevents the iBFT table page from being freed: > > > > > > > > > > > > Cleaned up version: > > > > > > > > > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > > > > > > index 0418feb..8f0a8e7 100644 > > > > > > --- a/drivers/acpi/osl.c > > > > > > +++ b/drivers/acpi/osl.c > > > > > > @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address > > > > > > pg_off, unsigned long pg_sz) > > > > > > > > > > > > pfn = pg_off >> PAGE_SHIFT; > > > > > > if (should_use_kmap(pfn)) { > > > > > > + struct page *page = pfn_to_page(pfn); > > > > > > + > > > > > > if (pg_sz > PAGE_SIZE) > > > > > > return NULL; > > > > > > - return (void __iomem __force *)kmap(pfn_to_page(pfn)); > > > > > > + SetPageReserved(page); > > > > > > + return (void __iomem __force *)kmap(page); > > > > > > } else > > > > > > return acpi_os_ioremap(pg_off, pg_sz); > > > > > > } > > > > > > @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address > > > > > > pg_off, void __iomem *vaddr) > > > > > > unsigned long pfn; > > > > > > > > > > > > pfn = pg_off >> PAGE_SHIFT; > > > > > > - if (should_use_kmap(pfn)) > > > > > > - kunmap(pfn_to_page(pfn)); > > > > > > - else > > > > > > + if (should_use_kmap(pfn)) { > > > > > > + struct page *page = pfn_to_page(pfn); > > > > > > + > > > > > > + ClearPageReserved(page); > > > > > > + kunmap(page); > > > > > > + } else > > > > > > iounmap(vaddr); > > > > > > } > > > > > > > > > > > > David, the above works, but wondering why it is now necessary. kunmap() > > > > > > is not hit. What other ways could a page mapped via kmap() be unmapped? > > > > > > > > > > > > > > > > Let me look into the code ... I have little experience with ACPI > > > > > details, so bear with me. > > > > > > > > > > I assume that acpi_map()/acpi_unmap() map some firmware blob that is > > > > > provided via firmware/bios/... to us. > > > > > > > > > > should_use_kmap() tells us whether > > > > > a) we have a "struct page" and should kmap() that one > > > > > b) we don't have a "struct page" and should ioremap. > > > > > > > > > > As it is a blob, the firmware should always reserve that memory region > > > > > via memblock (e.g., memblock_reserve()), such that we either > > > > > 1) don't create a memmap ("struct page") at all (-> case b) ) > > > > > 2) if we have to create e memmap, we mark the page PG_reserved and > > > > > *never* expose it to the buddy (-> case a) ) > > > > > > > > > > > > > > > Are you telling me that in this case we might have a memmap for the HW > > > > > blob that is *not* PG_reserved? In that case it most probably got > > > > > exposed to the buddy where it can happily get allocated/freed. > > > > > > > > > > The latent BUG would be that that blob gets exposed to the system like > > > > > ordinary RAM, and not reserved via memblock early during boot. > > > > > Assuming that blob has a low physical address, with my patch it will > > > > > get allocated/used a lot earlier - which would mean we trigger this > > > > > latent BUG now more easily. > > > > > > > > > > There have been similar latent BUGs on ARM boards that my patch > > > > > discovered where special RAM regions did not get marked as reserved > > > > > via the device tree properly. > > > > > > > > > > Now, this is just a wild guess :) Can you dump the page when mapping > > > > > (before PageReserved()) and when unmapping, to see what the state of > > > > > that memmap is? > > > > > > > > Thank you David for the explanation and your help on this, > > > > > > > > dump_page() before PageReserved and before kmap() in the above patch: > > > > > > > > [ 1.116480] ACPI: Core revision 20201113 > > > > [ 1.117628] XXX acpi_map: about to call kmap()... > > > > [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 > > > > mapping:0000000000000000 index:0x0 pfn:0xbe453 > > > > [ 1.120381] flags: 0xfffffc0000000() > > > > [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 > > > > 0000000000000000 > > > > [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff > > > > 0000000000000000 > > > > [ 1.124146] page dumped because: acpi_map pre SetPageReserved > > > > > > > > I also added dump_page() before unmapping, but it is not hit. The > > > > following for the same pfn now shows up I believe as a result of setting > > > > PageReserved: > > > > > > > > [ 28.098208] BUG:Bad page state in process mo dprobe pfn:be453 > > > > [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 > > > > mapping:0000000000000000 index:0x1 pfn:0xbe453 > > > > [ 28.098394] flags: 0xfffffc0001000(reserved) > > > > [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 > > > > 0000000000000000 > > > > [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff > > > > 0000000000000000 > > > > [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set > > > > [ 28.098394] page_owner info is not present (never set?) > > > > [ 28.098394] Modules linked in: > > > > [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66 > > > > [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > > BIOS 0.0.0 02/06/2015 > > > > [ 28.098394] Call Trace: > > > > [ 28.098394] dump_stack+0xdb/0x120 > > > > [ 28.098394] bad_page.cold.108+0xc6/0xcb > > > > [ 28.098394] check_new_page_bad+0x47/0xa0 > > > > [ 28.098394] get_page_from_freelist+0x30cd/0x5730 > > > > [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 > > > > [ 28.098394] ? init_object+0x7e/0x90 > > > > [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 > > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > > [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 > > > > [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 > > > > [ 28.098394] alloc_pages_vma+0xe2/0x560 > > > > [ 28.098394] do_fault+0x194/0x12c0 > > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > > [ 28.098394] __handle_mm_fault+0x1650/0x26c0 > > > > [ 28.098394] ? copy_page_range+0x1350/0x1350 > > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > > [ 28.098394] handle_mm_fault+0x1f9/0x810 > > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > > [ 28.098394] do_user_addr_fault+0x6f7/0xca0 > > > > [ 28.098394] exc_page_fault+0xaf/0x1a0 > > > > [ 28.098394] asm_exc_page_fault+0x1e/0x30 > > > > [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 > > > > > > I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that > > > someone is trying to allocate that page with the PG_reserved bit set. > > > This means that the page actually was exposed to the buddy. > > > > > > However, when you SetPageReserved(), I don't think that PG_buddy is set > > > and the refcount is 0. That could indicate that the page is on the buddy > > > PCP list. Could be that it is getting reused a couple of times. > > > > > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables > > > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong > > > table? Just a guess. > > Nah, ACPI MADT enumerates the table and that is the proper location of it. > > > > ... but I assume ibft_check_device() would bail out on an invalid checksum. > > So the question is, why is this page not properly marked as reserved > > already. > > The ibft_check_device ends up being called as module way way after the > kernel has cleaned the memory. > > The funny thing about iBFT is that (it is also mentioned in the spec) > that the table can resize in memory .. or in the ACPI regions (which ^ reside I presume? > have no E820_RAM and are considered "MMIO" regions). > > Either place is fine, so it can be in either RAM or MMIO :-( I'd say that the tables in this case are in E820_RAM, because with MMIO we wouldn't get to kmap() at the first place. It can be easily confirmed by comparing the problematic address with /proc/iomem. Can't say I have a clue about what's going on there, but the theory that somehow iBFT table does not get PG_Reserved during boot makes sense. Do you see "iBFT found at 0x<addr>" early in the kernel log? I don't know if ACPI relocates the tables, but I could not find anywhere that it reserves the original ones. The memblock_reserve() in acpi_table_upgrade() is merely a part of open coded memblock allocation.
On Mon, Feb 22, 2021 at 01:42:56PM -0500, George Kennedy wrote: > > On 2/22/2021 11:13 AM, David Hildenbrand wrote: > > On 22.02.21 16:13, George Kennedy wrote: > > > > > > On 2/22/2021 4:52 AM, David Hildenbrand wrote: > > > > > > > > Let me look into the code ... I have little experience with ACPI > > > > details, so bear with me. > > > > > > > > I assume that acpi_map()/acpi_unmap() map some firmware blob that is > > > > provided via firmware/bios/... to us. > > > > > > > > should_use_kmap() tells us whether > > > > a) we have a "struct page" and should kmap() that one > > > > b) we don't have a "struct page" and should ioremap. > > > > > > > > As it is a blob, the firmware should always reserve that memory region > > > > via memblock (e.g., memblock_reserve()), such that we either > > > > 1) don't create a memmap ("struct page") at all (-> case b) ) > > > > 2) if we have to create e memmap, we mark the page PG_reserved and > > > > *never* expose it to the buddy (-> case a) ) > > > > > > > > > > > > Are you telling me that in this case we might have a memmap for the HW > > > > blob that is *not* PG_reserved? In that case it most probably got > > > > exposed to the buddy where it can happily get allocated/freed. > > > > > > > > The latent BUG would be that that blob gets exposed to the system like > > > > ordinary RAM, and not reserved via memblock early during boot. > > > > Assuming that blob has a low physical address, with my patch it will > > > > get allocated/used a lot earlier - which would mean we trigger this > > > > latent BUG now more easily. > > > > > > > > There have been similar latent BUGs on ARM boards that my patch > > > > discovered where special RAM regions did not get marked as reserved > > > > via the device tree properly. > > > > > > > > Now, this is just a wild guess :) Can you dump the page when mapping > > > > (before PageReserved()) and when unmapping, to see what the state of > > > > that memmap is? > > > > > > Thank you David for the explanation and your help on this, > > > > > > dump_page() before PageReserved and before kmap() in the above patch: > > > > > > [ 1.116480] ACPI: Core revision 20201113 > > > [ 1.117628] XXX acpi_map: about to call kmap()... > > > [ 1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 > > > mapping:0000000000000000 index:0x0 pfn:0xbe453 > > > [ 1.120381] flags: 0xfffffc0000000() > > > [ 1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 > > > 0000000000000000 > > > [ 1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff > > > 0000000000000000 > > > [ 1.124146] page dumped because: acpi_map pre SetPageReserved > > > > > > I also added dump_page() before unmapping, but it is not hit. The > > > following for the same pfn now shows up I believe as a result of setting > > > PageReserved: > > > > > > [ 28.098208] BUG:Bad page state in process mo dprobe pfn:be453 > > > [ 28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 > > > mapping:0000000000000000 index:0x1 pfn:0xbe453 > > > [ 28.098394] flags: 0xfffffc0001000(reserved) > > > [ 28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 > > > 0000000000000000 > > > [ 28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff > > > 0000000000000000 > > > [ 28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set > > > [ 28.098394] page_owner info is not present (never set?) > > > [ 28.098394] Modules linked in: > > > [ 28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted > > > 5.11.0-3dbd5e3 #66 > > > [ 28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > BIOS 0.0.0 02/06/2015 > > > [ 28.098394] Call Trace: > > > [ 28.098394] dump_stack+0xdb/0x120 > > > [ 28.098394] bad_page.cold.108+0xc6/0xcb > > > [ 28.098394] check_new_page_bad+0x47/0xa0 > > > [ 28.098394] get_page_from_freelist+0x30cd/0x5730 > > > [ 28.098394] ? __isolate_free_page+0x4f0/0x4f0 > > > [ 28.098394] ? init_object+0x7e/0x90 > > > [ 28.098394] __alloc_pages_nodemask+0x2d8/0x650 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110 > > > [ 28.098394] ? __sanitizer_cov_trace_pc+0x21/0x50 > > > [ 28.098394] alloc_pages_vma+0xe2/0x560 > > > [ 28.098394] do_fault+0x194/0x12c0 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] __handle_mm_fault+0x1650/0x26c0 > > > [ 28.098394] ? copy_page_range+0x1350/0x1350 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] handle_mm_fault+0x1f9/0x810 > > > [ 28.098394] ? write_comp_data+0x2f/0x90 > > > [ 28.098394] do_user_addr_fault+0x6f7/0xca0 > > > [ 28.098394] exc_page_fault+0xaf/0x1a0 > > > [ 28.098394] asm_exc_page_fault+0x1e/0x30 > > > [ 28.098394] RIP: 0010:__clear_user+0x30/0x60 > > > > I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that > > someone is trying to allocate that page with the PG_reserved bit set. > > This means that the page actually was exposed to the buddy. > > > > However, when you SetPageReserved(), I don't think that PG_buddy is set > > and the refcount is 0. That could indicate that the page is on the buddy > > PCP list. Could be that it is getting reused a couple of times. > > > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables > > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong > > table? Just a guess. > > > > > > > > What would be the correct way to reserve the page so that the above > > > would not be hit? > > > > I would have assumed that if this is a binary blob, that someone (which > > I think would be acpi code) reserved via memblock_reserve() early during > > boot. > > > > E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve(). > > acpi_table_upgrade() gets called, but bails out before memblock_reserve() is > called. Thus, it appears no pages are getting reserved. acpi_table_upgrade() does not actually reserve memory but rather open codes memblock allocation with memblock_find_in_range() + memblock_reserve(), so it does not seem related anyway. Do you have by chance a full boot log handy? > 503 void __init acpi_table_upgrade(void) > 504 { ... > 568 if (table_nr == 0) > 569 return; <-- bails > out here > "drivers/acpi/tables.c" > > George >
(re-added CC) On Mon, Feb 22, 2021 at 08:24:59PM -0500, George Kennedy wrote: > > On 2/22/2021 4:55 PM, Mike Rapoport wrote: > > On Mon, Feb 22, 2021 at 01:42:56PM -0500, George Kennedy wrote: > > > On 2/22/2021 11:13 AM, David Hildenbrand wrote: > > > > On 22.02.21 16:13, George Kennedy wrote: > > > > > > > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables > > > > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong > > > > table? Just a guess. > > > > > > > > > What would be the correct way to reserve the page so that the above > > > > > would not be hit? > > > > I would have assumed that if this is a binary blob, that someone (which > > > > I think would be acpi code) reserved via memblock_reserve() early during > > > > boot. > > > > > > > > E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve(). > > > acpi_table_upgrade() gets called, but bails out before memblock_reserve() is > > > called. Thus, it appears no pages are getting reserved. > > acpi_table_upgrade() does not actually reserve memory but rather open > > codes memblock allocation with memblock_find_in_range() + > > memblock_reserve(), so it does not seem related anyway. > > > > Do you have by chance a full boot log handy? > > Hello Mike, > > Are you after the console output? See attached. > > It includes my patch to set PG_Reserved along with the dump_page() debug > that David asked for - see: "page:" So, iBFT is indeed at pfn 0xbe453: [ 0.077698] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) and it's in E820_TYPE_RAM region rather than in ACPI data: [ 0.000000] BIOS-e820: [mem 0x0000000000810000-0x00000000008fffff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000000900000-0x00000000be49afff] usable [ 0.000000] BIOS-e820: [mem 0x00000000be49b000-0x00000000be49bfff] ACPI data I could not find anywhere in x86 setup or in ACPI tables parsing the code that reserves this memory or any other ACPI data for that matter. It could be that I've missed some copying of the data to statically allocated initial_tables, but AFAICS any ACPI data that was not marked as such in e820 tables by BIOS resides in memory that is considered as free. Can you please check if this hack (entirely untested) changes anything: diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7bdc0239a943..c118dd54a747 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) if (acpi_disabled) return; +#if 0 /* * Initialize the ACPI boot-time table parser. */ @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) disable_acpi(); return; } +#endif acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index d883176ef2ce..c8a07a7b9577 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1032,6 +1032,14 @@ void __init setup_arch(char **cmdline_p) */ find_smp_config(); + /* + * Initialize the ACPI boot-time table parser. + */ + if (acpi_table_init()) { + disable_acpi(); + return; + } + reserve_ibft_region(); early_alloc_pgt_buf(); diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 64bb94523281..2e5e04090fe2 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -80,6 +80,21 @@ static int __init find_ibft_in_mem(void) done: return len; } + +static void __init acpi_find_ibft_region(void) +{ + int i; + struct acpi_table_header *table = NULL; + + if (acpi_disabled) + return; + + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { + acpi_get_table(ibft_signs[i].sign, 0, &table); + ibft_addr = (struct acpi_table_ibft *)table; + } +} + /* * Routine used to find the iSCSI Boot Format Table. The logical * kernel address is set in the ibft_addr global variable. @@ -93,6 +108,8 @@ unsigned long __init find_ibft_region(unsigned long *sizep) if (!efi_enabled(EFI_BOOT)) find_ibft_in_mem(); + else + acpi_find_ibft_region(); if (ibft_addr) { *sizep = PAGE_ALIGN(ibft_addr->header.length); > Thank you, > George
Hi George, On Tue, Feb 23, 2021 at 09:35:32AM -0500, George Kennedy wrote: > > On 2/23/2021 5:33 AM, Mike Rapoport wrote: > > (re-added CC) > > > > On Mon, Feb 22, 2021 at 08:24:59PM -0500, George Kennedy wrote: > > > On 2/22/2021 4:55 PM, Mike Rapoport wrote: > > > > On Mon, Feb 22, 2021 at 01:42:56PM -0500, George Kennedy wrote: > > > > > On 2/22/2021 11:13 AM, David Hildenbrand wrote: > > > > > > On 22.02.21 16:13, George Kennedy wrote: > > > > > > > > > > > > The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables > > > > > > close to 3 GiB ? No idea. Could it be that you are trying to map a wrong > > > > > > table? Just a guess. > > > > > > > > > > > > > What would be the correct way to reserve the page so that the above > > > > > > > would not be hit? > > > > > > I would have assumed that if this is a binary blob, that someone (which > > > > > > I think would be acpi code) reserved via memblock_reserve() early during > > > > > > boot. > > > > > > > > > > > > E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve(). > > > > > acpi_table_upgrade() gets called, but bails out before memblock_reserve() is > > > > > called. Thus, it appears no pages are getting reserved. > > > > acpi_table_upgrade() does not actually reserve memory but rather open > > > > codes memblock allocation with memblock_find_in_range() + > > > > memblock_reserve(), so it does not seem related anyway. > > > > > > > > Do you have by chance a full boot log handy? > > > Hello Mike, > > > > > > Are you after the console output? See attached. > > > > > > It includes my patch to set PG_Reserved along with the dump_page() debug > > > that David asked for - see: "page:" > > So, iBFT is indeed at pfn 0xbe453: > > > > [ 0.077698] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) > > and it's in E820_TYPE_RAM region rather than in ACPI data: > > > > [ 0.000000] BIOS-e820: [mem 0x0000000000810000-0x00000000008fffff] ACPI NVS > > [ 0.000000] BIOS-e820: [mem 0x0000000000900000-0x00000000be49afff] usable > > [ 0.000000] BIOS-e820: [mem 0x00000000be49b000-0x00000000be49bfff] ACPI data > > > > I could not find anywhere in x86 setup or in ACPI tables parsing the code > > that reserves this memory or any other ACPI data for that matter. It could > > be that I've missed some copying of the data to statically allocated > > initial_tables, but AFAICS any ACPI data that was not marked as such in > > e820 tables by BIOS resides in memory that is considered as free. > > > > Close... > > Applied the patch, see "[ 30.136157] iBFT detected.", but now hit the > following (missing iounmap()? see full console output attached): > > diff --git a/drivers/firmware/iscsi_ibft_find.c > b/drivers/firmware/iscsi_ibft_find.c > index 64bb945..2e5e040 100644 > --- a/drivers/firmware/iscsi_ibft_find.c > +++ b/drivers/firmware/iscsi_ibft_find.c > @@ -80,6 +80,21 @@ static int __init find_ibft_in_mem(void) > done: > return len; > } > + > +static void __init acpi_find_ibft_region(void) > +{ > + int i; > + struct acpi_table_header *table = NULL; > + > + if (acpi_disabled) > + return; > + > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > + acpi_get_table(ibft_signs[i].sign, 0, &table); > + ibft_addr = (struct acpi_table_ibft *)table; Can you try adding acpi_put_table(table); here? > + } > +} > +
On 2/23/2021 10:47 AM, Mike Rapoport wrote: > Hi George, > > On Tue, Feb 23, 2021 at 09:35:32AM -0500, George Kennedy wrote: >> On 2/23/2021 5:33 AM, Mike Rapoport wrote: >>> (re-added CC) >>> >>> On Mon, Feb 22, 2021 at 08:24:59PM -0500, George Kennedy wrote: >>>> On 2/22/2021 4:55 PM, Mike Rapoport wrote: >>>>> On Mon, Feb 22, 2021 at 01:42:56PM -0500, George Kennedy wrote: >>>>>> On 2/22/2021 11:13 AM, David Hildenbrand wrote: >>>>>>> On 22.02.21 16:13, George Kennedy wrote: >>>>>>> >>>>>>> The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables >>>>>>> close to 3 GiB ? No idea. Could it be that you are trying to map a wrong >>>>>>> table? Just a guess. >>>>>>> >>>>>>>> What would be the correct way to reserve the page so that the above >>>>>>>> would not be hit? >>>>>>> I would have assumed that if this is a binary blob, that someone (which >>>>>>> I think would be acpi code) reserved via memblock_reserve() early during >>>>>>> boot. >>>>>>> >>>>>>> E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve(). >>>>>> acpi_table_upgrade() gets called, but bails out before memblock_reserve() is >>>>>> called. Thus, it appears no pages are getting reserved. >>>>> acpi_table_upgrade() does not actually reserve memory but rather open >>>>> codes memblock allocation with memblock_find_in_range() + >>>>> memblock_reserve(), so it does not seem related anyway. >>>>> >>>>> Do you have by chance a full boot log handy? >>>> Hello Mike, >>>> >>>> Are you after the console output? See attached. >>>> >>>> It includes my patch to set PG_Reserved along with the dump_page() debug >>>> that David asked for - see: "page:" >>> So, iBFT is indeed at pfn 0xbe453: >>> >>> [ 0.077698] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) >>> and it's in E820_TYPE_RAM region rather than in ACPI data: >>> >>> [ 0.000000] BIOS-e820: [mem 0x0000000000810000-0x00000000008fffff] ACPI NVS >>> [ 0.000000] BIOS-e820: [mem 0x0000000000900000-0x00000000be49afff] usable >>> [ 0.000000] BIOS-e820: [mem 0x00000000be49b000-0x00000000be49bfff] ACPI data >>> >>> I could not find anywhere in x86 setup or in ACPI tables parsing the code >>> that reserves this memory or any other ACPI data for that matter. It could >>> be that I've missed some copying of the data to statically allocated >>> initial_tables, but AFAICS any ACPI data that was not marked as such in >>> e820 tables by BIOS resides in memory that is considered as free. >>> >> Close... >> >> Applied the patch, see "[ 30.136157] iBFT detected.", but now hit the >> following (missing iounmap()? see full console output attached): >> >> diff --git a/drivers/firmware/iscsi_ibft_find.c >> b/drivers/firmware/iscsi_ibft_find.c >> index 64bb945..2e5e040 100644 >> --- a/drivers/firmware/iscsi_ibft_find.c >> +++ b/drivers/firmware/iscsi_ibft_find.c >> @@ -80,6 +80,21 @@ static int __init find_ibft_in_mem(void) >> done: >> return len; >> } >> + >> +static void __init acpi_find_ibft_region(void) >> +{ >> + int i; >> + struct acpi_table_header *table = NULL; >> + >> + if (acpi_disabled) >> + return; >> + >> + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { >> + acpi_get_table(ibft_signs[i].sign, 0, &table); >> + ibft_addr = (struct acpi_table_ibft *)table; > Can you try adding > > acpi_put_table(table); > > here? Mike, It now crashes here: [ 0.051019] ACPI: Early table checksum verification disabled [ 0.056721] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) [ 0.057874] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP 00000001 01000013) [ 0.059590] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001) [ 0.061306] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001) [ 0.063006] ACPI: FACS 0x00000000BFBFD000 000040 [ 0.063938] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001) [ 0.065638] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001) [ 0.067335] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 00000002 01000013) [ 0.069030] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) [ 0.070734] XXX acpi_find_ibft_region: [ 0.071468] XXX iBFT, status=0 [ 0.072073] XXX about to call acpi_put_table()... ibft_addr=ffffffffff240000 [ 0.073449] XXX acpi_find_ibft_region(EXIT): PANIC: early exception 0x0e IP 10:ffffffff9259f439 error 0 cr2 0xffffffffff240004 [ 0.075711] CPU: 0 PID: 0 Comm: swapper Not tainted 5.11.0-34a2105 #8 [ 0.076983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 0.078579] RIP: 0010:find_ibft_region+0x470/0x577 [ 0.079541] Code: f1 40 0f 9e c6 84 c9 0f 95 c1 40 84 ce 75 11 83 e0 07 38 c2 0f 9e c1 84 d2 0f 95 c0 84 c1 74 0a be 04 00 00 00 e8 37 f8 5f ef <8b> 5b 04 4c 89 fa b8 ff ff 37 00 48 c1 ea 03 48 c1 e0 2a 81 c3 ff [ 0.083207] RSP: 0000:ffffffff8fe07ca8 EFLAGS: 00010046 ORIG_RAX: 0000000000000000 [ 0.084709] RAX: 0000000000000000 RBX: ffffffffff240000 RCX: ffffffff815fcf01 [ 0.086109] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffffffffff240004 [ 0.087509] RBP: ffffffff8fe07d60 R08: fffffbfff1fc0f21 R09: fffffbfff1fc0f21 [ 0.088911] R10: ffffffff8fe07907 R11: fffffbfff1fc0f20 R12: ffffffff8fe07d38 [ 0.090310] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8fe07e80 [ 0.091716] FS: 0000000000000000(0000) GS:ffffffff92409000(0000) knlGS:0000000000000000 [ 0.093304] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.094435] CR2: ffffffffff240004 CR3: 0000000027630000 CR4: 00000000000006a0 [ 0.095843] Call Trace: [ 0.096345] ? acpi_table_init+0x3eb/0x428 [ 0.097164] ? dmi_id_init+0x871/0x871 [ 0.097912] ? early_memunmap+0x22/0x27 [ 0.098683] ? smp_scan_config+0x20e/0x230 [ 0.099500] setup_arch+0xd3e/0x181d [ 0.100221] ? reserve_standard_io_resources+0x3e/0x3e [ 0.101265] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 0.102203] ? vprintk_func+0xe9/0x200 [ 0.102953] ? printk+0xac/0xd4 [ 0.103589] ? record_print_text.cold.38+0x16/0x16 [ 0.104540] ? write_comp_data+0x2f/0x90 [ 0.105325] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 0.106262] start_kernel+0x6c/0x474 [ 0.106981] x86_64_start_reservations+0x37/0x39 [ 0.107902] x86_64_start_kernel+0x7b/0x7e [ 0.108722] secondary_startup_64_no_verify+0xb0/0xbb Added debug to dump out the ibft_addr: [root@gkennedy-20210107-1202 linux-upwork]# git diff diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 2e5e040..a246373 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -83,16 +83,22 @@ static int __init find_ibft_in_mem(void) static void __init acpi_find_ibft_region(void) { - int i; + int i, status; struct acpi_table_header *table = NULL; - +printk(KERN_ERR "XXX acpi_find_ibft_region:\n"); if (acpi_disabled) return; for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { - acpi_get_table(ibft_signs[i].sign, 0, &table); - ibft_addr = (struct acpi_table_ibft *)table; + status = acpi_get_table(ibft_signs[i].sign, 0, &table); + printk(KERN_ERR "XXX %s, status=%x\n", ibft_signs[i].sign, status); + if (ACPI_SUCCESS(status)) { + ibft_addr = (struct acpi_table_ibft *)table; + printk(KERN_ERR "XXX about to call acpi_put_table()... ibft_addr=%llx\n", (u64)ibft_addr); + acpi_put_table(table); + } } +printk(KERN_ERR "XXX acpi_find_ibft_region(EXIT):\n"); } /* (END) George > >> + } >> +} >> +
On Tue, Feb 23, 2021 at 01:05:05PM -0500, George Kennedy wrote: > On 2/23/2021 10:47 AM, Mike Rapoport wrote: > > It now crashes here: > > [ 0.051019] ACPI: Early table checksum verification disabled > [ 0.056721] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > [ 0.057874] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP > 00000001 01000013) > [ 0.059590] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP > 00000001 BXPC 00000001) > [ 0.061306] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT > 00000001 BXPC 00000001) > [ 0.063006] ACPI: FACS 0x00000000BFBFD000 000040 > [ 0.063938] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC > 00000001 BXPC 00000001) > [ 0.065638] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET > 00000001 BXPC 00000001) > [ 0.067335] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 > 00000002 01000013) > [ 0.069030] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP > 00000000 00000000) > [ 0.070734] XXX acpi_find_ibft_region: > [ 0.071468] XXX iBFT, status=0 > [ 0.072073] XXX about to call acpi_put_table()... > ibft_addr=ffffffffff240000 > [ 0.073449] XXX acpi_find_ibft_region(EXIT): > PANIC: early exception 0x0e IP 10:ffffffff9259f439 error 0 cr2 > 0xffffffffff240004 Right, I've missed the dereference of the ibft_addr after acpi_find_ibft_region(). With this change to iscsi_ibft_find.c instead of the previous one it should be better: diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 64bb94523281..1be7481d5c69 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -80,6 +80,27 @@ static int __init find_ibft_in_mem(void) done: return len; } + +static void __init acpi_find_ibft_region(unsigned long *sizep) +{ + int i; + struct acpi_table_header *table = NULL; + acpi_status status; + + if (acpi_disabled) + return; + + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { + status = acpi_get_table(ibft_signs[i].sign, 0, &table); + if (ACPI_SUCCESS(status)) { + ibft_addr = (struct acpi_table_ibft *)table; + *sizep = PAGE_ALIGN(ibft_addr->header.length); + acpi_put_table(table); + break; + } + } +} + /* * Routine used to find the iSCSI Boot Format Table. The logical * kernel address is set in the ibft_addr global variable. @@ -91,14 +112,16 @@ unsigned long __init find_ibft_region(unsigned long *sizep) /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will * only use ACPI for this */ - if (!efi_enabled(EFI_BOOT)) + if (!efi_enabled(EFI_BOOT)) { find_ibft_in_mem(); - - if (ibft_addr) { *sizep = PAGE_ALIGN(ibft_addr->header.length); - return (u64)virt_to_phys(ibft_addr); + } else { + acpi_find_ibft_region(sizep); } + if (ibft_addr) + return (u64)virt_to_phys(ibft_addr); + *sizep = 0; return 0; } > [ 0.075711] CPU: 0 PID: 0 Comm: swapper Not tainted 5.11.0-34a2105 #8 > [ 0.076983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 0.0.0 02/06/2015 > [ 0.078579] RIP: 0010:find_ibft_region+0x470/0x577
On 2/23/2021 3:09 PM, Mike Rapoport wrote: > On Tue, Feb 23, 2021 at 01:05:05PM -0500, George Kennedy wrote: >> On 2/23/2021 10:47 AM, Mike Rapoport wrote: >> >> It now crashes here: >> >> [ 0.051019] ACPI: Early table checksum verification disabled >> [ 0.056721] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) >> [ 0.057874] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP >> 00000001 01000013) >> [ 0.059590] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP >> 00000001 BXPC 00000001) >> [ 0.061306] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT >> 00000001 BXPC 00000001) >> [ 0.063006] ACPI: FACS 0x00000000BFBFD000 000040 >> [ 0.063938] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC >> 00000001 BXPC 00000001) >> [ 0.065638] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET >> 00000001 BXPC 00000001) >> [ 0.067335] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 >> 00000002 01000013) >> [ 0.069030] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP >> 00000000 00000000) >> [ 0.070734] XXX acpi_find_ibft_region: >> [ 0.071468] XXX iBFT, status=0 >> [ 0.072073] XXX about to call acpi_put_table()... >> ibft_addr=ffffffffff240000 >> [ 0.073449] XXX acpi_find_ibft_region(EXIT): >> PANIC: early exception 0x0e IP 10:ffffffff9259f439 error 0 cr2 >> 0xffffffffff240004 > Right, I've missed the dereference of the ibft_addr after > acpi_find_ibft_region(). > > With this change to iscsi_ibft_find.c instead of the previous one it should > be better: > > diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c > index 64bb94523281..1be7481d5c69 100644 > --- a/drivers/firmware/iscsi_ibft_find.c > +++ b/drivers/firmware/iscsi_ibft_find.c > @@ -80,6 +80,27 @@ static int __init find_ibft_in_mem(void) > done: > return len; > } > + > +static void __init acpi_find_ibft_region(unsigned long *sizep) > +{ > + int i; > + struct acpi_table_header *table = NULL; > + acpi_status status; > + > + if (acpi_disabled) > + return; > + > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > + if (ACPI_SUCCESS(status)) { > + ibft_addr = (struct acpi_table_ibft *)table; > + *sizep = PAGE_ALIGN(ibft_addr->header.length); > + acpi_put_table(table); > + break; > + } > + } > +} > + > /* > * Routine used to find the iSCSI Boot Format Table. The logical > * kernel address is set in the ibft_addr global variable. > @@ -91,14 +112,16 @@ unsigned long __init find_ibft_region(unsigned long *sizep) > /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > * only use ACPI for this */ > > - if (!efi_enabled(EFI_BOOT)) > + if (!efi_enabled(EFI_BOOT)) { > find_ibft_in_mem(); > - > - if (ibft_addr) { > *sizep = PAGE_ALIGN(ibft_addr->header.length); > - return (u64)virt_to_phys(ibft_addr); > + } else { > + acpi_find_ibft_region(sizep); > } > > + if (ibft_addr) > + return (u64)virt_to_phys(ibft_addr); > + > *sizep = 0; > return 0; > } Mike, No luck. Back to the original KASAN ibft_init crash. I ran with only the above patch from you. Was that what you wanted? Your previous patch had a section defined out by #if 0. Was that supposed to be in there as well? If you need the console output let me know. Got bounced because it was too large. [ 30.124650] iBFT detected. [ 30.125228] ================================================================== [ 30.126201] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 [ 30.126201] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 [ 30.126201] [ 30.126201] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #9 [ 30.126201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 30.126201] Call Trace: [ 30.126201] dump_stack+0xdb/0x120 [ 30.126201] ? ibft_init+0x134/0xc33 [ 30.126201] print_address_description.constprop.7+0x41/0x60 [ 30.126201] ? ibft_init+0x134/0xc33 [ 30.126201] ? ibft_init+0x134/0xc33 [ 30.126201] kasan_report.cold.10+0x78/0xd1 [ 30.126201] ? ibft_init+0x134/0xc33 [ 30.126201] __asan_report_load_n_noabort+0xf/0x20 [ 30.126201] ibft_init+0x134/0xc33 [ 30.126201] ? write_comp_data+0x2f/0x90 [ 30.126201] ? ibft_check_initiator_for+0x159/0x159 [ 30.126201] ? write_comp_data+0x2f/0x90 [ 30.126201] ? ibft_check_initiator_for+0x159/0x159 [ 30.126201] do_one_initcall+0xc4/0x3e0 [ 30.126201] ? perf_trace_initcall_level+0x3e0/0x3e0 [ 30.126201] ? unpoison_range+0x14/0x40 [ 30.126201] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 [ 30.126201] ? kernel_init_freeable+0x420/0x652 [ 30.126201] ? __kasan_kmalloc+0x9/0x10 [ 30.126201] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 30.126201] kernel_init_freeable+0x596/0x652 [ 30.126201] ? console_on_rootfs+0x7d/0x7d [ 30.126201] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 30.126201] ? rest_init+0xf0/0xf0 [ 30.126201] kernel_init+0x16/0x1d0 [ 30.126201] ? rest_init+0xf0/0xf0 [ 30.126201] ret_from_fork+0x22/0x30 [ 30.126201] [ 30.126201] The buggy address belongs to the page: [ 30.126201] page:0000000091b8f2b4 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0xbe453 [ 30.126201] flags: 0xfffffc0000000() [ 30.126201] raw: 000fffffc0000000 ffffea0002fac708 ffffea0002fac748 0000000000000000 [ 30.126201] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 [ 30.126201] page dumped because: kasan: bad access detected [ 30.126201] page_owner tracks the page as freed [ 30.126201] page last allocated via order 0, migratetype Movable, gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts 27975563827 [ 30.126201] prep_new_page+0xfb/0x140 [ 30.126201] get_page_from_freelist+0x3503/0x5730 [ 30.126201] __alloc_pages_nodemask+0x2d8/0x650 [ 30.126201] alloc_pages_vma+0xe2/0x560 [ 30.126201] __handle_mm_fault+0x930/0x26c0 [ 30.126201] handle_mm_fault+0x1f9/0x810 [ 30.126201] do_user_addr_fault+0x6f7/0xca0 [ 30.126201] exc_page_fault+0xaf/0x1a0 [ 30.126201] asm_exc_page_fault+0x1e/0x30 [ 30.126201] page last free stack trace: [ 30.126201] free_pcp_prepare+0x122/0x290 [ 30.126201] free_unref_page_list+0xe6/0x490 [ 30.126201] release_pages+0x2ed/0x1270 [ 30.126201] free_pages_and_swap_cache+0x245/0x2e0 [ 30.126201] tlb_flush_mmu+0x11e/0x680 [ 30.126201] tlb_finish_mmu+0xa6/0x3e0 [ 30.126201] exit_mmap+0x2b3/0x540 [ 30.126201] mmput+0x11d/0x450 [ 30.126201] do_exit+0xaa6/0x2d40 [ 30.126201] do_group_exit+0x128/0x340 [ 30.126201] __x64_sys_exit_group+0x43/0x50 [ 30.126201] do_syscall_64+0x37/0x50 [ 30.126201] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.126201] [ 30.126201] Memory state around the buggy address: [ 30.126201] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.126201] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.126201] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.126201] ^ [ 30.126201] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.126201] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.126201] ================================================================== This is all I ran with: # git diff diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 64bb945..1be7481 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -80,6 +80,27 @@ static int __init find_ibft_in_mem(void) done: return len; } + +static void __init acpi_find_ibft_region(unsigned long *sizep) +{ + int i; + struct acpi_table_header *table = NULL; + acpi_status status; + + if (acpi_disabled) + return; + + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { + status = acpi_get_table(ibft_signs[i].sign, 0, &table); + if (ACPI_SUCCESS(status)) { + ibft_addr = (struct acpi_table_ibft *)table; + *sizep = PAGE_ALIGN(ibft_addr->header.length); + acpi_put_table(table); + break; + } + } +} + /* * Routine used to find the iSCSI Boot Format Table. The logical * kernel address is set in the ibft_addr global variable. @@ -91,14 +112,16 @@ unsigned long __init find_ibft_region(unsigned long *sizep) /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will * only use ACPI for this */ - if (!efi_enabled(EFI_BOOT)) + if (!efi_enabled(EFI_BOOT)) { find_ibft_in_mem(); - - if (ibft_addr) { *sizep = PAGE_ALIGN(ibft_addr->header.length); - return (u64)virt_to_phys(ibft_addr); + } else { + acpi_find_ibft_region(sizep); } + if (ibft_addr) + return (u64)virt_to_phys(ibft_addr); + *sizep = 0; return 0; } Thank you, George >> [ 0.075711] CPU: 0 PID: 0 Comm: swapper Not tainted 5.11.0-34a2105 #8 >> [ 0.076983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 0.0.0 02/06/2015 >> [ 0.078579] RIP: 0010:find_ibft_region+0x470/0x577
On Tue, Feb 23, 2021 at 04:16:44PM -0500, George Kennedy wrote: > > > On 2/23/2021 3:09 PM, Mike Rapoport wrote: > > On Tue, Feb 23, 2021 at 01:05:05PM -0500, George Kennedy wrote: > > > On 2/23/2021 10:47 AM, Mike Rapoport wrote: > > > > > > It now crashes here: > > > > > > [ 0.051019] ACPI: Early table checksum verification disabled > > > [ 0.056721] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > > > [ 0.057874] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP > > > 00000001 01000013) > > > [ 0.059590] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP > > > 00000001 BXPC 00000001) > > > [ 0.061306] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT > > > 00000001 BXPC 00000001) > > > [ 0.063006] ACPI: FACS 0x00000000BFBFD000 000040 > > > [ 0.063938] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC > > > 00000001 BXPC 00000001) > > > [ 0.065638] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET > > > 00000001 BXPC 00000001) > > > [ 0.067335] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 > > > 00000002 01000013) > > > [ 0.069030] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP > > > 00000000 00000000) > > > [ 0.070734] XXX acpi_find_ibft_region: > > > [ 0.071468] XXX iBFT, status=0 > > > [ 0.072073] XXX about to call acpi_put_table()... > > > ibft_addr=ffffffffff240000 > > > [ 0.073449] XXX acpi_find_ibft_region(EXIT): > > > PANIC: early exception 0x0e IP 10:ffffffff9259f439 error 0 cr2 > > > 0xffffffffff240004 > > Right, I've missed the dereference of the ibft_addr after > > acpi_find_ibft_region(). > > > > With this change to iscsi_ibft_find.c instead of the previous one it should > > be better: > > > > diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c > > index 64bb94523281..1be7481d5c69 100644 > > --- a/drivers/firmware/iscsi_ibft_find.c > > +++ b/drivers/firmware/iscsi_ibft_find.c > > @@ -80,6 +80,27 @@ static int __init find_ibft_in_mem(void) > > done: > > return len; > > } > > + > > +static void __init acpi_find_ibft_region(unsigned long *sizep) > > +{ > > + int i; > > + struct acpi_table_header *table = NULL; > > + acpi_status status; > > + > > + if (acpi_disabled) > > + return; > > + > > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > > + if (ACPI_SUCCESS(status)) { > > + ibft_addr = (struct acpi_table_ibft *)table; > > + *sizep = PAGE_ALIGN(ibft_addr->header.length); > > + acpi_put_table(table); > > + break; > > + } > > + } > > +} > > + > > /* > > * Routine used to find the iSCSI Boot Format Table. The logical > > * kernel address is set in the ibft_addr global variable. > > @@ -91,14 +112,16 @@ unsigned long __init find_ibft_region(unsigned long *sizep) > > /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > > * only use ACPI for this */ > > - if (!efi_enabled(EFI_BOOT)) > > + if (!efi_enabled(EFI_BOOT)) { > > find_ibft_in_mem(); > > - > > - if (ibft_addr) { > > *sizep = PAGE_ALIGN(ibft_addr->header.length); > > - return (u64)virt_to_phys(ibft_addr); > > + } else { > > + acpi_find_ibft_region(sizep); > > } > > + if (ibft_addr) > > + return (u64)virt_to_phys(ibft_addr); > > + > > *sizep = 0; > > return 0; > > } > Mike, > > No luck. Back to the original KASAN ibft_init crash. > > I ran with only the above patch from you. Was that what you wanted? Your > previous patch had a section defined out by #if 0. Was that supposed to be > in there as well? Sorry, I wasn't clear, but I meant to use the first patch and only replace changes to iscsi_ibft_find.c with the new patch. Here's the full patch to be sure we're on the same page: diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7bdc0239a943..c118dd54a747 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) if (acpi_disabled) return; +#if 0 /* * Initialize the ACPI boot-time table parser. */ @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) disable_acpi(); return; } +#endif acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index d883176ef2ce..c8a07a7b9577 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1032,6 +1032,14 @@ void __init setup_arch(char **cmdline_p) */ find_smp_config(); + /* + * Initialize the ACPI boot-time table parser. + */ + if (acpi_table_init()) { + disable_acpi(); + return; + } + reserve_ibft_region(); early_alloc_pgt_buf(); diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 64bb94523281..1be7481d5c69 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -80,6 +80,27 @@ static int __init find_ibft_in_mem(void) done: return len; } + +static void __init acpi_find_ibft_region(unsigned long *sizep) +{ + int i; + struct acpi_table_header *table = NULL; + acpi_status status; + + if (acpi_disabled) + return; + + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { + status = acpi_get_table(ibft_signs[i].sign, 0, &table); + if (ACPI_SUCCESS(status)) { + ibft_addr = (struct acpi_table_ibft *)table; + *sizep = PAGE_ALIGN(ibft_addr->header.length); + acpi_put_table(table); + break; + } + } +} + /* * Routine used to find the iSCSI Boot Format Table. The logical * kernel address is set in the ibft_addr global variable. @@ -91,14 +112,16 @@ unsigned long __init find_ibft_region(unsigned long *sizep) /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will * only use ACPI for this */ - if (!efi_enabled(EFI_BOOT)) + if (!efi_enabled(EFI_BOOT)) { find_ibft_in_mem(); - - if (ibft_addr) { *sizep = PAGE_ALIGN(ibft_addr->header.length); - return (u64)virt_to_phys(ibft_addr); + } else { + acpi_find_ibft_region(sizep); } + if (ibft_addr) + return (u64)virt_to_phys(ibft_addr); + *sizep = 0; return 0; }
On 2/23/2021 4:32 PM, Mike Rapoport wrote: > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c > index 7bdc0239a943..c118dd54a747 100644 > --- a/arch/x86/kernel/acpi/boot.c > +++ b/arch/x86/kernel/acpi/boot.c > @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) > if (acpi_disabled) > return; > > +#if 0 > /* > * Initialize the ACPI boot-time table parser. > */ > @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) > disable_acpi(); > return; > } > +#endif > > acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index d883176ef2ce..c8a07a7b9577 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -1032,6 +1032,14 @@ void __init setup_arch(char **cmdline_p) > */ > find_smp_config(); > > + /* > + * Initialize the ACPI boot-time table parser. > + */ > + if (acpi_table_init()) { > + disable_acpi(); > + return; > + } > + > reserve_ibft_region(); > > early_alloc_pgt_buf(); > diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c > index 64bb94523281..1be7481d5c69 100644 > --- a/drivers/firmware/iscsi_ibft_find.c > +++ b/drivers/firmware/iscsi_ibft_find.c > @@ -80,6 +80,27 @@ static int __init find_ibft_in_mem(void) > done: > return len; > } > + > +static void __init acpi_find_ibft_region(unsigned long *sizep) > +{ > + int i; > + struct acpi_table_header *table = NULL; > + acpi_status status; > + > + if (acpi_disabled) > + return; > + > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > + if (ACPI_SUCCESS(status)) { > + ibft_addr = (struct acpi_table_ibft *)table; > + *sizep = PAGE_ALIGN(ibft_addr->header.length); > + acpi_put_table(table); > + break; > + } > + } > +} > + > /* > * Routine used to find the iSCSI Boot Format Table. The logical > * kernel address is set in the ibft_addr global variable. > @@ -91,14 +112,16 @@ unsigned long __init find_ibft_region(unsigned long *sizep) > /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > * only use ACPI for this */ > > - if (!efi_enabled(EFI_BOOT)) > + if (!efi_enabled(EFI_BOOT)) { > find_ibft_in_mem(); > - > - if (ibft_addr) { > *sizep = PAGE_ALIGN(ibft_addr->header.length); > - return (u64)virt_to_phys(ibft_addr); > + } else { > + acpi_find_ibft_region(sizep); > } > > + if (ibft_addr) > + return (u64)virt_to_phys(ibft_addr); > + > *sizep = 0; > return 0; > } > Mike, Still no luck. [ 30.193723] iscsi: registered transport (iser) [ 30.195970] iBFT detected. [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 [ 30.196824] #PF: supervisor read access in kernel mode [ 30.196824] #PF: error_code(0x0000) - not-present page [ 30.196824] PGD 24e34067 P4D 24e34067 PUD 24e36067 PMD 27a0e067 PTE 0 [ 30.196824] Oops: 0000 [#1] SMP KASAN PTI [ 30.196824] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #10 [ 30.196824] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 30.196824] RIP: 0010:ibft_init+0x13d/0xc33 [ 30.196824] Code: c1 40 84 ce 75 11 83 e0 07 38 c2 0f 9e c1 84 d2 0f 95 c0 84 c1 74 0a be 04 00 00 00 e8 77 f2 5f ef 49 8d 7f 08 b8 ff ff 37 00 <4d> 63 6f 04 48 89 fa 48 c1 e0 2a 48 c1 ea 03 8a 04 02 48 89 fa 83 [ 30.196824] RSP: 0000:ffff888100fafc30 EFLAGS: 00010246 [ 30.196824] RAX: 000000000037ffff RBX: ffffffff937c6fc0 RCX: ffffffff815fcf01 [ 30.196824] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffffffffff240008 [ 30.196824] RBP: ffff888100fafcf8 R08: ffffed10201f5f12 R09: ffffed10201f5f12 [ 30.196824] R10: ffff888100faf88f R11: ffffed10201f5f11 R12: dffffc0000000000 [ 30.196824] R13: ffff888100fafdc0 R14: ffff888100fafcd0 R15: ffffffffff240000 [ 30.196824] FS: 0000000000000000(0000) GS:ffff88810ad80000(0000) knlGS:0000000000000000 [ 30.196824] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 30.196824] CR2: ffffffffff240004 CR3: 0000000024e30000 CR4: 00000000000006e0 [ 30.196824] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 30.196824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 30.196824] Call Trace: [ 30.196824] ? write_comp_data+0x2f/0x90 [ 30.196824] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 30.196824] ? ibft_check_initiator_for+0x159/0x159 [ 30.196824] ? dmi_setup+0x46c/0x46c [ 30.196824] ? write_comp_data+0x2f/0x90 [ 30.196824] ? ibft_check_initiator_for+0x159/0x159 [ 30.196824] do_one_initcall+0xc4/0x3e0 [ 30.196824] ? perf_trace_initcall_level+0x3e0/0x3e0 [ 30.196824] ? asm_sysvec_error_interrupt+0x10/0x20 [ 30.196824] ? do_one_initcall+0x18c/0x3e0 [ 30.196824] kernel_init_freeable+0x596/0x652 [ 30.196824] ? console_on_rootfs+0x7d/0x7d [ 30.196824] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 30.196824] ? rest_init+0xf0/0xf0 [ 30.196824] kernel_init+0x16/0x1d0 [ 30.196824] ? rest_init+0xf0/0xf0 [ 30.196824] ret_from_fork+0x22/0x30 [ 30.196824] Modules linked in: [ 30.196824] Dumping ftrace buffer: [ 30.196824] (ftrace buffer empty) [ 30.196824] CR2: ffffffffff240004 [ 30.196824] ---[ end trace 293eae51adac1398 ]--- [ 30.196824] RIP: 0010:ibft_init+0x13d/0xc33 [ 30.196824] Code: c1 40 84 ce 75 11 83 e0 07 38 c2 0f 9e c1 84 d2 0f 95 c0 84 c1 74 0a be 04 00 00 00 e8 77 f2 5f ef 49 8d 7f 08 b8 ff ff 37 00 <4d> 63 6f 04 48 89 fa 48 c1 e0 2a 48 c1 ea 03 8a 04 02 48 89 fa 83 [ 30.196824] RSP: 0000:ffff888100fafc30 EFLAGS: 00010246 [ 30.196824] RAX: 000000000037ffff RBX: ffffffff937c6fc0 RCX: ffffffff815fcf01 [ 30.196824] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffffffffff240008 [ 30.196824] RBP: ffff888100fafcf8 R08: ffffed10201f5f12 R09: ffffed10201f5f12 [ 30.196824] R10: ffff888100faf88f R11: ffffed10201f5f11 R12: dffffc0000000000 [ 30.196824] R13: ffff888100fafdc0 R14: ffff888100fafcd0 R15: ffffffffff240000 [ 30.196824] FS: 0000000000000000(0000) GS:ffff88810ad80000(0000) knlGS:0000000000000000 [ 30.196824] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 30.196824] CR2: ffffffffff240004 CR3: 0000000024e30000 CR4: 00000000000006e0 [ 30.196824] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 30.196824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 30.196824] Kernel panic - not syncing: Fatal exception [ 30.196824] Dumping ftrace buffer: [ 30.196824] (ftrace buffer empty) [ 30.196824] Kernel Offset: disabled [ 30.196824] Rebooting in 1 seconds.. George
On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: > > Mike, > > Still no luck. > > [ 30.193723] iscsi: registered transport (iser) > [ 30.195970] iBFT detected. > [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 Hmm, we cannot set ibft_addr to early pointer to the ACPI table. Let's try something more disruptive and move the reservation back to iscsi_ibft_find.c. diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7bdc0239a943..c118dd54a747 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) if (acpi_disabled) return; +#if 0 /* * Initialize the ACPI boot-time table parser. */ @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) disable_acpi(); return; } +#endif acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index d883176ef2ce..c615ce96c9a2 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) } -static __init void reserve_ibft_region(void) -{ - unsigned long addr, size = 0; - - addr = find_ibft_region(&size); - - if (size) - memblock_reserve(addr, size); -} - static bool __init snb_gfx_workaround_needed(void) { #ifdef CONFIG_PCI @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) */ find_smp_config(); + /* + * Initialize the ACPI boot-time table parser. + */ + if (acpi_table_init()) + disable_acpi(); + reserve_ibft_region(); early_alloc_pgt_buf(); diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 64bb94523281..01be513843d6 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -47,7 +47,25 @@ static const struct { #define VGA_MEM 0xA0000 /* VGA buffer */ #define VGA_SIZE 0x20000 /* 128kB */ -static int __init find_ibft_in_mem(void) +static void __init *acpi_find_ibft_region(void) +{ + int i; + struct acpi_table_header *table = NULL; + acpi_status status; + + if (acpi_disabled) + return NULL; + + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { + status = acpi_get_table(ibft_signs[i].sign, 0, &table); + if (ACPI_SUCCESS(status)) + return table; + } + + return NULL; +} + +static void __init *find_ibft_in_mem(void) { unsigned long pos; unsigned int len = 0; @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) /* if the length of the table extends past 1M, * the table cannot be valid. */ if (pos + len <= (IBFT_END-1)) { - ibft_addr = (struct acpi_table_ibft *)virt; pr_info("iBFT found at 0x%lx.\n", pos); - goto done; + return virt; } } } } -done: - return len; + + return NULL; } + +static void __init *find_ibft(void) +{ + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will + * only use ACPI for this */ + if (!efi_enabled(EFI_BOOT)) + return find_ibft_in_mem(); + else + return acpi_find_ibft_region(); +} + /* * Routine used to find the iSCSI Boot Format Table. The logical * kernel address is set in the ibft_addr global variable. */ -unsigned long __init find_ibft_region(unsigned long *sizep) +void __init reserve_ibft_region(void) { - ibft_addr = NULL; + struct acpi_table_ibft *table; + unsigned long size; - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will - * only use ACPI for this */ + table = find_ibft(); + if (!table) + return; - if (!efi_enabled(EFI_BOOT)) - find_ibft_in_mem(); - - if (ibft_addr) { - *sizep = PAGE_ALIGN(ibft_addr->header.length); - return (u64)virt_to_phys(ibft_addr); - } + size = PAGE_ALIGN(table->header.length); + memblock_reserve(virt_to_phys(table), size); - *sizep = 0; - return 0; + if (efi_enabled(EFI_BOOT)) + acpi_put_table(&table->header); + else + ibft_addr = table; } diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h index b7b45ca82bea..da813c891990 100644 --- a/include/linux/iscsi_ibft.h +++ b/include/linux/iscsi_ibft.h @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; * mapped address is set in the ibft_addr variable. */ #ifdef CONFIG_ISCSI_IBFT_FIND -unsigned long find_ibft_region(unsigned long *sizep); +void reserve_ibft_region(void); #else -static inline unsigned long find_ibft_region(unsigned long *sizep) -{ - *sizep = 0; - return 0; -} +static inline void reserve_ibft_region(void) {} #endif #endif /* ISCSI_IBFT_H */
On 2/24/2021 5:37 AM, Mike Rapoport wrote: > On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: >> Mike, >> >> Still no luck. >> >> [ 30.193723] iscsi: registered transport (iser) >> [ 30.195970] iBFT detected. >> [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 > Hmm, we cannot set ibft_addr to early pointer to the ACPI table. > Let's try something more disruptive and move the reservation back to > iscsi_ibft_find.c. > > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c > index 7bdc0239a943..c118dd54a747 100644 > --- a/arch/x86/kernel/acpi/boot.c > +++ b/arch/x86/kernel/acpi/boot.c > @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) > if (acpi_disabled) > return; > > +#if 0 > /* > * Initialize the ACPI boot-time table parser. > */ > @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) > disable_acpi(); > return; > } > +#endif > > acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index d883176ef2ce..c615ce96c9a2 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) > > } > > -static __init void reserve_ibft_region(void) > -{ > - unsigned long addr, size = 0; > - > - addr = find_ibft_region(&size); > - > - if (size) > - memblock_reserve(addr, size); > -} > - > static bool __init snb_gfx_workaround_needed(void) > { > #ifdef CONFIG_PCI > @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) > */ > find_smp_config(); > > + /* > + * Initialize the ACPI boot-time table parser. > + */ > + if (acpi_table_init()) > + disable_acpi(); > + > reserve_ibft_region(); > > early_alloc_pgt_buf(); > diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c > index 64bb94523281..01be513843d6 100644 > --- a/drivers/firmware/iscsi_ibft_find.c > +++ b/drivers/firmware/iscsi_ibft_find.c > @@ -47,7 +47,25 @@ static const struct { > #define VGA_MEM 0xA0000 /* VGA buffer */ > #define VGA_SIZE 0x20000 /* 128kB */ > > -static int __init find_ibft_in_mem(void) > +static void __init *acpi_find_ibft_region(void) > +{ > + int i; > + struct acpi_table_header *table = NULL; > + acpi_status status; > + > + if (acpi_disabled) > + return NULL; > + > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > + if (ACPI_SUCCESS(status)) > + return table; > + } > + > + return NULL; > +} > + > +static void __init *find_ibft_in_mem(void) > { > unsigned long pos; > unsigned int len = 0; > @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) > /* if the length of the table extends past 1M, > * the table cannot be valid. */ > if (pos + len <= (IBFT_END-1)) { > - ibft_addr = (struct acpi_table_ibft *)virt; > pr_info("iBFT found at 0x%lx.\n", pos); > - goto done; > + return virt; > } > } > } > } > -done: > - return len; > + > + return NULL; > } > + > +static void __init *find_ibft(void) > +{ > + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > + * only use ACPI for this */ > + if (!efi_enabled(EFI_BOOT)) > + return find_ibft_in_mem(); > + else > + return acpi_find_ibft_region(); > +} > + > /* > * Routine used to find the iSCSI Boot Format Table. The logical > * kernel address is set in the ibft_addr global variable. > */ > -unsigned long __init find_ibft_region(unsigned long *sizep) > +void __init reserve_ibft_region(void) > { > - ibft_addr = NULL; > + struct acpi_table_ibft *table; > + unsigned long size; > > - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > - * only use ACPI for this */ > + table = find_ibft(); > + if (!table) > + return; > > - if (!efi_enabled(EFI_BOOT)) > - find_ibft_in_mem(); > - > - if (ibft_addr) { > - *sizep = PAGE_ALIGN(ibft_addr->header.length); > - return (u64)virt_to_phys(ibft_addr); > - } > + size = PAGE_ALIGN(table->header.length); > + memblock_reserve(virt_to_phys(table), size); > > - *sizep = 0; > - return 0; > + if (efi_enabled(EFI_BOOT)) > + acpi_put_table(&table->header); > + else > + ibft_addr = table; > } > diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h > index b7b45ca82bea..da813c891990 100644 > --- a/include/linux/iscsi_ibft.h > +++ b/include/linux/iscsi_ibft.h > @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; > * mapped address is set in the ibft_addr variable. > */ > #ifdef CONFIG_ISCSI_IBFT_FIND > -unsigned long find_ibft_region(unsigned long *sizep); > +void reserve_ibft_region(void); > #else > -static inline unsigned long find_ibft_region(unsigned long *sizep) > -{ > - *sizep = 0; > - return 0; > -} > +static inline void reserve_ibft_region(void) {} > #endif > > #endif /* ISCSI_IBFT_H */ Still no luck Mike, We're back to the original problem where the only thing that worked was to run "SetPageReserved(page)" before calling "kmap(page)". The page is being "freed" before ibft_init() is called as a result of the recent buddy page freeing changes. [ 30.385207] iscsi: registered transport (iser) [ 30.387462] iBFT detected. [ 30.388042] ================================================================== [ 30.388119] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 [ 30.388119] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 [ 30.388119] [ 30.388119] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #11 [ 30.388119] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 30.388119] Call Trace: [ 30.388119] dump_stack+0xdb/0x120 [ 30.388119] ? ibft_init+0x134/0xc33 [ 30.388119] print_address_description.constprop.7+0x41/0x60 [ 30.388119] ? ibft_init+0x134/0xc33 [ 30.388119] ? ibft_init+0x134/0xc33 [ 30.388119] kasan_report.cold.10+0x78/0xd1 [ 30.388119] ? ibft_init+0x134/0xc33 [ 30.388119] __asan_report_load_n_noabort+0xf/0x20 [ 30.388119] ibft_init+0x134/0xc33 [ 30.388119] ? write_comp_data+0x2f/0x90 [ 30.388119] ? ibft_check_initiator_for+0x159/0x159 [ 30.388119] ? write_comp_data+0x2f/0x90 [ 30.388119] ? ibft_check_initiator_for+0x159/0x159 [ 30.388119] do_one_initcall+0xc4/0x3e0 [ 30.388119] ? perf_trace_initcall_level+0x3e0/0x3e0 [ 30.388119] ? unpoison_range+0x14/0x40 [ 30.388119] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 [ 30.388119] ? kernel_init_freeable+0x420/0x652 George >
Hi George, > On 2/24/2021 5:37 AM, Mike Rapoport wrote: > > On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: > > > Mike, > > > > > > Still no luck. > > > > > > [ 30.193723] iscsi: registered transport (iser) > > > [ 30.195970] iBFT detected. > > > [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 > > Hmm, we cannot set ibft_addr to early pointer to the ACPI table. > > Let's try something more disruptive and move the reservation back to > > iscsi_ibft_find.c. > > > > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c > > index 7bdc0239a943..c118dd54a747 100644 > > --- a/arch/x86/kernel/acpi/boot.c > > +++ b/arch/x86/kernel/acpi/boot.c > > @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) > > if (acpi_disabled) > > return; > > +#if 0 > > /* > > * Initialize the ACPI boot-time table parser. > > */ > > @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) > > disable_acpi(); > > return; > > } > > +#endif > > acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > > index d883176ef2ce..c615ce96c9a2 100644 > > --- a/arch/x86/kernel/setup.c > > +++ b/arch/x86/kernel/setup.c > > @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) > > } > > -static __init void reserve_ibft_region(void) > > -{ > > - unsigned long addr, size = 0; > > - > > - addr = find_ibft_region(&size); > > - > > - if (size) > > - memblock_reserve(addr, size); > > -} > > - > > static bool __init snb_gfx_workaround_needed(void) > > { > > #ifdef CONFIG_PCI > > @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) > > */ > > find_smp_config(); > > + /* > > + * Initialize the ACPI boot-time table parser. > > + */ > > + if (acpi_table_init()) > > + disable_acpi(); > > + > > reserve_ibft_region(); > > early_alloc_pgt_buf(); > > diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c > > index 64bb94523281..01be513843d6 100644 > > --- a/drivers/firmware/iscsi_ibft_find.c > > +++ b/drivers/firmware/iscsi_ibft_find.c > > @@ -47,7 +47,25 @@ static const struct { > > #define VGA_MEM 0xA0000 /* VGA buffer */ > > #define VGA_SIZE 0x20000 /* 128kB */ > > -static int __init find_ibft_in_mem(void) > > +static void __init *acpi_find_ibft_region(void) > > +{ > > + int i; > > + struct acpi_table_header *table = NULL; > > + acpi_status status; > > + > > + if (acpi_disabled) > > + return NULL; > > + > > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > > + if (ACPI_SUCCESS(status)) > > + return table; > > + } > > + > > + return NULL; > > +} > > + > > +static void __init *find_ibft_in_mem(void) > > { > > unsigned long pos; > > unsigned int len = 0; > > @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) > > /* if the length of the table extends past 1M, > > * the table cannot be valid. */ > > if (pos + len <= (IBFT_END-1)) { > > - ibft_addr = (struct acpi_table_ibft *)virt; > > pr_info("iBFT found at 0x%lx.\n", pos); > > - goto done; > > + return virt; > > } > > } > > } > > } > > -done: > > - return len; > > + > > + return NULL; > > } > > + > > +static void __init *find_ibft(void) > > +{ > > + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > > + * only use ACPI for this */ > > + if (!efi_enabled(EFI_BOOT)) > > + return find_ibft_in_mem(); > > + else > > + return acpi_find_ibft_region(); > > +} > > + > > /* > > * Routine used to find the iSCSI Boot Format Table. The logical > > * kernel address is set in the ibft_addr global variable. > > */ > > -unsigned long __init find_ibft_region(unsigned long *sizep) > > +void __init reserve_ibft_region(void) > > { > > - ibft_addr = NULL; > > + struct acpi_table_ibft *table; > > + unsigned long size; > > - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > > - * only use ACPI for this */ > > + table = find_ibft(); > > + if (!table) > > + return; > > - if (!efi_enabled(EFI_BOOT)) > > - find_ibft_in_mem(); > > - > > - if (ibft_addr) { > > - *sizep = PAGE_ALIGN(ibft_addr->header.length); > > - return (u64)virt_to_phys(ibft_addr); > > - } > > + size = PAGE_ALIGN(table->header.length); > > + memblock_reserve(virt_to_phys(table), size); > > - *sizep = 0; > > - return 0; > > + if (efi_enabled(EFI_BOOT)) > > + acpi_put_table(&table->header); > > + else > > + ibft_addr = table; > > } > > diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h > > index b7b45ca82bea..da813c891990 100644 > > --- a/include/linux/iscsi_ibft.h > > +++ b/include/linux/iscsi_ibft.h > > @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; > > * mapped address is set in the ibft_addr variable. > > */ > > #ifdef CONFIG_ISCSI_IBFT_FIND > > -unsigned long find_ibft_region(unsigned long *sizep); > > +void reserve_ibft_region(void); > > #else > > -static inline unsigned long find_ibft_region(unsigned long *sizep) > > -{ > > - *sizep = 0; > > - return 0; > > -} > > +static inline void reserve_ibft_region(void) {} > > #endif > > #endif /* ISCSI_IBFT_H */ > > Still no luck Mike, > > We're back to the original problem where the only thing that worked was to > run "SetPageReserved(page)" before calling "kmap(page)". The page is being > "freed" before ibft_init() is called as a result of the recent buddy page > freeing changes. I keep missing some little details each time :( Ok, let's try from the different angle. diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c index 4b9b329a5a92..ec43e1447336 100644 --- a/drivers/acpi/acpica/tbutils.c +++ b/drivers/acpi/acpica/tbutils.c @@ -7,6 +7,8 @@ * *****************************************************************************/ +#include <linux/memblock.h> + #include <acpi/acpi.h> #include "accommon.h" #include "actables.h" @@ -339,6 +341,21 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address) acpi_tb_parse_fadt(); } + if (ACPI_SUCCESS(status) && + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. + tables[table_index].signature, + ACPI_SIG_IBFT)) { + struct acpi_table_header *ibft; + struct acpi_table_desc *desc; + + desc = &acpi_gbl_root_table_list.tables[table_index]; + status = acpi_tb_get_table(desc, &ibft); + if (ACPI_SUCCESS(status)) { + memblock_reserve(address, ibft->length); + acpi_tb_put_table(desc); + } + } + next_table: table_entry += table_entry_size;
On 2/25/2021 3:53 AM, Mike Rapoport wrote: > Hi George, > >> On 2/24/2021 5:37 AM, Mike Rapoport wrote: >>> On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: >>>> Mike, >>>> >>>> Still no luck. >>>> >>>> [ 30.193723] iscsi: registered transport (iser) >>>> [ 30.195970] iBFT detected. >>>> [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 >>> Hmm, we cannot set ibft_addr to early pointer to the ACPI table. >>> Let's try something more disruptive and move the reservation back to >>> iscsi_ibft_find.c. >>> >>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c >>> index 7bdc0239a943..c118dd54a747 100644 >>> --- a/arch/x86/kernel/acpi/boot.c >>> +++ b/arch/x86/kernel/acpi/boot.c >>> @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) >>> if (acpi_disabled) >>> return; >>> +#if 0 >>> /* >>> * Initialize the ACPI boot-time table parser. >>> */ >>> @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) >>> disable_acpi(); >>> return; >>> } >>> +#endif >>> acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); >>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c >>> index d883176ef2ce..c615ce96c9a2 100644 >>> --- a/arch/x86/kernel/setup.c >>> +++ b/arch/x86/kernel/setup.c >>> @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) >>> } >>> -static __init void reserve_ibft_region(void) >>> -{ >>> - unsigned long addr, size = 0; >>> - >>> - addr = find_ibft_region(&size); >>> - >>> - if (size) >>> - memblock_reserve(addr, size); >>> -} >>> - >>> static bool __init snb_gfx_workaround_needed(void) >>> { >>> #ifdef CONFIG_PCI >>> @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) >>> */ >>> find_smp_config(); >>> + /* >>> + * Initialize the ACPI boot-time table parser. >>> + */ >>> + if (acpi_table_init()) >>> + disable_acpi(); >>> + >>> reserve_ibft_region(); >>> early_alloc_pgt_buf(); >>> diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c >>> index 64bb94523281..01be513843d6 100644 >>> --- a/drivers/firmware/iscsi_ibft_find.c >>> +++ b/drivers/firmware/iscsi_ibft_find.c >>> @@ -47,7 +47,25 @@ static const struct { >>> #define VGA_MEM 0xA0000 /* VGA buffer */ >>> #define VGA_SIZE 0x20000 /* 128kB */ >>> -static int __init find_ibft_in_mem(void) >>> +static void __init *acpi_find_ibft_region(void) >>> +{ >>> + int i; >>> + struct acpi_table_header *table = NULL; >>> + acpi_status status; >>> + >>> + if (acpi_disabled) >>> + return NULL; >>> + >>> + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { >>> + status = acpi_get_table(ibft_signs[i].sign, 0, &table); >>> + if (ACPI_SUCCESS(status)) >>> + return table; >>> + } >>> + >>> + return NULL; >>> +} >>> + >>> +static void __init *find_ibft_in_mem(void) >>> { >>> unsigned long pos; >>> unsigned int len = 0; >>> @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) >>> /* if the length of the table extends past 1M, >>> * the table cannot be valid. */ >>> if (pos + len <= (IBFT_END-1)) { >>> - ibft_addr = (struct acpi_table_ibft *)virt; >>> pr_info("iBFT found at 0x%lx.\n", pos); >>> - goto done; >>> + return virt; >>> } >>> } >>> } >>> } >>> -done: >>> - return len; >>> + >>> + return NULL; >>> } >>> + >>> +static void __init *find_ibft(void) >>> +{ >>> + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will >>> + * only use ACPI for this */ >>> + if (!efi_enabled(EFI_BOOT)) >>> + return find_ibft_in_mem(); >>> + else >>> + return acpi_find_ibft_region(); >>> +} >>> + >>> /* >>> * Routine used to find the iSCSI Boot Format Table. The logical >>> * kernel address is set in the ibft_addr global variable. >>> */ >>> -unsigned long __init find_ibft_region(unsigned long *sizep) >>> +void __init reserve_ibft_region(void) >>> { >>> - ibft_addr = NULL; >>> + struct acpi_table_ibft *table; >>> + unsigned long size; >>> - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will >>> - * only use ACPI for this */ >>> + table = find_ibft(); >>> + if (!table) >>> + return; >>> - if (!efi_enabled(EFI_BOOT)) >>> - find_ibft_in_mem(); >>> - >>> - if (ibft_addr) { >>> - *sizep = PAGE_ALIGN(ibft_addr->header.length); >>> - return (u64)virt_to_phys(ibft_addr); >>> - } >>> + size = PAGE_ALIGN(table->header.length); >>> + memblock_reserve(virt_to_phys(table), size); >>> - *sizep = 0; >>> - return 0; >>> + if (efi_enabled(EFI_BOOT)) >>> + acpi_put_table(&table->header); >>> + else >>> + ibft_addr = table; >>> } >>> diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h >>> index b7b45ca82bea..da813c891990 100644 >>> --- a/include/linux/iscsi_ibft.h >>> +++ b/include/linux/iscsi_ibft.h >>> @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; >>> * mapped address is set in the ibft_addr variable. >>> */ >>> #ifdef CONFIG_ISCSI_IBFT_FIND >>> -unsigned long find_ibft_region(unsigned long *sizep); >>> +void reserve_ibft_region(void); >>> #else >>> -static inline unsigned long find_ibft_region(unsigned long *sizep) >>> -{ >>> - *sizep = 0; >>> - return 0; >>> -} >>> +static inline void reserve_ibft_region(void) {} >>> #endif >>> #endif /* ISCSI_IBFT_H */ >> Still no luck Mike, >> >> We're back to the original problem where the only thing that worked was to >> run "SetPageReserved(page)" before calling "kmap(page)". The page is being >> "freed" before ibft_init() is called as a result of the recent buddy page >> freeing changes. > I keep missing some little details each time :( No worries. Thanks for all your help. Does this patch go on top of your previous patch or is it standalone? George > Ok, let's try from the different angle. > > diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c > index 4b9b329a5a92..ec43e1447336 100644 > --- a/drivers/acpi/acpica/tbutils.c > +++ b/drivers/acpi/acpica/tbutils.c > @@ -7,6 +7,8 @@ > * > *****************************************************************************/ > > +#include <linux/memblock.h> > + > #include <acpi/acpi.h> > #include "accommon.h" > #include "actables.h" > @@ -339,6 +341,21 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address) > acpi_tb_parse_fadt(); > } > > + if (ACPI_SUCCESS(status) && > + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. > + tables[table_index].signature, > + ACPI_SIG_IBFT)) { > + struct acpi_table_header *ibft; > + struct acpi_table_desc *desc; > + > + desc = &acpi_gbl_root_table_list.tables[table_index]; > + status = acpi_tb_get_table(desc, &ibft); > + if (ACPI_SUCCESS(status)) { > + memblock_reserve(address, ibft->length); > + acpi_tb_put_table(desc); > + } > + } > + > next_table: > > table_entry += table_entry_size; > > >
On Thu, Feb 25, 2021 at 07:38:19AM -0500, George Kennedy wrote: > On 2/25/2021 3:53 AM, Mike Rapoport wrote: > > Hi George, > > > > > On 2/24/2021 5:37 AM, Mike Rapoport wrote: > > > > On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: > > > > > Mike, > > > > > > > > > > Still no luck. > > > > > > > > > > [ 30.193723] iscsi: registered transport (iser) > > > > > [ 30.195970] iBFT detected. > > > > > [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 > > > > Hmm, we cannot set ibft_addr to early pointer to the ACPI table. > > > > Let's try something more disruptive and move the reservation back to > > > > iscsi_ibft_find.c. > > > > > > > > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c > > > > index 7bdc0239a943..c118dd54a747 100644 > > > > --- a/arch/x86/kernel/acpi/boot.c > > > > +++ b/arch/x86/kernel/acpi/boot.c > > > > @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) > > > > if (acpi_disabled) > > > > return; > > > > +#if 0 > > > > /* > > > > * Initialize the ACPI boot-time table parser. > > > > */ > > > > @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) > > > > disable_acpi(); > > > > return; > > > > } > > > > +#endif > > > > acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); > > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > > > > index d883176ef2ce..c615ce96c9a2 100644 > > > > --- a/arch/x86/kernel/setup.c > > > > +++ b/arch/x86/kernel/setup.c > > > > @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) > > > > } > > > > -static __init void reserve_ibft_region(void) > > > > -{ > > > > - unsigned long addr, size = 0; > > > > - > > > > - addr = find_ibft_region(&size); > > > > - > > > > - if (size) > > > > - memblock_reserve(addr, size); > > > > -} > > > > - > > > > static bool __init snb_gfx_workaround_needed(void) > > > > { > > > > #ifdef CONFIG_PCI > > > > @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) > > > > */ > > > > find_smp_config(); > > > > + /* > > > > + * Initialize the ACPI boot-time table parser. > > > > + */ > > > > + if (acpi_table_init()) > > > > + disable_acpi(); > > > > + > > > > reserve_ibft_region(); > > > > early_alloc_pgt_buf(); > > > > diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c > > > > index 64bb94523281..01be513843d6 100644 > > > > --- a/drivers/firmware/iscsi_ibft_find.c > > > > +++ b/drivers/firmware/iscsi_ibft_find.c > > > > @@ -47,7 +47,25 @@ static const struct { > > > > #define VGA_MEM 0xA0000 /* VGA buffer */ > > > > #define VGA_SIZE 0x20000 /* 128kB */ > > > > -static int __init find_ibft_in_mem(void) > > > > +static void __init *acpi_find_ibft_region(void) > > > > +{ > > > > + int i; > > > > + struct acpi_table_header *table = NULL; > > > > + acpi_status status; > > > > + > > > > + if (acpi_disabled) > > > > + return NULL; > > > > + > > > > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > > > > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > > > > + if (ACPI_SUCCESS(status)) > > > > + return table; > > > > + } > > > > + > > > > + return NULL; > > > > +} > > > > + > > > > +static void __init *find_ibft_in_mem(void) > > > > { > > > > unsigned long pos; > > > > unsigned int len = 0; > > > > @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) > > > > /* if the length of the table extends past 1M, > > > > * the table cannot be valid. */ > > > > if (pos + len <= (IBFT_END-1)) { > > > > - ibft_addr = (struct acpi_table_ibft *)virt; > > > > pr_info("iBFT found at 0x%lx.\n", pos); > > > > - goto done; > > > > + return virt; > > > > } > > > > } > > > > } > > > > } > > > > -done: > > > > - return len; > > > > + > > > > + return NULL; > > > > } > > > > + > > > > +static void __init *find_ibft(void) > > > > +{ > > > > + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > > > > + * only use ACPI for this */ > > > > + if (!efi_enabled(EFI_BOOT)) > > > > + return find_ibft_in_mem(); > > > > + else > > > > + return acpi_find_ibft_region(); > > > > +} > > > > + > > > > /* > > > > * Routine used to find the iSCSI Boot Format Table. The logical > > > > * kernel address is set in the ibft_addr global variable. > > > > */ > > > > -unsigned long __init find_ibft_region(unsigned long *sizep) > > > > +void __init reserve_ibft_region(void) > > > > { > > > > - ibft_addr = NULL; > > > > + struct acpi_table_ibft *table; > > > > + unsigned long size; > > > > - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > > > > - * only use ACPI for this */ > > > > + table = find_ibft(); > > > > + if (!table) > > > > + return; > > > > - if (!efi_enabled(EFI_BOOT)) > > > > - find_ibft_in_mem(); > > > > - > > > > - if (ibft_addr) { > > > > - *sizep = PAGE_ALIGN(ibft_addr->header.length); > > > > - return (u64)virt_to_phys(ibft_addr); > > > > - } > > > > + size = PAGE_ALIGN(table->header.length); > > > > + memblock_reserve(virt_to_phys(table), size); > > > > - *sizep = 0; > > > > - return 0; > > > > + if (efi_enabled(EFI_BOOT)) > > > > + acpi_put_table(&table->header); > > > > + else > > > > + ibft_addr = table; > > > > } > > > > diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h > > > > index b7b45ca82bea..da813c891990 100644 > > > > --- a/include/linux/iscsi_ibft.h > > > > +++ b/include/linux/iscsi_ibft.h > > > > @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; > > > > * mapped address is set in the ibft_addr variable. > > > > */ > > > > #ifdef CONFIG_ISCSI_IBFT_FIND > > > > -unsigned long find_ibft_region(unsigned long *sizep); > > > > +void reserve_ibft_region(void); > > > > #else > > > > -static inline unsigned long find_ibft_region(unsigned long *sizep) > > > > -{ > > > > - *sizep = 0; > > > > - return 0; > > > > -} > > > > +static inline void reserve_ibft_region(void) {} > > > > #endif > > > > #endif /* ISCSI_IBFT_H */ > > > Still no luck Mike, > > > > > > We're back to the original problem where the only thing that worked was to > > > run "SetPageReserved(page)" before calling "kmap(page)". The page is being > > > "freed" before ibft_init() is called as a result of the recent buddy page > > > freeing changes. > > I keep missing some little details each time :( > No worries. Thanks for all your help. Does this patch go on top of your > previous patch or is it standalone? This is standalone. > George > > Ok, let's try from the different angle. > > > > diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c > > index 4b9b329a5a92..ec43e1447336 100644 > > --- a/drivers/acpi/acpica/tbutils.c > > +++ b/drivers/acpi/acpica/tbutils.c > > @@ -7,6 +7,8 @@ > > * > > *****************************************************************************/ > > +#include <linux/memblock.h> > > + > > #include <acpi/acpi.h> > > #include "accommon.h" > > #include "actables.h" > > @@ -339,6 +341,21 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address) > > acpi_tb_parse_fadt(); > > } > > + if (ACPI_SUCCESS(status) && > > + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. > > + tables[table_index].signature, > > + ACPI_SIG_IBFT)) { > > + struct acpi_table_header *ibft; > > + struct acpi_table_desc *desc; > > + > > + desc = &acpi_gbl_root_table_list.tables[table_index]; > > + status = acpi_tb_get_table(desc, &ibft); > > + if (ACPI_SUCCESS(status)) { > > + memblock_reserve(address, ibft->length); > > + acpi_tb_put_table(desc); > > + } > > + } > > + > > next_table: > > table_entry += table_entry_size; > > > > >
On 2/25/2021 9:57 AM, Mike Rapoport wrote: > On Thu, Feb 25, 2021 at 07:38:19AM -0500, George Kennedy wrote: >> On 2/25/2021 3:53 AM, Mike Rapoport wrote: >>> Hi George, >>> >>>> On 2/24/2021 5:37 AM, Mike Rapoport wrote: >>>>> On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: >>>>>> Mike, >>>>>> >>>>>> Still no luck. >>>>>> >>>>>> [ 30.193723] iscsi: registered transport (iser) >>>>>> [ 30.195970] iBFT detected. >>>>>> [ 30.196571] BUG: unable to handle page fault for address: ffffffffff240004 >>>>> Hmm, we cannot set ibft_addr to early pointer to the ACPI table. >>>>> Let's try something more disruptive and move the reservation back to >>>>> iscsi_ibft_find.c. >>>>> >>>>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c >>>>> index 7bdc0239a943..c118dd54a747 100644 >>>>> --- a/arch/x86/kernel/acpi/boot.c >>>>> +++ b/arch/x86/kernel/acpi/boot.c >>>>> @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) >>>>> if (acpi_disabled) >>>>> return; >>>>> +#if 0 >>>>> /* >>>>> * Initialize the ACPI boot-time table parser. >>>>> */ >>>>> @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) >>>>> disable_acpi(); >>>>> return; >>>>> } >>>>> +#endif >>>>> acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); >>>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c >>>>> index d883176ef2ce..c615ce96c9a2 100644 >>>>> --- a/arch/x86/kernel/setup.c >>>>> +++ b/arch/x86/kernel/setup.c >>>>> @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) >>>>> } >>>>> -static __init void reserve_ibft_region(void) >>>>> -{ >>>>> - unsigned long addr, size = 0; >>>>> - >>>>> - addr = find_ibft_region(&size); >>>>> - >>>>> - if (size) >>>>> - memblock_reserve(addr, size); >>>>> -} >>>>> - >>>>> static bool __init snb_gfx_workaround_needed(void) >>>>> { >>>>> #ifdef CONFIG_PCI >>>>> @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) >>>>> */ >>>>> find_smp_config(); >>>>> + /* >>>>> + * Initialize the ACPI boot-time table parser. >>>>> + */ >>>>> + if (acpi_table_init()) >>>>> + disable_acpi(); >>>>> + >>>>> reserve_ibft_region(); >>>>> early_alloc_pgt_buf(); >>>>> diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c >>>>> index 64bb94523281..01be513843d6 100644 >>>>> --- a/drivers/firmware/iscsi_ibft_find.c >>>>> +++ b/drivers/firmware/iscsi_ibft_find.c >>>>> @@ -47,7 +47,25 @@ static const struct { >>>>> #define VGA_MEM 0xA0000 /* VGA buffer */ >>>>> #define VGA_SIZE 0x20000 /* 128kB */ >>>>> -static int __init find_ibft_in_mem(void) >>>>> +static void __init *acpi_find_ibft_region(void) >>>>> +{ >>>>> + int i; >>>>> + struct acpi_table_header *table = NULL; >>>>> + acpi_status status; >>>>> + >>>>> + if (acpi_disabled) >>>>> + return NULL; >>>>> + >>>>> + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { >>>>> + status = acpi_get_table(ibft_signs[i].sign, 0, &table); >>>>> + if (ACPI_SUCCESS(status)) >>>>> + return table; >>>>> + } >>>>> + >>>>> + return NULL; >>>>> +} >>>>> + >>>>> +static void __init *find_ibft_in_mem(void) >>>>> { >>>>> unsigned long pos; >>>>> unsigned int len = 0; >>>>> @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) >>>>> /* if the length of the table extends past 1M, >>>>> * the table cannot be valid. */ >>>>> if (pos + len <= (IBFT_END-1)) { >>>>> - ibft_addr = (struct acpi_table_ibft *)virt; >>>>> pr_info("iBFT found at 0x%lx.\n", pos); >>>>> - goto done; >>>>> + return virt; >>>>> } >>>>> } >>>>> } >>>>> } >>>>> -done: >>>>> - return len; >>>>> + >>>>> + return NULL; >>>>> } >>>>> + >>>>> +static void __init *find_ibft(void) >>>>> +{ >>>>> + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will >>>>> + * only use ACPI for this */ >>>>> + if (!efi_enabled(EFI_BOOT)) >>>>> + return find_ibft_in_mem(); >>>>> + else >>>>> + return acpi_find_ibft_region(); >>>>> +} >>>>> + >>>>> /* >>>>> * Routine used to find the iSCSI Boot Format Table. The logical >>>>> * kernel address is set in the ibft_addr global variable. >>>>> */ >>>>> -unsigned long __init find_ibft_region(unsigned long *sizep) >>>>> +void __init reserve_ibft_region(void) >>>>> { >>>>> - ibft_addr = NULL; >>>>> + struct acpi_table_ibft *table; >>>>> + unsigned long size; >>>>> - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will >>>>> - * only use ACPI for this */ >>>>> + table = find_ibft(); >>>>> + if (!table) >>>>> + return; >>>>> - if (!efi_enabled(EFI_BOOT)) >>>>> - find_ibft_in_mem(); >>>>> - >>>>> - if (ibft_addr) { >>>>> - *sizep = PAGE_ALIGN(ibft_addr->header.length); >>>>> - return (u64)virt_to_phys(ibft_addr); >>>>> - } >>>>> + size = PAGE_ALIGN(table->header.length); >>>>> + memblock_reserve(virt_to_phys(table), size); >>>>> - *sizep = 0; >>>>> - return 0; >>>>> + if (efi_enabled(EFI_BOOT)) >>>>> + acpi_put_table(&table->header); >>>>> + else >>>>> + ibft_addr = table; >>>>> } >>>>> diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h >>>>> index b7b45ca82bea..da813c891990 100644 >>>>> --- a/include/linux/iscsi_ibft.h >>>>> +++ b/include/linux/iscsi_ibft.h >>>>> @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; >>>>> * mapped address is set in the ibft_addr variable. >>>>> */ >>>>> #ifdef CONFIG_ISCSI_IBFT_FIND >>>>> -unsigned long find_ibft_region(unsigned long *sizep); >>>>> +void reserve_ibft_region(void); >>>>> #else >>>>> -static inline unsigned long find_ibft_region(unsigned long *sizep) >>>>> -{ >>>>> - *sizep = 0; >>>>> - return 0; >>>>> -} >>>>> +static inline void reserve_ibft_region(void) {} >>>>> #endif >>>>> #endif /* ISCSI_IBFT_H */ >>>> Still no luck Mike, >>>> >>>> We're back to the original problem where the only thing that worked was to >>>> run "SetPageReserved(page)" before calling "kmap(page)". The page is being >>>> "freed" before ibft_init() is called as a result of the recent buddy page >>>> freeing changes. >>> I keep missing some little details each time :( >> No worries. Thanks for all your help. Does this patch go on top of your >> previous patch or is it standalone? > This is standalone. > >> George >>> Ok, let's try from the different angle. >>> >>> diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c >>> index 4b9b329a5a92..ec43e1447336 100644 >>> --- a/drivers/acpi/acpica/tbutils.c >>> +++ b/drivers/acpi/acpica/tbutils.c >>> @@ -7,6 +7,8 @@ >>> * >>> *****************************************************************************/ >>> +#include <linux/memblock.h> >>> + >>> #include <acpi/acpi.h> >>> #include "accommon.h" >>> #include "actables.h" >>> @@ -339,6 +341,21 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address) >>> acpi_tb_parse_fadt(); >>> } >>> + if (ACPI_SUCCESS(status) && >>> + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. >>> + tables[table_index].signature, >>> + ACPI_SIG_IBFT)) { >>> + struct acpi_table_header *ibft; >>> + struct acpi_table_desc *desc; >>> + >>> + desc = &acpi_gbl_root_table_list.tables[table_index]; >>> + status = acpi_tb_get_table(desc, &ibft); >>> + if (ACPI_SUCCESS(status)) { >>> + memblock_reserve(address, ibft->length); >>> + acpi_tb_put_table(desc); >>> + >>> + } >>> + >>> next_table: >>> table_entry += table_entry_size; >>> >>> Applied just your latest patch, but same failure. I thought there was an earlier comment (which I can't find now) that stated that memblock_reserve() wouldn't reserve the page, which is what's needed here. [ 30.308229] iBFT detected.. [ 30.308796] ================================================================== [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 [ 30.308890] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 [ 30.308890] [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #12 [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 30.308890] Call Trace: [ 30.308890] dump_stack+0xdb/0x120 [ 30.308890] ? ibft_init+0x134/0xc33 [ 30.308890] print_address_description.constprop.7+0x41/0x60 [ 30.308890] ? ibft_init+0x134/0xc33 [ 30.308890] ? ibft_init+0x134/0xc33 [ 30.308890] kasan_report.cold.10+0x78/0xd1 [ 30.308890] ? ibft_init+0x134/0xc33 [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 [ 30.308890] ibft_init+0x134/0xc33 [ 30.308890] ? write_comp_data+0x2f/0x90 [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 [ 30.308890] ? write_comp_data+0x2f/0x90 [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 [ 30.308890] do_one_initcall+0xc4/0x3e0 [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 [ 30.308890] ? unpoison_range+0x14/0x40 [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 [ 30.308890] ? kernel_init_freeable+0x420/0x652 [ 30.308890] ? __kasan_kmalloc+0x9/0x10 [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 30.308890] kernel_init_freeable+0x596/0x652 [ 30.308890] ? console_on_rootfs+0x7d/0x7d [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 [ 30.308890] ? rest_init+0xf0/0xf0 [ 30.308890] kernel_init+0x16/0x1d0 [ 30.308890] ? rest_init+0xf0/0xf0 [ 30.308890] ret_from_fork+0x22/0x30 [ 30.308890] [ 30.308890] The buggy address belongs to the page: [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0xbe453 [ 30.308890] flags: 0xfffffc0000000() [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 0000000000000000 [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 [ 30.308890] page dumped because: kasan: bad access detected [ 30.308890] page_owner tracks the page as freed [ 30.308890] page last allocated via order 0, migratetype Movable, gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts 28121288605 [ 30.308890] prep_new_page+0xfb/0x140 [ 30.308890] get_page_from_freelist+0x3503/0x5730 [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 [ 30.308890] alloc_pages_vma+0xe2/0x560 [ 30.308890] __handle_mm_fault+0x930/0x26c0 [ 30.308890] handle_mm_fault+0x1f9/0x810 [ 30.308890] do_user_addr_fault+0x6f7/0xca0 [ 30.308890] exc_page_fault+0xaf/0x1a0 [ 30.308890] asm_exc_page_fault+0x1e/0x30 [ 30.308890] page last free stack trace: [ 30.308890] free_pcp_prepare+0x122/0x290 [ 30.308890] free_unref_page_list+0xe6/0x490 [ 30.308890] release_pages+0x2ed/0x1270 [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 [ 30.308890] tlb_flush_mmu+0x11e/0x680 [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 [ 30.308890] exit_mmap+0x2b3/0x540 [ 30.308890] mmput+0x11d/0x450 [ 30.308890] do_exit+0xaa6/0x2d40 [ 30.308890] do_group_exit+0x128/0x340 [ 30.308890] __x64_sys_exit_group+0x43/0x50 [ 30.308890] do_syscall_64+0x37/0x50 [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.308890] [ 30.308890] Memory state around the buggy address: [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.308890] ^ [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 30.308890] ================================================================== George
On 2/25/2021 10:22 AM, George Kennedy wrote: > > > On 2/25/2021 9:57 AM, Mike Rapoport wrote: >> On Thu, Feb 25, 2021 at 07:38:19AM -0500, George Kennedy wrote: >>> On 2/25/2021 3:53 AM, Mike Rapoport wrote: >>>> Hi George, >>>> >>>>> On 2/24/2021 5:37 AM, Mike Rapoport wrote: >>>>>> On Tue, Feb 23, 2021 at 04:46:28PM -0500, George Kennedy wrote: >>>>>>> Mike, >>>>>>> >>>>>>> Still no luck. >>>>>>> >>>>>>> [ 30.193723] iscsi: registered transport (iser) >>>>>>> [ 30.195970] iBFT detected. >>>>>>> [ 30.196571] BUG: unable to handle page fault for address: >>>>>>> ffffffffff240004 >>>>>> Hmm, we cannot set ibft_addr to early pointer to the ACPI table. >>>>>> Let's try something more disruptive and move the reservation back to >>>>>> iscsi_ibft_find.c. >>>>>> >>>>>> diff --git a/arch/x86/kernel/acpi/boot.c >>>>>> b/arch/x86/kernel/acpi/boot.c >>>>>> index 7bdc0239a943..c118dd54a747 100644 >>>>>> --- a/arch/x86/kernel/acpi/boot.c >>>>>> +++ b/arch/x86/kernel/acpi/boot.c >>>>>> @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) >>>>>> if (acpi_disabled) >>>>>> return; >>>>>> +#if 0 >>>>>> /* >>>>>> * Initialize the ACPI boot-time table parser. >>>>>> */ >>>>>> @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) >>>>>> disable_acpi(); >>>>>> return; >>>>>> } >>>>>> +#endif >>>>>> acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); >>>>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c >>>>>> index d883176ef2ce..c615ce96c9a2 100644 >>>>>> --- a/arch/x86/kernel/setup.c >>>>>> +++ b/arch/x86/kernel/setup.c >>>>>> @@ -570,16 +570,6 @@ void __init reserve_standard_io_resources(void) >>>>>> } >>>>>> -static __init void reserve_ibft_region(void) >>>>>> -{ >>>>>> - unsigned long addr, size = 0; >>>>>> - >>>>>> - addr = find_ibft_region(&size); >>>>>> - >>>>>> - if (size) >>>>>> - memblock_reserve(addr, size); >>>>>> -} >>>>>> - >>>>>> static bool __init snb_gfx_workaround_needed(void) >>>>>> { >>>>>> #ifdef CONFIG_PCI >>>>>> @@ -1032,6 +1022,12 @@ void __init setup_arch(char **cmdline_p) >>>>>> */ >>>>>> find_smp_config(); >>>>>> + /* >>>>>> + * Initialize the ACPI boot-time table parser. >>>>>> + */ >>>>>> + if (acpi_table_init()) >>>>>> + disable_acpi(); >>>>>> + >>>>>> reserve_ibft_region(); >>>>>> early_alloc_pgt_buf(); >>>>>> diff --git a/drivers/firmware/iscsi_ibft_find.c >>>>>> b/drivers/firmware/iscsi_ibft_find.c >>>>>> index 64bb94523281..01be513843d6 100644 >>>>>> --- a/drivers/firmware/iscsi_ibft_find.c >>>>>> +++ b/drivers/firmware/iscsi_ibft_find.c >>>>>> @@ -47,7 +47,25 @@ static const struct { >>>>>> #define VGA_MEM 0xA0000 /* VGA buffer */ >>>>>> #define VGA_SIZE 0x20000 /* 128kB */ >>>>>> -static int __init find_ibft_in_mem(void) >>>>>> +static void __init *acpi_find_ibft_region(void) >>>>>> +{ >>>>>> + int i; >>>>>> + struct acpi_table_header *table = NULL; >>>>>> + acpi_status status; >>>>>> + >>>>>> + if (acpi_disabled) >>>>>> + return NULL; >>>>>> + >>>>>> + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { >>>>>> + status = acpi_get_table(ibft_signs[i].sign, 0, &table); >>>>>> + if (ACPI_SUCCESS(status)) >>>>>> + return table; >>>>>> + } >>>>>> + >>>>>> + return NULL; >>>>>> +} >>>>>> + >>>>>> +static void __init *find_ibft_in_mem(void) >>>>>> { >>>>>> unsigned long pos; >>>>>> unsigned int len = 0; >>>>>> @@ -70,35 +88,44 @@ static int __init find_ibft_in_mem(void) >>>>>> /* if the length of the table extends past 1M, >>>>>> * the table cannot be valid. */ >>>>>> if (pos + len <= (IBFT_END-1)) { >>>>>> - ibft_addr = (struct acpi_table_ibft *)virt; >>>>>> pr_info("iBFT found at 0x%lx.\n", pos); >>>>>> - goto done; >>>>>> + return virt; >>>>>> } >>>>>> } >>>>>> } >>>>>> } >>>>>> -done: >>>>>> - return len; >>>>>> + >>>>>> + return NULL; >>>>>> } >>>>>> + >>>>>> +static void __init *find_ibft(void) >>>>>> +{ >>>>>> + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will >>>>>> + * only use ACPI for this */ >>>>>> + if (!efi_enabled(EFI_BOOT)) >>>>>> + return find_ibft_in_mem(); >>>>>> + else >>>>>> + return acpi_find_ibft_region(); >>>>>> +} >>>>>> + >>>>>> /* >>>>>> * Routine used to find the iSCSI Boot Format Table. The logical >>>>>> * kernel address is set in the ibft_addr global variable. >>>>>> */ >>>>>> -unsigned long __init find_ibft_region(unsigned long *sizep) >>>>>> +void __init reserve_ibft_region(void) >>>>>> { >>>>>> - ibft_addr = NULL; >>>>>> + struct acpi_table_ibft *table; >>>>>> + unsigned long size; >>>>>> - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will >>>>>> - * only use ACPI for this */ >>>>>> + table = find_ibft(); >>>>>> + if (!table) >>>>>> + return; >>>>>> - if (!efi_enabled(EFI_BOOT)) >>>>>> - find_ibft_in_mem(); >>>>>> - >>>>>> - if (ibft_addr) { >>>>>> - *sizep = PAGE_ALIGN(ibft_addr->header.length); >>>>>> - return (u64)virt_to_phys(ibft_addr); >>>>>> - } >>>>>> + size = PAGE_ALIGN(table->header.length); >>>>>> + memblock_reserve(virt_to_phys(table), size); >>>>>> - *sizep = 0; >>>>>> - return 0; >>>>>> + if (efi_enabled(EFI_BOOT)) >>>>>> + acpi_put_table(&table->header); >>>>>> + else >>>>>> + ibft_addr = table; >>>>>> } >>>>>> diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h >>>>>> index b7b45ca82bea..da813c891990 100644 >>>>>> --- a/include/linux/iscsi_ibft.h >>>>>> +++ b/include/linux/iscsi_ibft.h >>>>>> @@ -26,13 +26,9 @@ extern struct acpi_table_ibft *ibft_addr; >>>>>> * mapped address is set in the ibft_addr variable. >>>>>> */ >>>>>> #ifdef CONFIG_ISCSI_IBFT_FIND >>>>>> -unsigned long find_ibft_region(unsigned long *sizep); >>>>>> +void reserve_ibft_region(void); >>>>>> #else >>>>>> -static inline unsigned long find_ibft_region(unsigned long *sizep) >>>>>> -{ >>>>>> - *sizep = 0; >>>>>> - return 0; >>>>>> -} >>>>>> +static inline void reserve_ibft_region(void) {} >>>>>> #endif >>>>>> #endif /* ISCSI_IBFT_H */ >>>>> Still no luck Mike, >>>>> >>>>> We're back to the original problem where the only thing that >>>>> worked was to >>>>> run "SetPageReserved(page)" before calling "kmap(page)". The page >>>>> is being >>>>> "freed" before ibft_init() is called as a result of the recent >>>>> buddy page >>>>> freeing changes. >>>> I keep missing some little details each time :( >>> No worries. Thanks for all your help. Does this patch go on top of your >>> previous patch or is it standalone? >> This is standalone. >>> George >>>> Ok, let's try from the different angle. >>>> >>>> diff --git a/drivers/acpi/acpica/tbutils.c >>>> b/drivers/acpi/acpica/tbutils.c >>>> index 4b9b329a5a92..ec43e1447336 100644 >>>> --- a/drivers/acpi/acpica/tbutils.c >>>> +++ b/drivers/acpi/acpica/tbutils.c >>>> @@ -7,6 +7,8 @@ >>>> * >>>> *****************************************************************************/ >>>> +#include <linux/memblock.h> >>>> + >>>> #include <acpi/acpi.h> >>>> #include "accommon.h" >>>> #include "actables.h" >>>> @@ -339,6 +341,21 @@ acpi_tb_parse_root_table(acpi_physical_address >>>> rsdp_address) >>>> acpi_tb_parse_fadt(); >>>> } >>>> + if (ACPI_SUCCESS(status) && >>>> + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. >>>> + tables[table_index].signature, >>>> + ACPI_SIG_IBFT)) { >>>> + struct acpi_table_header *ibft; >>>> + struct acpi_table_desc *desc; >>>> + >>>> + desc = &acpi_gbl_root_table_list.tables[table_index]; >>>> + status = acpi_tb_get_table(desc, &ibft); >>>> + if (ACPI_SUCCESS(status)) { >>>> + memblock_reserve(address, ibft->length); >>>> + acpi_tb_put_table(desc); >>>> + >>>> + } >>>> + >>>> next_table: >>>> table_entry += table_entry_size; >>>> >>>> > Applied just your latest patch, but same failure. > > I thought there was an earlier comment (which I can't find now) that > stated that memblock_reserve() wouldn't reserve the page, which is > what's needed here. Mike, Here was David's explanation of what he thinks is going on (or should be going on) from a few days ago: QUOTE... I assume that acpi_map()/acpi_unmap() map some firmware blob that is provided via firmware/bios/... to us. should_use_kmap() tells us whether a) we have a "struct page" and should kmap() that one b) we don't have a "struct page" and should ioremap. As it is a blob, the firmware should always reserve that memory region via memblock (e.g., memblock_reserve()), such that we either 1) don't create a memmap ("struct page") at all (-> case b) ) 2) if we have to create e memmap, we mark the page PG_reserved and *never* expose it to the buddy (-> case a) ) Are you telling me that in this case we might have a memmap for the HW blob that is *not* PG_reserved? In that case it most probably got exposed to the buddy where it can happily get allocated/freed. The latent BUG would be that that blob gets exposed to the system like ordinary RAM, and not reserved via memblock early during boot. Assuming that blob has a low physical address, with my patch it will get allocated/used a lot earlier - which would mean we trigger this latent BUG now more easily. ...END_QUOTE Your most recent patch has added the memblock_reserve(), but it's still missing the PG_reserved setting. Thanks, George > > [ 30.308229] iBFT detected.. > [ 30.308796] > ================================================================== > [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 > [ 30.308890] Read of size 4 at addr ffff8880be453004 by task > swapper/0/1 > [ 30.308890] > [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted > 5.11.0-f9593a0 #12 > [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 0.0.0 02/06/2015 > [ 30.308890] Call Trace: > [ 30.308890] dump_stack+0xdb/0x120 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] print_address_description.constprop.7+0x41/0x60 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] kasan_report.cold.10+0x78/0xd1 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 > [ 30.308890] ibft_init+0x134/0xc33 > [ 30.308890] ? write_comp_data+0x2f/0x90 > [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 > [ 30.308890] ? write_comp_data+0x2f/0x90 > [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 > [ 30.308890] do_one_initcall+0xc4/0x3e0 > [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 > [ 30.308890] ? unpoison_range+0x14/0x40 > [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 > [ 30.308890] ? kernel_init_freeable+0x420/0x652 > [ 30.308890] ? __kasan_kmalloc+0x9/0x10 > [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 > [ 30.308890] kernel_init_freeable+0x596/0x652 > [ 30.308890] ? console_on_rootfs+0x7d/0x7d > [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 > [ 30.308890] ? rest_init+0xf0/0xf0 > [ 30.308890] kernel_init+0x16/0x1d0 > [ 30.308890] ? rest_init+0xf0/0xf0 > [ 30.308890] ret_from_fork+0x22/0x30 > [ 30.308890] > [ 30.308890] The buggy address belongs to the page: > [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 > mapping:0000000000000000 index:0x1 pfn:0xbe453 > [ 30.308890] flags: 0xfffffc0000000() > [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 > 0000000000000000 > [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff > 0000000000000000 > [ 30.308890] page dumped because: kasan: bad access detected > [ 30.308890] page_owner tracks the page as freed > [ 30.308890] page last allocated via order 0, migratetype Movable, > gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts > 28121288605 > [ 30.308890] prep_new_page+0xfb/0x140 > [ 30.308890] get_page_from_freelist+0x3503/0x5730 > [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 > [ 30.308890] alloc_pages_vma+0xe2/0x560 > [ 30.308890] __handle_mm_fault+0x930/0x26c0 > [ 30.308890] handle_mm_fault+0x1f9/0x810 > [ 30.308890] do_user_addr_fault+0x6f7/0xca0 > [ 30.308890] exc_page_fault+0xaf/0x1a0 > [ 30.308890] asm_exc_page_fault+0x1e/0x30 > [ 30.308890] page last free stack trace: > [ 30.308890] free_pcp_prepare+0x122/0x290 > [ 30.308890] free_unref_page_list+0xe6/0x490 > [ 30.308890] release_pages+0x2ed/0x1270 > [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 > [ 30.308890] tlb_flush_mmu+0x11e/0x680 > [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 > [ 30.308890] exit_mmap+0x2b3/0x540 > [ 30.308890] mmput+0x11d/0x450 > [ 30.308890] do_exit+0xaa6/0x2d40 > [ 30.308890] do_group_exit+0x128/0x340 > [ 30.308890] __x64_sys_exit_group+0x43/0x50 > [ 30.308890] do_syscall_64+0x37/0x50 > [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 30.308890] > [ 30.308890] Memory state around the buggy address: > [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff > ff ff ff ff > [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff > ff ff ff ff > [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff > ff ff ff ff > [ 30.308890] ^ > [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff > ff ff ff ff > [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff > ff ff ff ff > [ 30.308890] > ================================================================== > > George >
On Thu, Feb 25, 2021 at 10:22:44AM -0500, George Kennedy wrote: > > > > > > On 2/24/2021 5:37 AM, Mike Rapoport wrote: > > Applied just your latest patch, but same failure. > > I thought there was an earlier comment (which I can't find now) that stated > that memblock_reserve() wouldn't reserve the page, which is what's needed > here. Actually, I think that memblock_reserve() should be just fine, but it seems I'm missing something in address calculation each time. What would happen if you stuck memblock_reserve(0xbe453000, PAGE_SIZE); say, at the beginning of find_ibft_region()? > [ 30.308229] iBFT detected.. > [ 30.308796] > ================================================================== > [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 > [ 30.308890] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 > [ 30.308890] > [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #12 > [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 0.0.0 02/06/2015 > [ 30.308890] Call Trace: > [ 30.308890] dump_stack+0xdb/0x120 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] print_address_description.constprop.7+0x41/0x60 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] kasan_report.cold.10+0x78/0xd1 > [ 30.308890] ? ibft_init+0x134/0xc33 > [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 > [ 30.308890] ibft_init+0x134/0xc33 > [ 30.308890] ? write_comp_data+0x2f/0x90 > [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 > [ 30.308890] ? write_comp_data+0x2f/0x90 > [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 > [ 30.308890] do_one_initcall+0xc4/0x3e0 > [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 > [ 30.308890] ? unpoison_range+0x14/0x40 > [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 > [ 30.308890] ? kernel_init_freeable+0x420/0x652 > [ 30.308890] ? __kasan_kmalloc+0x9/0x10 > [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 > [ 30.308890] kernel_init_freeable+0x596/0x652 > [ 30.308890] ? console_on_rootfs+0x7d/0x7d > [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 > [ 30.308890] ? rest_init+0xf0/0xf0 > [ 30.308890] kernel_init+0x16/0x1d0 > [ 30.308890] ? rest_init+0xf0/0xf0 > [ 30.308890] ret_from_fork+0x22/0x30 > [ 30.308890] > [ 30.308890] The buggy address belongs to the page: > [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 > mapping:0000000000000000 index:0x1 pfn:0xbe453 > [ 30.308890] flags: 0xfffffc0000000() > [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 > 0000000000000000 > [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff > 0000000000000000 > [ 30.308890] page dumped because: kasan: bad access detected > [ 30.308890] page_owner tracks the page as freed > [ 30.308890] page last allocated via order 0, migratetype Movable, > gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts 28121288605 > [ 30.308890] prep_new_page+0xfb/0x140 > [ 30.308890] get_page_from_freelist+0x3503/0x5730 > [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 > [ 30.308890] alloc_pages_vma+0xe2/0x560 > [ 30.308890] __handle_mm_fault+0x930/0x26c0 > [ 30.308890] handle_mm_fault+0x1f9/0x810 > [ 30.308890] do_user_addr_fault+0x6f7/0xca0 > [ 30.308890] exc_page_fault+0xaf/0x1a0 > [ 30.308890] asm_exc_page_fault+0x1e/0x30 > [ 30.308890] page last free stack trace: > [ 30.308890] free_pcp_prepare+0x122/0x290 > [ 30.308890] free_unref_page_list+0xe6/0x490 > [ 30.308890] release_pages+0x2ed/0x1270 > [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 > [ 30.308890] tlb_flush_mmu+0x11e/0x680 > [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 > [ 30.308890] exit_mmap+0x2b3/0x540 > [ 30.308890] mmput+0x11d/0x450 > [ 30.308890] do_exit+0xaa6/0x2d40 > [ 30.308890] do_group_exit+0x128/0x340 > [ 30.308890] __x64_sys_exit_group+0x43/0x50 > [ 30.308890] do_syscall_64+0x37/0x50 > [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 30.308890] > [ 30.308890] Memory state around the buggy address: > [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ff ff > [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ff ff > [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ff ff > [ 30.308890] ^ > [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ff ff > [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ff ff > [ 30.308890] > ================================================================== > > George >
On 2/25/2021 11:07 AM, Mike Rapoport wrote: > On Thu, Feb 25, 2021 at 10:22:44AM -0500, George Kennedy wrote: >>>>>> On 2/24/2021 5:37 AM, Mike Rapoport wrote: >> Applied just your latest patch, but same failure. >> >> I thought there was an earlier comment (which I can't find now) that stated >> that memblock_reserve() wouldn't reserve the page, which is what's needed >> here. > Actually, I think that memblock_reserve() should be just fine, but it seems > I'm missing something in address calculation each time. > > What would happen if you stuck > > memblock_reserve(0xbe453000, PAGE_SIZE); > > say, at the beginning of find_ibft_region()? Added debug to your patch and this is all that shows up. Looks like the patch is in the wrong place as acpi_tb_parse_root_table() is only called for the RSDP address. [ 0.064317] ACPI: Early table checksum verification disabled [ 0.065437] XXX acpi_tb_parse_root_table: rsdp_address=bfbfa014 [ 0.066612] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) [ 0.067759] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP 00000001 01000013) [ 0.069470] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001) [ 0.071183] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001) [ 0.072876] ACPI: FACS 0x00000000BFBFD000 000040 [ 0.073806] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001) [ 0.075501] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001) [ 0.077194] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 00000002 01000013) [ 0.078880] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) [ 0.080588] ACPI: Local APIC address 0xfee00000 diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c index dfe1ac3..603b3a8 100644 --- a/drivers/acpi/acpica/tbutils.c +++ b/drivers/acpi/acpica/tbutils.c @@ -7,6 +7,8 @@ * *****************************************************************************/ +#include <linux/memblock.h> + #include <acpi/acpi.h> #include "accommon.h" #include "actables.h" @@ -232,6 +234,8 @@ struct acpi_table_header *acpi_tb_copy_dsdt(u32 table_index) acpi_status status; u32 table_index; +printk(KERN_ERR "XXX acpi_tb_parse_root_table: rsdp_address=%llx\n", rsdp_address); + ACPI_FUNCTION_TRACE(tb_parse_root_table); /* Map the entire RSDP and extract the address of the RSDT or XSDT */ @@ -339,6 +343,22 @@ struct acpi_table_header *acpi_tb_copy_dsdt(u32 table_index) acpi_tb_parse_fadt(); } + if (ACPI_SUCCESS(status) && + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. + tables[table_index].signature, + ACPI_SIG_IBFT)) { + struct acpi_table_header *ibft; + struct acpi_table_desc *desc; + + desc = &acpi_gbl_root_table_list.tables[table_index]; + status = acpi_tb_get_table(desc, &ibft); + if (ACPI_SUCCESS(status)) { +printk(KERN_ERR "XXX acpi_tb_parse_root_table(calling memblock_reserve()): addres=%llx, ibft->length=%x\n", address, ibft->length); + memblock_reserve(address, ibft->length); + acpi_tb_put_table(desc); + } + } + next_table: table_entry += table_entry_size; > >> [ 30.308229] iBFT detected.. >> [ 30.308796] >> ================================================================== >> [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 >> [ 30.308890] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 >> [ 30.308890] >> [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #12 >> [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 0.0.0 02/06/2015 >> [ 30.308890] Call Trace: >> [ 30.308890] dump_stack+0xdb/0x120 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] print_address_description.constprop.7+0x41/0x60 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] kasan_report.cold.10+0x78/0xd1 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 >> [ 30.308890] ibft_init+0x134/0xc33 >> [ 30.308890] ? write_comp_data+0x2f/0x90 >> [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 >> [ 30.308890] ? write_comp_data+0x2f/0x90 >> [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 >> [ 30.308890] do_one_initcall+0xc4/0x3e0 >> [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 >> [ 30.308890] ? unpoison_range+0x14/0x40 >> [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 >> [ 30.308890] ? kernel_init_freeable+0x420/0x652 >> [ 30.308890] ? __kasan_kmalloc+0x9/0x10 >> [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 >> [ 30.308890] kernel_init_freeable+0x596/0x652 >> [ 30.308890] ? console_on_rootfs+0x7d/0x7d >> [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 >> [ 30.308890] ? rest_init+0xf0/0xf0 >> [ 30.308890] kernel_init+0x16/0x1d0 >> [ 30.308890] ? rest_init+0xf0/0xf0 >> [ 30.308890] ret_from_fork+0x22/0x30 >> [ 30.308890] >> [ 30.308890] The buggy address belongs to the page: >> [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x1 pfn:0xbe453 >> [ 30.308890] flags: 0xfffffc0000000() >> [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 >> 0000000000000000 >> [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff >> 0000000000000000 >> [ 30.308890] page dumped because: kasan: bad access detected >> [ 30.308890] page_owner tracks the page as freed >> [ 30.308890] page last allocated via order 0, migratetype Movable, >> gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts 28121288605 >> [ 30.308890] prep_new_page+0xfb/0x140 >> [ 30.308890] get_page_from_freelist+0x3503/0x5730 >> [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 >> [ 30.308890] alloc_pages_vma+0xe2/0x560 >> [ 30.308890] __handle_mm_fault+0x930/0x26c0 >> [ 30.308890] handle_mm_fault+0x1f9/0x810 >> [ 30.308890] do_user_addr_fault+0x6f7/0xca0 >> [ 30.308890] exc_page_fault+0xaf/0x1a0 >> [ 30.308890] asm_exc_page_fault+0x1e/0x30 >> [ 30.308890] page last free stack trace: >> [ 30.308890] free_pcp_prepare+0x122/0x290 >> [ 30.308890] free_unref_page_list+0xe6/0x490 >> [ 30.308890] release_pages+0x2ed/0x1270 >> [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 >> [ 30.308890] tlb_flush_mmu+0x11e/0x680 >> [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 >> [ 30.308890] exit_mmap+0x2b3/0x540 >> [ 30.308890] mmput+0x11d/0x450 >> [ 30.308890] do_exit+0xaa6/0x2d40 >> [ 30.308890] do_group_exit+0x128/0x340 >> [ 30.308890] __x64_sys_exit_group+0x43/0x50 >> [ 30.308890] do_syscall_64+0x37/0x50 >> [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> [ 30.308890] >> [ 30.308890] Memory state around the buggy address: >> [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] ^ >> [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] >> ================================================================== >> >> George >>
On 25.02.21 17:31, George Kennedy wrote: > : rsdp_address=bfbfa014 > [ 0.066612] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > [ 0.067759] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP > 00000001 01000013) > [ 0.069470] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP > 00000001 BXPC 00000001) > [ 0.071183] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT > 00000001 BXPC 00000001) > [ 0.072876] ACPI: FACS 0x00000000BFBFD000 000040 > [ 0.073806] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC > 00000001 BXPC 00000001) > [ 0.075501] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET > 00000001 BXPC 00000001) > [ 0.077194] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 > 00000002 01000013) > [ 0.078880] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP > 00000000 00000000) Can you explore the relevant area using the page-flags tools (located in Linux src code located in tools/vm/page-flags.c) ./page-types -L -r -a 0xbe490,0xbe4a0
On 2/25/2021 11:07 AM, Mike Rapoport wrote: > On Thu, Feb 25, 2021 at 10:22:44AM -0500, George Kennedy wrote: >>>>>> On 2/24/2021 5:37 AM, Mike Rapoport wrote: >> Applied just your latest patch, but same failure. >> >> I thought there was an earlier comment (which I can't find now) that stated >> that memblock_reserve() wouldn't reserve the page, which is what's needed >> here. > Actually, I think that memblock_reserve() should be just fine, but it seems > I'm missing something in address calculation each time. > > What would happen if you stuck > > memblock_reserve(0xbe453000, PAGE_SIZE); > > say, at the beginning of find_ibft_region()? Good news Mike! The above hack in yesterday's last patch works - 10 successful reboots. See: "BE453" below for the hack. I'll modify the patch to use "table_desc->address" instead, which is the physical address of the table. diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7bdc023..c118dd5 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) if (acpi_disabled) return; +#if 0 /* * Initialize the ACPI boot-time table parser. */ @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) disable_acpi(); return; } +#endif acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 740f3bdb..b045ab2 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -571,16 +571,6 @@ void __init reserve_standard_io_resources(void) } -static __init void reserve_ibft_region(void) -{ - unsigned long addr, size = 0; - - addr = find_ibft_region(&size); - - if (size) - memblock_reserve(addr, size); -} - static bool __init snb_gfx_workaround_needed(void) { #ifdef CONFIG_PCI @@ -1033,6 +1023,12 @@ void __init setup_arch(char **cmdline_p) */ find_smp_config(); + /* + * Initialize the ACPI boot-time table parser. + */ + if (acpi_table_init()) + disable_acpi(); + reserve_ibft_region(); early_alloc_pgt_buf(); diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 64bb945..95fc1a6 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -47,7 +47,25 @@ #define VGA_MEM 0xA0000 /* VGA buffer */ #define VGA_SIZE 0x20000 /* 128kB */ -static int __init find_ibft_in_mem(void) +static void __init *acpi_find_ibft_region(void) +{ + int i; + struct acpi_table_header *table = NULL; + acpi_status status; + + if (acpi_disabled) + return NULL; + + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { + status = acpi_get_table(ibft_signs[i].sign, 0, &table); + if (ACPI_SUCCESS(status)) + return table; + } + + return NULL; +} + +static void __init *find_ibft_in_mem(void) { unsigned long pos; unsigned int len = 0; @@ -70,35 +88,52 @@ static int __init find_ibft_in_mem(void) /* if the length of the table extends past 1M, * the table cannot be valid. */ if (pos + len <= (IBFT_END-1)) { - ibft_addr = (struct acpi_table_ibft *)virt; pr_info("iBFT found at 0x%lx.\n", pos); - goto done; + return virt; } } } } -done: - return len; + + return NULL; } + +static void __init *find_ibft(void) +{ + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will + * only use ACPI for this */ + if (!efi_enabled(EFI_BOOT)) + return find_ibft_in_mem(); + else + return acpi_find_ibft_region(); +} + /* * Routine used to find the iSCSI Boot Format Table. The logical * kernel address is set in the ibft_addr global variable. */ -unsigned long __init find_ibft_region(unsigned long *sizep) +void __init reserve_ibft_region(void) { - ibft_addr = NULL; + struct acpi_table_ibft *table; + unsigned long size; - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will - * only use ACPI for this */ + table = find_ibft(); + if (!table) + return; - if (!efi_enabled(EFI_BOOT)) - find_ibft_in_mem(); - - if (ibft_addr) { - *sizep = PAGE_ALIGN(ibft_addr->header.length); - return (u64)virt_to_phys(ibft_addr); - } + size = PAGE_ALIGN(table->header.length); +#if 0 +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, virt_to_phys(table)=%llx, size=%lx\n", + (u64)table, virt_to_phys(table), size); + memblock_reserve(virt_to_phys(table), size); +#else +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, 0x00000000BE453000, size=%lx\n", + (u64)table, size); + memblock_reserve(0x00000000BE453000, size); +#endif - *sizep = 0; - return 0; + if (efi_enabled(EFI_BOOT)) + acpi_put_table(&table->header); + else + ibft_addr = table; } diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h index b7b45ca..da813c8 100644 --- a/include/linux/iscsi_ibft.h +++ b/include/linux/iscsi_ibft.h @@ -26,13 +26,9 @@ * mapped address is set in the ibft_addr variable. */ #ifdef CONFIG_ISCSI_IBFT_FIND -unsigned long find_ibft_region(unsigned long *sizep); +void reserve_ibft_region(void); #else -static inline unsigned long find_ibft_region(unsigned long *sizep) -{ - *sizep = 0; - return 0; -} +static inline void reserve_ibft_region(void) {} #endif #endif /* ISCSI_IBFT_H */ Debug from the above: [ 0.020293] last_pfn = 0xbfedc max_arch_pfn = 0x400000000 [ 0.050778] ACPI: Early table checksum verification disabled [ 0.056475] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) [ 0.057628] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP 00000001 01000013) [ 0.059341] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001) [ 0.061043] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001) [ 0.062740] ACPI: FACS 0x00000000BFBFD000 000040 [ 0.063673] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001) [ 0.065369] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001) [ 0.067061] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 00000002 01000013) [ 0.068761] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) [ 0.070461] XXX reserve_ibft_region: table=ffffffffff240000, 0x00000000BE453000, size=1000 [ 0.072231] check: Scanning 1 areas for low memory corruption George > >> [ 30.308229] iBFT detected.. >> [ 30.308796] >> ================================================================== >> [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 >> [ 30.308890] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 >> [ 30.308890] >> [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #12 >> [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 0.0.0 02/06/2015 >> [ 30.308890] Call Trace: >> [ 30.308890] dump_stack+0xdb/0x120 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] print_address_description.constprop.7+0x41/0x60 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] kasan_report.cold.10+0x78/0xd1 >> [ 30.308890] ? ibft_init+0x134/0xc33 >> [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 >> [ 30.308890] ibft_init+0x134/0xc33 >> [ 30.308890] ? write_comp_data+0x2f/0x90 >> [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 >> [ 30.308890] ? write_comp_data+0x2f/0x90 >> [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 >> [ 30.308890] do_one_initcall+0xc4/0x3e0 >> [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 >> [ 30.308890] ? unpoison_range+0x14/0x40 >> [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 >> [ 30.308890] ? kernel_init_freeable+0x420/0x652 >> [ 30.308890] ? __kasan_kmalloc+0x9/0x10 >> [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 >> [ 30.308890] kernel_init_freeable+0x596/0x652 >> [ 30.308890] ? console_on_rootfs+0x7d/0x7d >> [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 >> [ 30.308890] ? rest_init+0xf0/0xf0 >> [ 30.308890] kernel_init+0x16/0x1d0 >> [ 30.308890] ? rest_init+0xf0/0xf0 >> [ 30.308890] ret_from_fork+0x22/0x30 >> [ 30.308890] >> [ 30.308890] The buggy address belongs to the page: >> [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x1 pfn:0xbe453 >> [ 30.308890] flags: 0xfffffc0000000() >> [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 >> 0000000000000000 >> [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff >> 0000000000000000 >> [ 30.308890] page dumped because: kasan: bad access detected >> [ 30.308890] page_owner tracks the page as freed >> [ 30.308890] page last allocated via order 0, migratetype Movable, >> gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts 28121288605 >> [ 30.308890] prep_new_page+0xfb/0x140 >> [ 30.308890] get_page_from_freelist+0x3503/0x5730 >> [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 >> [ 30.308890] alloc_pages_vma+0xe2/0x560 >> [ 30.308890] __handle_mm_fault+0x930/0x26c0 >> [ 30.308890] handle_mm_fault+0x1f9/0x810 >> [ 30.308890] do_user_addr_fault+0x6f7/0xca0 >> [ 30.308890] exc_page_fault+0xaf/0x1a0 >> [ 30.308890] asm_exc_page_fault+0x1e/0x30 >> [ 30.308890] page last free stack trace: >> [ 30.308890] free_pcp_prepare+0x122/0x290 >> [ 30.308890] free_unref_page_list+0xe6/0x490 >> [ 30.308890] release_pages+0x2ed/0x1270 >> [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 >> [ 30.308890] tlb_flush_mmu+0x11e/0x680 >> [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 >> [ 30.308890] exit_mmap+0x2b3/0x540 >> [ 30.308890] mmput+0x11d/0x450 >> [ 30.308890] do_exit+0xaa6/0x2d40 >> [ 30.308890] do_group_exit+0x128/0x340 >> [ 30.308890] __x64_sys_exit_group+0x43/0x50 >> [ 30.308890] do_syscall_64+0x37/0x50 >> [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> [ 30.308890] >> [ 30.308890] Memory state around the buggy address: >> [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] ^ >> [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> ff ff >> [ 30.308890] >> ================================================================== >> >> George >>
On Thu, Feb 25, 2021 at 06:23:24PM +0100, David Hildenbrand wrote: > On 25.02.21 17:31, George Kennedy wrote: > > : rsdp_address=bfbfa014 > > [ 0.066612] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > > [ 0.067759] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP > > 00000001 01000013) > > [ 0.069470] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP > > 00000001 BXPC 00000001) > > [ 0.071183] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT > > 00000001 BXPC 00000001) > > [ 0.072876] ACPI: FACS 0x00000000BFBFD000 000040 > > [ 0.073806] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC > > 00000001 BXPC 00000001) > > [ 0.075501] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET > > 00000001 BXPC 00000001) > > [ 0.077194] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 > > 00000002 01000013) > > [ 0.078880] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP > > 00000000 00000000) > > > Can you explore the relevant area using the page-flags tools (located in > Linux src code located in tools/vm/page-flags.c) > > > ./page-types -L -r -a 0xbe490,0xbe4a0 These are not iBFT and they are "ACPI data", so we should have them as PG_Reserved set at init_unavailable_mem(). [ 0.000000] BIOS-e820: [mem 0x0000000000808000-0x000000000080ffff] usable [ 0.000000] BIOS-e820: [mem 0x0000000000810000-0x00000000008fffff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000000900000-0x00000000be49afff] usable ^ iBFT@0xbe453 lives here ^ And it should be a normal page, as it's in "usable" memory and nothing reserves it at boot, so no reason it won't be freed to buddy. If iBFT was in the low memory (<1M) it would have been reserved by reserve_ibft_region(), but with ACPI any block not marked by BIOS as "ACPI something" is treated like a normal memory and there is nothing that reserves it. So we do need to memblock_reserve() iBFT region, but I still couldn't find the right place to properly get its address without duplicating ACPI tables parsing :( [ 0.000000] BIOS-e820: [mem 0x00000000be49b000-0x00000000be49bfff] ACPI data
On Thu, Feb 25, 2021 at 11:31:04AM -0500, George Kennedy wrote: > > > On 2/25/2021 11:07 AM, Mike Rapoport wrote: > > On Thu, Feb 25, 2021 at 10:22:44AM -0500, George Kennedy wrote: > > > > > > > On 2/24/2021 5:37 AM, Mike Rapoport wrote: > > > Applied just your latest patch, but same failure. > > > > > > I thought there was an earlier comment (which I can't find now) that stated > > > that memblock_reserve() wouldn't reserve the page, which is what's needed > > > here. > > Actually, I think that memblock_reserve() should be just fine, but it seems > > I'm missing something in address calculation each time. > > > > What would happen if you stuck > > > > memblock_reserve(0xbe453000, PAGE_SIZE); > > > > say, at the beginning of find_ibft_region()? > > Added debug to your patch and this is all that shows up. Looks like the > patch is in the wrong place as acpi_tb_parse_root_table() is only called for > the RSDP address. Right, but I think it parses table description of the other tables and populates local tables with them. I think the problem is with how I compare the signatures, please see below > [ 0.064317] ACPI: Early table checksum verification disabled > [ 0.065437] XXX acpi_tb_parse_root_table: rsdp_address=bfbfa014 > [ 0.066612] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > [ 0.067759] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP > 00000001 01000013) > [ 0.069470] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP > 00000001 BXPC 00000001) > [ 0.071183] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT > 00000001 BXPC 00000001) > [ 0.072876] ACPI: FACS 0x00000000BFBFD000 000040 > [ 0.073806] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC > 00000001 BXPC 00000001) > [ 0.075501] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET > 00000001 BXPC 00000001) > [ 0.077194] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 > 00000002 01000013) > [ 0.078880] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP > 00000000 00000000) > [ 0.080588] ACPI: Local APIC address 0xfee00000 > > diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c > index dfe1ac3..603b3a8 100644 > --- a/drivers/acpi/acpica/tbutils.c > +++ b/drivers/acpi/acpica/tbutils.c > @@ -7,6 +7,8 @@ > * > *****************************************************************************/ > > +#include <linux/memblock.h> > + > #include <acpi/acpi.h> > #include "accommon.h" > #include "actables.h" > @@ -232,6 +234,8 @@ struct acpi_table_header *acpi_tb_copy_dsdt(u32 > table_index) > acpi_status status; > u32 table_index; > > +printk(KERN_ERR "XXX acpi_tb_parse_root_table: rsdp_address=%llx\n", > rsdp_address); > + > ACPI_FUNCTION_TRACE(tb_parse_root_table); > > /* Map the entire RSDP and extract the address of the RSDT or XSDT */ > @@ -339,6 +343,22 @@ struct acpi_table_header *acpi_tb_copy_dsdt(u32 > table_index) > acpi_tb_parse_fadt(); > } > > + if (ACPI_SUCCESS(status) && > + ACPI_COMPARE_NAMESEG(&acpi_gbl_root_table_list. > + tables[table_index].signature, > + ACPI_SIG_IBFT)) { We have: include/acpi/actbl1.h:#define ACPI_SIG_IBFT "IBFT" /* iSCSI Boot Firmware Table */ and the BIOS uses "iBFT", so we need to loop over possible signature variants like iscsi_ibft_find does. Do you mind replacing ACPI_SIG_IBFT with "iBFT" and try again? > + struct acpi_table_header *ibft; > + struct acpi_table_desc *desc; > + > + desc = &acpi_gbl_root_table_list.tables[table_index]; > + status = acpi_tb_get_table(desc, &ibft); > + if (ACPI_SUCCESS(status)) { > +printk(KERN_ERR "XXX acpi_tb_parse_root_table(calling memblock_reserve()): > addres=%llx, ibft->length=%x\n", address, ibft->length); > + memblock_reserve(address, ibft->length); > + acpi_tb_put_table(desc); > + } > + } > + > next_table: > > table_entry += table_entry_size; > > > > > [ 30.308229] iBFT detected.. > > > [ 30.308796] > > > ================================================================== > > > [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 > > > [ 30.308890] Read of size 4 at addr ffff8880be453004 by task swapper/0/1 > > > [ 30.308890] > > > [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.11.0-f9593a0 #12 > > > [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > > 0.0.0 02/06/2015 > > > [ 30.308890] Call Trace: > > > [ 30.308890] dump_stack+0xdb/0x120 > > > [ 30.308890] ? ibft_init+0x134/0xc33 > > > [ 30.308890] print_address_description.constprop.7+0x41/0x60 > > > [ 30.308890] ? ibft_init+0x134/0xc33 > > > [ 30.308890] ? ibft_init+0x134/0xc33 > > > [ 30.308890] kasan_report.cold.10+0x78/0xd1 > > > [ 30.308890] ? ibft_init+0x134/0xc33 > > > [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 > > > [ 30.308890] ibft_init+0x134/0xc33 > > > [ 30.308890] ? write_comp_data+0x2f/0x90 > > > [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 > > > [ 30.308890] ? write_comp_data+0x2f/0x90 > > > [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 > > > [ 30.308890] do_one_initcall+0xc4/0x3e0 > > > [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 > > > [ 30.308890] ? unpoison_range+0x14/0x40 > > > [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 > > > [ 30.308890] ? kernel_init_freeable+0x420/0x652 > > > [ 30.308890] ? __kasan_kmalloc+0x9/0x10 > > > [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 > > > [ 30.308890] kernel_init_freeable+0x596/0x652 > > > [ 30.308890] ? console_on_rootfs+0x7d/0x7d > > > [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 > > > [ 30.308890] ? rest_init+0xf0/0xf0 > > > [ 30.308890] kernel_init+0x16/0x1d0 > > > [ 30.308890] ? rest_init+0xf0/0xf0 > > > [ 30.308890] ret_from_fork+0x22/0x30 > > > [ 30.308890] > > > [ 30.308890] The buggy address belongs to the page: > > > [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 > > > mapping:0000000000000000 index:0x1 pfn:0xbe453 > > > [ 30.308890] flags: 0xfffffc0000000() > > > [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 > > > 0000000000000000 > > > [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff > > > 0000000000000000 > > > [ 30.308890] page dumped because: kasan: bad access detected > > > [ 30.308890] page_owner tracks the page as freed > > > [ 30.308890] page last allocated via order 0, migratetype Movable, > > > gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts 28121288605 > > > [ 30.308890] prep_new_page+0xfb/0x140 > > > [ 30.308890] get_page_from_freelist+0x3503/0x5730 > > > [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 > > > [ 30.308890] alloc_pages_vma+0xe2/0x560 > > > [ 30.308890] __handle_mm_fault+0x930/0x26c0 > > > [ 30.308890] handle_mm_fault+0x1f9/0x810 > > > [ 30.308890] do_user_addr_fault+0x6f7/0xca0 > > > [ 30.308890] exc_page_fault+0xaf/0x1a0 > > > [ 30.308890] asm_exc_page_fault+0x1e/0x30 > > > [ 30.308890] page last free stack trace: > > > [ 30.308890] free_pcp_prepare+0x122/0x290 > > > [ 30.308890] free_unref_page_list+0xe6/0x490 > > > [ 30.308890] release_pages+0x2ed/0x1270 > > > [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 > > > [ 30.308890] tlb_flush_mmu+0x11e/0x680 > > > [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 > > > [ 30.308890] exit_mmap+0x2b3/0x540 > > > [ 30.308890] mmput+0x11d/0x450 > > > [ 30.308890] do_exit+0xaa6/0x2d40 > > > [ 30.308890] do_group_exit+0x128/0x340 > > > [ 30.308890] __x64_sys_exit_group+0x43/0x50 > > > [ 30.308890] do_syscall_64+0x37/0x50 > > > [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > [ 30.308890] > > > [ 30.308890] Memory state around the buggy address: > > > [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > ff ff > > > [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > ff ff > > > [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > ff ff > > > [ 30.308890] ^ > > > [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > ff ff > > > [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > ff ff > > > [ 30.308890] > > > ================================================================== > > > > > > George > > > >
On 2/25/2021 12:33 PM, George Kennedy wrote: > > > On 2/25/2021 11:07 AM, Mike Rapoport wrote: >> On Thu, Feb 25, 2021 at 10:22:44AM -0500, George Kennedy wrote: >>>>>>> On 2/24/2021 5:37 AM, Mike Rapoport wrote: >>> Applied just your latest patch, but same failure. >>> >>> I thought there was an earlier comment (which I can't find now) that >>> stated >>> that memblock_reserve() wouldn't reserve the page, which is what's >>> needed >>> here. >> Actually, I think that memblock_reserve() should be just fine, but it >> seems >> I'm missing something in address calculation each time. >> >> What would happen if you stuck >> >> memblock_reserve(0xbe453000, PAGE_SIZE); >> >> say, at the beginning of find_ibft_region()? > > Good news Mike! > > The above hack in yesterday's last patch works - 10 successful > reboots. See: "BE453" below for the hack. > > I'll modify the patch to use "table_desc->address" instead, which is > the physical address of the table. > > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c > index 7bdc023..c118dd5 100644 > --- a/arch/x86/kernel/acpi/boot.c > +++ b/arch/x86/kernel/acpi/boot.c > @@ -1551,6 +1551,7 @@ void __init acpi_boot_table_init(void) > if (acpi_disabled) > return; > > +#if 0 > /* > * Initialize the ACPI boot-time table parser. > */ > @@ -1558,6 +1559,7 @@ void __init acpi_boot_table_init(void) > disable_acpi(); > return; > } > +#endif > > acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf); > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index 740f3bdb..b045ab2 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -571,16 +571,6 @@ void __init reserve_standard_io_resources(void) > > } > > -static __init void reserve_ibft_region(void) > -{ > - unsigned long addr, size = 0; > - > - addr = find_ibft_region(&size); > - > - if (size) > - memblock_reserve(addr, size); > -} > - > static bool __init snb_gfx_workaround_needed(void) > { > #ifdef CONFIG_PCI > @@ -1033,6 +1023,12 @@ void __init setup_arch(char **cmdline_p) > */ > find_smp_config(); > > + /* > + * Initialize the ACPI boot-time table parser. > + */ > + if (acpi_table_init()) > + disable_acpi(); > + > reserve_ibft_region(); > > early_alloc_pgt_buf(); > diff --git a/drivers/firmware/iscsi_ibft_find.c > b/drivers/firmware/iscsi_ibft_find.c > index 64bb945..95fc1a6 100644 > --- a/drivers/firmware/iscsi_ibft_find.c > +++ b/drivers/firmware/iscsi_ibft_find.c > @@ -47,7 +47,25 @@ > #define VGA_MEM 0xA0000 /* VGA buffer */ > #define VGA_SIZE 0x20000 /* 128kB */ > > -static int __init find_ibft_in_mem(void) > +static void __init *acpi_find_ibft_region(void) > +{ > + int i; > + struct acpi_table_header *table = NULL; > + acpi_status status; > + > + if (acpi_disabled) > + return NULL; > + > + for (i = 0; i < ARRAY_SIZE(ibft_signs) && !ibft_addr; i++) { > + status = acpi_get_table(ibft_signs[i].sign, 0, &table); > + if (ACPI_SUCCESS(status)) > + return table; > + } > + > + return NULL; > +} > + > +static void __init *find_ibft_in_mem(void) > { > unsigned long pos; > unsigned int len = 0; > @@ -70,35 +88,52 @@ static int __init find_ibft_in_mem(void) > /* if the length of the table extends past 1M, > * the table cannot be valid. */ > if (pos + len <= (IBFT_END-1)) { > - ibft_addr = (struct acpi_table_ibft *)virt; > pr_info("iBFT found at 0x%lx.\n", pos); > - goto done; > + return virt; > } > } > } > } > -done: > - return len; > + > + return NULL; > } > + > +static void __init *find_ibft(void) > +{ > + /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > + * only use ACPI for this */ > + if (!efi_enabled(EFI_BOOT)) > + return find_ibft_in_mem(); > + else > + return acpi_find_ibft_region(); > +} > + > /* > * Routine used to find the iSCSI Boot Format Table. The logical > * kernel address is set in the ibft_addr global variable. > */ > -unsigned long __init find_ibft_region(unsigned long *sizep) > +void __init reserve_ibft_region(void) > { > - ibft_addr = NULL; > + struct acpi_table_ibft *table; > + unsigned long size; > > - /* iBFT 1.03 section 1.4.3.1 mandates that UEFI machines will > - * only use ACPI for this */ > + table = find_ibft(); > + if (!table) > + return; > > - if (!efi_enabled(EFI_BOOT)) > - find_ibft_in_mem(); > - > - if (ibft_addr) { > - *sizep = PAGE_ALIGN(ibft_addr->header.length); > - return (u64)virt_to_phys(ibft_addr); > - } > + size = PAGE_ALIGN(table->header.length); > +#if 0 > +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, > virt_to_phys(table)=%llx, size=%lx\n", > + (u64)table, virt_to_phys(table), size); > + memblock_reserve(virt_to_phys(table), size); > +#else > +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, > 0x00000000BE453000, size=%lx\n", > + (u64)table, size); > + memblock_reserve(0x00000000BE453000, size); > +#endif > > - *sizep = 0; > - return 0; > + if (efi_enabled(EFI_BOOT)) > + acpi_put_table(&table->header); > + else > + ibft_addr = table; > } > diff --git a/include/linux/iscsi_ibft.h b/include/linux/iscsi_ibft.h > index b7b45ca..da813c8 100644 > --- a/include/linux/iscsi_ibft.h > +++ b/include/linux/iscsi_ibft.h > @@ -26,13 +26,9 @@ > * mapped address is set in the ibft_addr variable. > */ > #ifdef CONFIG_ISCSI_IBFT_FIND > -unsigned long find_ibft_region(unsigned long *sizep); > +void reserve_ibft_region(void); > #else > -static inline unsigned long find_ibft_region(unsigned long *sizep) > -{ > - *sizep = 0; > - return 0; > -} > +static inline void reserve_ibft_region(void) {} > #endif > > #endif /* ISCSI_IBFT_H */ Mike, To get rid of the 0x00000000BE453000 hardcoding, I added the following patch to your above patch to get the iBFT table "address" to use with memblock_reserve(): diff --git a/drivers/acpi/acpica/tbfind.c b/drivers/acpi/acpica/tbfind.c index 56d81e4..4bc7bf3 100644 --- a/drivers/acpi/acpica/tbfind.c +++ b/drivers/acpi/acpica/tbfind.c @@ -120,3 +120,34 @@ (void)acpi_ut_release_mutex(ACPI_MTX_TABLES); return_ACPI_STATUS(status); } + +acpi_physical_address +acpi_tb_find_table_address(char *signature) +{ + acpi_physical_address address = 0; + struct acpi_table_desc *table_desc; + int i; + + ACPI_FUNCTION_TRACE(tb_find_table_address); + +printk(KERN_ERR "XXX acpi_tb_find_table_address: signature=%s\n", signature); + + (void)acpi_ut_acquire_mutex(ACPI_MTX_TABLES); + for (i = 0; i < acpi_gbl_root_table_list.current_table_count; ++i) { + if (memcmp(&(acpi_gbl_root_table_list.tables[i].signature), + signature, ACPI_NAMESEG_SIZE)) { + + /* Not the requested table */ + + continue; + } + + /* Table with matching signature has been found */ + table_desc = &acpi_gbl_root_table_list.tables[i]; + address = table_desc->address; + } + + (void)acpi_ut_release_mutex(ACPI_MTX_TABLES); +printk(KERN_ERR "XXX acpi_tb_find_table_address(EXIT): address=%llx\n", address); + return address; +} diff --git a/drivers/firmware/iscsi_ibft_find.c b/drivers/firmware/iscsi_ibft_find.c index 95fc1a6..0de70b4 100644 --- a/drivers/firmware/iscsi_ibft_find.c +++ b/drivers/firmware/iscsi_ibft_find.c @@ -28,6 +28,8 @@ #include <asm/mmzone.h> +extern acpi_physical_address acpi_tb_find_table_address(char *signature); + /* * Physical location of iSCSI Boot Format Table. */ @@ -116,24 +118,32 @@ void __init reserve_ibft_region(void) { struct acpi_table_ibft *table; unsigned long size; + acpi_physical_address address; table = find_ibft(); if (!table) return; size = PAGE_ALIGN(table->header.length); + address = acpi_tb_find_table_address(table->header.signature); #if 0 printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, virt_to_phys(table)=%llx, size=%lx\n", (u64)table, virt_to_phys(table), size); memblock_reserve(virt_to_phys(table), size); #else -printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, 0x00000000BE453000, size=%lx\n", - (u64)table, size); - memblock_reserve(0x00000000BE453000, size); +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, address=%llx, size=%lx\n", + (u64)table, address, size); + if (address) + memblock_reserve(address, size); + else + printk(KERN_ERR "%s: Can't find table address\n", __func__); #endif - if (efi_enabled(EFI_BOOT)) + if (efi_enabled(EFI_BOOT)) { +printk(KERN_ERR "XXX reserve_ibft_region: calling acpi_put_table(%llx)\n", (u64)&table->header); acpi_put_table(&table->header); - else + } else { ibft_addr = table; +printk(KERN_ERR "XXX reserve_ibft_region: ibft_addr=%llx\n", (u64)ibft_addr); + } } Debug from the above: [ 0.050646] ACPI: Early table checksum verification disabled [ 0.051778] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) [ 0.052922] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP 00000001 01000013) [ 0.054623] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001) [ 0.056326] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001) [ 0.058016] ACPI: FACS 0x00000000BFBFD000 000040 [ 0.058940] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001) [ 0.060627] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001) [ 0.062304] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 00000002 01000013) [ 0.063987] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP 00000000 00000000) [ 0.065683] XXX acpi_tb_find_table_address: signature=iBFT [ 0.066754] XXX acpi_tb_find_table_address(EXIT): address=be453000 [ 0.067959] XXX reserve_ibft_region: table=ffffffffff240000, address=be453000, size=1000 [ 0.069534] XXX reserve_ibft_region: calling acpi_put_table(ffffffffff240000) Not sure if it's the right thing to do, but added "acpi_tb_find_table_address()" to return the physical address of a table to use with memblock_reserve(). virt_to_phys(table) does not seem to return the physical address for the iBFT table (it would be nice if struct acpi_table_header also had a "address" element for the physical address of the table). Ran 10 successful boots with the above without failure. George > > > Debug from the above: > > [ 0.020293] last_pfn = 0xbfedc max_arch_pfn = 0x400000000 > [ 0.050778] ACPI: Early table checksum verification disabled > [ 0.056475] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > [ 0.057628] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS > BXPCFACP 00000001 01000013) > [ 0.059341] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS > BXPCFACP 00000001 BXPC 00000001) > [ 0.061043] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS > BXPCDSDT 00000001 BXPC 00000001) > [ 0.062740] ACPI: FACS 0x00000000BFBFD000 000040 > [ 0.063673] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS > BXPCAPIC 00000001 BXPC 00000001) > [ 0.065369] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS > BXPCHPET 00000001 BXPC 00000001) > [ 0.067061] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL > EDK2 00000002 01000013) > [ 0.068761] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS > BXPCFACP 00000000 00000000) > [ 0.070461] XXX reserve_ibft_region: table=ffffffffff240000, > 0x00000000BE453000, size=1000 > [ 0.072231] check: Scanning 1 areas for low memory corruption > > George >>> [ 30.308229] iBFT detected.. >>> [ 30.308796] >>> ================================================================== >>> [ 30.308890] BUG: KASAN: use-after-free in ibft_init+0x134/0xc33 >>> [ 30.308890] Read of size 4 at addr ffff8880be453004 by task >>> swapper/0/1 >>> [ 30.308890] >>> [ 30.308890] CPU: 1 PID: 1 Comm: swapper/0 Not tainted >>> 5.11.0-f9593a0 #12 >>> [ 30.308890] Hardware name: QEMU Standard PC (i440FX + PIIX, >>> 1996), BIOS >>> 0.0.0 02/06/2015 >>> [ 30.308890] Call Trace: >>> [ 30.308890] dump_stack+0xdb/0x120 >>> [ 30.308890] ? ibft_init+0x134/0xc33 >>> [ 30.308890] print_address_description.constprop.7+0x41/0x60 >>> [ 30.308890] ? ibft_init+0x134/0xc33 >>> [ 30.308890] ? ibft_init+0x134/0xc33 >>> [ 30.308890] kasan_report.cold.10+0x78/0xd1 >>> [ 30.308890] ? ibft_init+0x134/0xc33 >>> [ 30.308890] __asan_report_load_n_noabort+0xf/0x20 >>> [ 30.308890] ibft_init+0x134/0xc33 >>> [ 30.308890] ? write_comp_data+0x2f/0x90 >>> [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 >>> [ 30.308890] ? write_comp_data+0x2f/0x90 >>> [ 30.308890] ? ibft_check_initiator_for+0x159/0x159 >>> [ 30.308890] do_one_initcall+0xc4/0x3e0 >>> [ 30.308890] ? perf_trace_initcall_level+0x3e0/0x3e0 >>> [ 30.308890] ? unpoison_range+0x14/0x40 >>> [ 30.308890] ? ____kasan_kmalloc.constprop.5+0x8f/0xc0 >>> [ 30.308890] ? kernel_init_freeable+0x420/0x652 >>> [ 30.308890] ? __kasan_kmalloc+0x9/0x10 >>> [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 >>> [ 30.308890] kernel_init_freeable+0x596/0x652 >>> [ 30.308890] ? console_on_rootfs+0x7d/0x7d >>> [ 30.308890] ? __sanitizer_cov_trace_pc+0x21/0x50 >>> [ 30.308890] ? rest_init+0xf0/0xf0 >>> [ 30.308890] kernel_init+0x16/0x1d0 >>> [ 30.308890] ? rest_init+0xf0/0xf0 >>> [ 30.308890] ret_from_fork+0x22/0x30 >>> [ 30.308890] >>> [ 30.308890] The buggy address belongs to the page: >>> [ 30.308890] page:0000000001b7b17c refcount:0 mapcount:0 >>> mapping:0000000000000000 index:0x1 pfn:0xbe453 >>> [ 30.308890] flags: 0xfffffc0000000() >>> [ 30.308890] raw: 000fffffc0000000 ffffea0002ef9788 ffffea0002f91488 >>> 0000000000000000 >>> [ 30.308890] raw: 0000000000000001 0000000000000000 00000000ffffffff >>> 0000000000000000 >>> [ 30.308890] page dumped because: kasan: bad access detected >>> [ 30.308890] page_owner tracks the page as freed >>> [ 30.308890] page last allocated via order 0, migratetype Movable, >>> gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 204, ts >>> 28121288605 >>> [ 30.308890] prep_new_page+0xfb/0x140 >>> [ 30.308890] get_page_from_freelist+0x3503/0x5730 >>> [ 30.308890] __alloc_pages_nodemask+0x2d8/0x650 >>> [ 30.308890] alloc_pages_vma+0xe2/0x560 >>> [ 30.308890] __handle_mm_fault+0x930/0x26c0 >>> [ 30.308890] handle_mm_fault+0x1f9/0x810 >>> [ 30.308890] do_user_addr_fault+0x6f7/0xca0 >>> [ 30.308890] exc_page_fault+0xaf/0x1a0 >>> [ 30.308890] asm_exc_page_fault+0x1e/0x30 >>> [ 30.308890] page last free stack trace: >>> [ 30.308890] free_pcp_prepare+0x122/0x290 >>> [ 30.308890] free_unref_page_list+0xe6/0x490 >>> [ 30.308890] release_pages+0x2ed/0x1270 >>> [ 30.308890] free_pages_and_swap_cache+0x245/0x2e0 >>> [ 30.308890] tlb_flush_mmu+0x11e/0x680 >>> [ 30.308890] tlb_finish_mmu+0xa6/0x3e0 >>> [ 30.308890] exit_mmap+0x2b3/0x540 >>> [ 30.308890] mmput+0x11d/0x450 >>> [ 30.308890] do_exit+0xaa6/0x2d40 >>> [ 30.308890] do_group_exit+0x128/0x340 >>> [ 30.308890] __x64_sys_exit_group+0x43/0x50 >>> [ 30.308890] do_syscall_64+0x37/0x50 >>> [ 30.308890] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> [ 30.308890] >>> [ 30.308890] Memory state around the buggy address: >>> [ 30.308890] ffff8880be452f00: ff ff ff ff ff ff ff ff ff ff ff >>> ff ff ff >>> ff ff >>> [ 30.308890] ffff8880be452f80: ff ff ff ff ff ff ff ff ff ff ff >>> ff ff ff >>> ff ff >>> [ 30.308890] >ffff8880be453000: ff ff ff ff ff ff ff ff ff ff ff >>> ff ff ff >>> ff ff >>> [ 30.308890] ^ >>> [ 30.308890] ffff8880be453080: ff ff ff ff ff ff ff ff ff ff ff >>> ff ff ff >>> ff ff >>> [ 30.308890] ffff8880be453100: ff ff ff ff ff ff ff ff ff ff ff >>> ff ff ff >>> ff ff >>> [ 30.308890] >>> ================================================================== >>> >>> George >>> >
Hi George, On Thu, Feb 25, 2021 at 08:19:18PM -0500, George Kennedy wrote: > > Mike, > > To get rid of the 0x00000000BE453000 hardcoding, I added the following patch > to your above patch to get the iBFT table "address" to use with > memblock_reserve(): > > diff --git a/drivers/acpi/acpica/tbfind.c b/drivers/acpi/acpica/tbfind.c > index 56d81e4..4bc7bf3 100644 > --- a/drivers/acpi/acpica/tbfind.c > +++ b/drivers/acpi/acpica/tbfind.c > @@ -120,3 +120,34 @@ > (void)acpi_ut_release_mutex(ACPI_MTX_TABLES); > return_ACPI_STATUS(status); > } > + > +acpi_physical_address > +acpi_tb_find_table_address(char *signature) > +{ > + acpi_physical_address address = 0; > + struct acpi_table_desc *table_desc; > + int i; > + > + ACPI_FUNCTION_TRACE(tb_find_table_address); > + > +printk(KERN_ERR "XXX acpi_tb_find_table_address: signature=%s\n", > signature); > + > + (void)acpi_ut_acquire_mutex(ACPI_MTX_TABLES); > + for (i = 0; i < acpi_gbl_root_table_list.current_table_count; ++i) { > + if (memcmp(&(acpi_gbl_root_table_list.tables[i].signature), > + signature, ACPI_NAMESEG_SIZE)) { > + > + /* Not the requested table */ > + > + continue; > + } > + > + /* Table with matching signature has been found */ > + table_desc = &acpi_gbl_root_table_list.tables[i]; > + address = table_desc->address; > + } > + > + (void)acpi_ut_release_mutex(ACPI_MTX_TABLES); > +printk(KERN_ERR "XXX acpi_tb_find_table_address(EXIT): address=%llx\n", > address); > + return address; > +} > diff --git a/drivers/firmware/iscsi_ibft_find.c > b/drivers/firmware/iscsi_ibft_find.c > index 95fc1a6..0de70b4 100644 > --- a/drivers/firmware/iscsi_ibft_find.c > +++ b/drivers/firmware/iscsi_ibft_find.c > @@ -28,6 +28,8 @@ > > #include <asm/mmzone.h> > > +extern acpi_physical_address acpi_tb_find_table_address(char *signature); > + > /* > * Physical location of iSCSI Boot Format Table. > */ > @@ -116,24 +118,32 @@ void __init reserve_ibft_region(void) > { > struct acpi_table_ibft *table; > unsigned long size; > + acpi_physical_address address; > > table = find_ibft(); > if (!table) > return; > > size = PAGE_ALIGN(table->header.length); > + address = acpi_tb_find_table_address(table->header.signature); > #if 0 > printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, > virt_to_phys(table)=%llx, size=%lx\n", > (u64)table, virt_to_phys(table), size); > memblock_reserve(virt_to_phys(table), size); > #else > -printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, 0x00000000BE453000, > size=%lx\n", > - (u64)table, size); > - memblock_reserve(0x00000000BE453000, size); > +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, address=%llx, > size=%lx\n", > + (u64)table, address, size); > + if (address) > + memblock_reserve(address, size); > + else > + printk(KERN_ERR "%s: Can't find table address\n", __func__); > #endif > > - if (efi_enabled(EFI_BOOT)) > + if (efi_enabled(EFI_BOOT)) { > +printk(KERN_ERR "XXX reserve_ibft_region: calling acpi_put_table(%llx)\n", > (u64)&table->header); > acpi_put_table(&table->header); > - else > + } else { > ibft_addr = table; > +printk(KERN_ERR "XXX reserve_ibft_region: ibft_addr=%llx\n", > (u64)ibft_addr); > + } > } > > Debug from the above: > [ 0.050646] ACPI: Early table checksum verification disabled > [ 0.051778] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) > [ 0.052922] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP > 00000001 01000013) > [ 0.054623] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP > 00000001 BXPC 00000001) > [ 0.056326] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT > 00000001 BXPC 00000001) > [ 0.058016] ACPI: FACS 0x00000000BFBFD000 000040 > [ 0.058940] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC > 00000001 BXPC 00000001) > [ 0.060627] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET > 00000001 BXPC 00000001) > [ 0.062304] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 > 00000002 01000013) > [ 0.063987] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP > 00000000 00000000) > [ 0.065683] XXX acpi_tb_find_table_address: signature=iBFT > [ 0.066754] XXX acpi_tb_find_table_address(EXIT): address=be453000 > [ 0.067959] XXX reserve_ibft_region: table=ffffffffff240000, > address=be453000, size=1000 > [ 0.069534] XXX reserve_ibft_region: calling > acpi_put_table(ffffffffff240000) > > Not sure if it's the right thing to do, but added > "acpi_tb_find_table_address()" to return the physical address of a table to > use with memblock_reserve(). > > virt_to_phys(table) does not seem to return the physical address for the > iBFT table (it would be nice if struct acpi_table_header also had a > "address" element for the physical address of the table). virt_to_phys() does not work that early because then it is mapped with early_memremap() which uses different virtual to physical scheme. I'd say that acpi_tb_find_table_address() makes sense if we'd like to reserve ACPI tables outside of drivers/acpi. But probably we should simply reserve all the tables during acpi_table_init() so that any table that firmware put in the normal memory will be surely reserved. > Ran 10 successful boots with the above without failure. That's good news indeed :) > George > > > >
Hi Mike, On 2/26/2021 6:17 AM, Mike Rapoport wrote: > Hi George, > > On Thu, Feb 25, 2021 at 08:19:18PM -0500, George Kennedy wrote: >> Mike, >> >> To get rid of the 0x00000000BE453000 hardcoding, I added the following patch >> to your above patch to get the iBFT table "address" to use with >> memblock_reserve(): >> >> diff --git a/drivers/acpi/acpica/tbfind.c b/drivers/acpi/acpica/tbfind.c >> index 56d81e4..4bc7bf3 100644 >> --- a/drivers/acpi/acpica/tbfind.c >> +++ b/drivers/acpi/acpica/tbfind.c >> @@ -120,3 +120,34 @@ >> (void)acpi_ut_release_mutex(ACPI_MTX_TABLES); >> return_ACPI_STATUS(status); >> } >> + >> +acpi_physical_address >> +acpi_tb_find_table_address(char *signature) >> +{ >> + acpi_physical_address address = 0; >> + struct acpi_table_desc *table_desc; >> + int i; >> + >> + ACPI_FUNCTION_TRACE(tb_find_table_address); >> + >> +printk(KERN_ERR "XXX acpi_tb_find_table_address: signature=%s\n", >> signature); >> + >> + (void)acpi_ut_acquire_mutex(ACPI_MTX_TABLES); >> + for (i = 0; i < acpi_gbl_root_table_list.current_table_count; ++i) { >> + if (memcmp(&(acpi_gbl_root_table_list.tables[i].signature), >> + signature, ACPI_NAMESEG_SIZE)) { >> + >> + /* Not the requested table */ >> + >> + continue; >> + } >> + >> + /* Table with matching signature has been found */ >> + table_desc = &acpi_gbl_root_table_list.tables[i]; >> + address = table_desc->address; >> + } >> + >> + (void)acpi_ut_release_mutex(ACPI_MTX_TABLES); >> +printk(KERN_ERR "XXX acpi_tb_find_table_address(EXIT): address=%llx\n", >> address); >> + return address; >> +} >> diff --git a/drivers/firmware/iscsi_ibft_find.c >> b/drivers/firmware/iscsi_ibft_find.c >> index 95fc1a6..0de70b4 100644 >> --- a/drivers/firmware/iscsi_ibft_find.c >> +++ b/drivers/firmware/iscsi_ibft_find.c >> @@ -28,6 +28,8 @@ >> >> #include <asm/mmzone.h> >> >> +extern acpi_physical_address acpi_tb_find_table_address(char *signature); >> + >> /* >> * Physical location of iSCSI Boot Format Table. >> */ >> @@ -116,24 +118,32 @@ void __init reserve_ibft_region(void) >> { >> struct acpi_table_ibft *table; >> unsigned long size; >> + acpi_physical_address address; >> >> table = find_ibft(); >> if (!table) >> return; >> >> size = PAGE_ALIGN(table->header.length); >> + address = acpi_tb_find_table_address(table->header.signature); >> #if 0 >> printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, >> virt_to_phys(table)=%llx, size=%lx\n", >> (u64)table, virt_to_phys(table), size); >> memblock_reserve(virt_to_phys(table), size); >> #else >> -printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, 0x00000000BE453000, >> size=%lx\n", >> - (u64)table, size); >> - memblock_reserve(0x00000000BE453000, size); >> +printk(KERN_ERR "XXX reserve_ibft_region: table=%llx, address=%llx, >> size=%lx\n", >> + (u64)table, address, size); >> + if (address) >> + memblock_reserve(address, size); >> + else >> + printk(KERN_ERR "%s: Can't find table address\n", __func__); >> #endif >> >> - if (efi_enabled(EFI_BOOT)) >> + if (efi_enabled(EFI_BOOT)) { >> +printk(KERN_ERR "XXX reserve_ibft_region: calling acpi_put_table(%llx)\n", >> (u64)&table->header); >> acpi_put_table(&table->header); >> - else >> + } else { >> ibft_addr = table; >> +printk(KERN_ERR "XXX reserve_ibft_region: ibft_addr=%llx\n", >> (u64)ibft_addr); >> + } >> } >> >> Debug from the above: >> [ 0.050646] ACPI: Early table checksum verification disabled >> [ 0.051778] ACPI: RSDP 0x00000000BFBFA014 000024 (v02 BOCHS ) >> [ 0.052922] ACPI: XSDT 0x00000000BFBF90E8 00004C (v01 BOCHS BXPCFACP >> 00000001 01000013) >> [ 0.054623] ACPI: FACP 0x00000000BFBF5000 000074 (v01 BOCHS BXPCFACP >> 00000001 BXPC 00000001) >> [ 0.056326] ACPI: DSDT 0x00000000BFBF6000 00238D (v01 BOCHS BXPCDSDT >> 00000001 BXPC 00000001) >> [ 0.058016] ACPI: FACS 0x00000000BFBFD000 000040 >> [ 0.058940] ACPI: APIC 0x00000000BFBF4000 000090 (v01 BOCHS BXPCAPIC >> 00000001 BXPC 00000001) >> [ 0.060627] ACPI: HPET 0x00000000BFBF3000 000038 (v01 BOCHS BXPCHPET >> 00000001 BXPC 00000001) >> [ 0.062304] ACPI: BGRT 0x00000000BE49B000 000038 (v01 INTEL EDK2 >> 00000002 01000013) >> [ 0.063987] ACPI: iBFT 0x00000000BE453000 000800 (v01 BOCHS BXPCFACP >> 00000000 00000000) >> [ 0.065683] XXX acpi_tb_find_table_address: signature=iBFT >> [ 0.066754] XXX acpi_tb_find_table_address(EXIT): address=be453000 >> [ 0.067959] XXX reserve_ibft_region: table=ffffffffff240000, >> address=be453000, size=1000 >> [ 0.069534] XXX reserve_ibft_region: calling >> acpi_put_table(ffffffffff240000) >> >> Not sure if it's the right thing to do, but added >> "acpi_tb_find_table_address()" to return the physical address of a table to >> use with memblock_reserve(). >> >> virt_to_phys(table) does not seem to return the physical address for the >> iBFT table (it would be nice if struct acpi_table_header also had a >> "address" element for the physical address of the table). > virt_to_phys() does not work that early because then it is mapped with > early_memremap() which uses different virtual to physical scheme. > > I'd say that acpi_tb_find_table_address() makes sense if we'd like to > reserve ACPI tables outside of drivers/acpi. > > But probably we should simply reserve all the tables during > acpi_table_init() so that any table that firmware put in the normal memory > will be surely reserved. > >> Ran 10 successful boots with the above without failure. > That's good news indeed :) Wondering if we could do something like this instead (trying to keep changes minimal). Just do the memblock_reserve() for all the standard tables. diff --git a/drivers/acpi/acpica/tbinstal.c b/drivers/acpi/acpica/tbinstal.c index 0bb15ad..830f82c 100644 --- a/drivers/acpi/acpica/tbinstal.c +++ b/drivers/acpi/acpica/tbinstal.c @@ -7,6 +7,7 @@ * *****************************************************************************/ +#include <linux/memblock.h> #include <acpi/acpi.h> #include "accommon.h" #include "actables.h" @@ -14,6 +15,23 @@ #define _COMPONENT ACPI_TABLES ACPI_MODULE_NAME("tbinstal") +void +acpi_tb_reserve_standard_table(acpi_physical_address address, + struct acpi_table_header *header) +{ + struct acpi_table_header local_header; + + if ((ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_FACS)) || + (ACPI_VALIDATE_RSDP_SIG(header->signature))) { + return; + } + /* Standard ACPI table with full common header */ + + memcpy(&local_header, header, sizeof(struct acpi_table_header)); + + memblock_reserve(address, PAGE_ALIGN(local_header.length)); +} + /******************************************************************************* * * FUNCTION: acpi_tb_install_table_with_override @@ -58,6 +76,9 @@ new_table_desc->flags, new_table_desc->pointer); + acpi_tb_reserve_standard_table(new_table_desc->address, + new_table_desc->pointer); + acpi_tb_print_table_header(new_table_desc->address, new_table_desc->pointer); There should be no harm in doing the memblock_reserve() for all the standard tables, right? Ran 10 boots with the above without failure. George >> George >>>
On Fri, Feb 26, 2021 at 11:16:06AM -0500, George Kennedy wrote: > On 2/26/2021 6:17 AM, Mike Rapoport wrote: > > Hi George, > > > > On Thu, Feb 25, 2021 at 08:19:18PM -0500, George Kennedy wrote: > > > > > > Not sure if it's the right thing to do, but added > > > "acpi_tb_find_table_address()" to return the physical address of a table to > > > use with memblock_reserve(). > > > > > > virt_to_phys(table) does not seem to return the physical address for the > > > iBFT table (it would be nice if struct acpi_table_header also had a > > > "address" element for the physical address of the table). > > > > virt_to_phys() does not work that early because then it is mapped with > > early_memremap() which uses different virtual to physical scheme. > > > > I'd say that acpi_tb_find_table_address() makes sense if we'd like to > > reserve ACPI tables outside of drivers/acpi. > > > > But probably we should simply reserve all the tables during > > acpi_table_init() so that any table that firmware put in the normal memory > > will be surely reserved. > > > Ran 10 successful boots with the above without failure. > > That's good news indeed :) > > Wondering if we could do something like this instead (trying to keep changes > minimal). Just do the memblock_reserve() for all the standard tables. I think something like this should work, but I'm not an ACPI expert to say if this the best way to reserve the tables. > diff --git a/drivers/acpi/acpica/tbinstal.c b/drivers/acpi/acpica/tbinstal.c > index 0bb15ad..830f82c 100644 > --- a/drivers/acpi/acpica/tbinstal.c > +++ b/drivers/acpi/acpica/tbinstal.c > @@ -7,6 +7,7 @@ > * > *****************************************************************************/ > > +#include <linux/memblock.h> > #include <acpi/acpi.h> > #include "accommon.h" > #include "actables.h" > @@ -14,6 +15,23 @@ > #define _COMPONENT ACPI_TABLES > ACPI_MODULE_NAME("tbinstal") > > +void > +acpi_tb_reserve_standard_table(acpi_physical_address address, > + struct acpi_table_header *header) > +{ > + struct acpi_table_header local_header; > + > + if ((ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_FACS)) || > + (ACPI_VALIDATE_RSDP_SIG(header->signature))) { > + return; > + } > + /* Standard ACPI table with full common header */ > + > + memcpy(&local_header, header, sizeof(struct acpi_table_header)); > + > + memblock_reserve(address, PAGE_ALIGN(local_header.length)); > +} > + > /******************************************************************************* > * > * FUNCTION: acpi_tb_install_table_with_override > @@ -58,6 +76,9 @@ > new_table_desc->flags, > new_table_desc->pointer); > > + acpi_tb_reserve_standard_table(new_table_desc->address, > + new_table_desc->pointer); > + > acpi_tb_print_table_header(new_table_desc->address, > new_table_desc->pointer); > > There should be no harm in doing the memblock_reserve() for all the standard > tables, right? It should be ok to memblock_reserve() all the tables very early as long as we don't run out of static entries in memblock.reserved. We just need to make sure the tables are reserved before memblock allocations are possible, so we'd still need to move acpi_table_init() in x86::setup_arch() before e820__memblock_setup(). Not sure how early ACPI is initialized on arm64. > Ran 10 boots with the above without failure. > > George
On 2/28/2021 1:08 PM, Mike Rapoport wrote: > On Fri, Feb 26, 2021 at 11:16:06AM -0500, George Kennedy wrote: >> On 2/26/2021 6:17 AM, Mike Rapoport wrote: >>> Hi George, >>> >>> On Thu, Feb 25, 2021 at 08:19:18PM -0500, George Kennedy wrote: >>>> Not sure if it's the right thing to do, but added >>>> "acpi_tb_find_table_address()" to return the physical address of a table to >>>> use with memblock_reserve(). >>>> >>>> virt_to_phys(table) does not seem to return the physical address for the >>>> iBFT table (it would be nice if struct acpi_table_header also had a >>>> "address" element for the physical address of the table). >>> virt_to_phys() does not work that early because then it is mapped with >>> early_memremap() which uses different virtual to physical scheme. >>> >>> I'd say that acpi_tb_find_table_address() makes sense if we'd like to >>> reserve ACPI tables outside of drivers/acpi. >>> >>> But probably we should simply reserve all the tables during >>> acpi_table_init() so that any table that firmware put in the normal memory >>> will be surely reserved. >>>> Ran 10 successful boots with the above without failure. >>> That's good news indeed :) >> Wondering if we could do something like this instead (trying to keep changes >> minimal). Just do the memblock_reserve() for all the standard tables. > I think something like this should work, but I'm not an ACPI expert to say > if this the best way to reserve the tables. Adding ACPI maintainers to the CC list. > >> diff --git a/drivers/acpi/acpica/tbinstal.c b/drivers/acpi/acpica/tbinstal.c >> index 0bb15ad..830f82c 100644 >> --- a/drivers/acpi/acpica/tbinstal.c >> +++ b/drivers/acpi/acpica/tbinstal.c >> @@ -7,6 +7,7 @@ >> * >> *****************************************************************************/ >> >> +#include <linux/memblock.h> >> #include <acpi/acpi.h> >> #include "accommon.h" >> #include "actables.h" >> @@ -14,6 +15,23 @@ >> #define _COMPONENT ACPI_TABLES >> ACPI_MODULE_NAME("tbinstal") >> >> +void >> +acpi_tb_reserve_standard_table(acpi_physical_address address, >> + struct acpi_table_header *header) >> +{ >> + struct acpi_table_header local_header; >> + >> + if ((ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_FACS)) || >> + (ACPI_VALIDATE_RSDP_SIG(header->signature))) { >> + return; >> + } >> + /* Standard ACPI table with full common header */ >> + >> + memcpy(&local_header, header, sizeof(struct acpi_table_header)); >> + >> + memblock_reserve(address, PAGE_ALIGN(local_header.length)); >> +} >> + >> /******************************************************************************* >> * >> * FUNCTION: acpi_tb_install_table_with_override >> @@ -58,6 +76,9 @@ >> new_table_desc->flags, >> new_table_desc->pointer); >> >> + acpi_tb_reserve_standard_table(new_table_desc->address, >> + new_table_desc->pointer); >> + >> acpi_tb_print_table_header(new_table_desc->address, >> new_table_desc->pointer); >> >> There should be no harm in doing the memblock_reserve() for all the standard >> tables, right? > It should be ok to memblock_reserve() all the tables very early as long as > we don't run out of static entries in memblock.reserved. > > We just need to make sure the tables are reserved before memblock > allocations are possible, so we'd still need to move acpi_table_init() in > x86::setup_arch() before e820__memblock_setup(). > Not sure how early ACPI is initialized on arm64. Thanks Mike. Will try to move the memblock_reserves() before e820__memblock_setup(). George > >> Ran 10 boots with the above without failure. >> >> George
On 3/1/2021 9:29 AM, George Kennedy wrote: > > > On 2/28/2021 1:08 PM, Mike Rapoport wrote: >> On Fri, Feb 26, 2021 at 11:16:06AM -0500, George Kennedy wrote: >>> On 2/26/2021 6:17 AM, Mike Rapoport wrote: >>>> Hi George, >>>> >>>> On Thu, Feb 25, 2021 at 08:19:18PM -0500, George Kennedy wrote: >>>>> Not sure if it's the right thing to do, but added >>>>> "acpi_tb_find_table_address()" to return the physical address of a >>>>> table to >>>>> use with memblock_reserve(). >>>>> >>>>> virt_to_phys(table) does not seem to return the physical address >>>>> for the >>>>> iBFT table (it would be nice if struct acpi_table_header also had a >>>>> "address" element for the physical address of the table). >>>> virt_to_phys() does not work that early because then it is mapped with >>>> early_memremap() which uses different virtual to physical scheme. >>>> >>>> I'd say that acpi_tb_find_table_address() makes sense if we'd like to >>>> reserve ACPI tables outside of drivers/acpi. >>>> >>>> But probably we should simply reserve all the tables during >>>> acpi_table_init() so that any table that firmware put in the normal >>>> memory >>>> will be surely reserved. >>>>> Ran 10 successful boots with the above without failure. >>>> That's good news indeed :) >>> Wondering if we could do something like this instead (trying to keep >>> changes >>> minimal). Just do the memblock_reserve() for all the standard tables. >> I think something like this should work, but I'm not an ACPI expert >> to say >> if this the best way to reserve the tables. > Adding ACPI maintainers to the CC list. >>> diff --git a/drivers/acpi/acpica/tbinstal.c >>> b/drivers/acpi/acpica/tbinstal.c >>> index 0bb15ad..830f82c 100644 >>> --- a/drivers/acpi/acpica/tbinstal.c >>> +++ b/drivers/acpi/acpica/tbinstal.c >>> @@ -7,6 +7,7 @@ >>> * >>> *****************************************************************************/ >>> >>> >>> +#include <linux/memblock.h> >>> #include <acpi/acpi.h> >>> #include "accommon.h" >>> #include "actables.h" >>> @@ -14,6 +15,23 @@ >>> #define _COMPONENT ACPI_TABLES >>> ACPI_MODULE_NAME("tbinstal") >>> >>> +void >>> +acpi_tb_reserve_standard_table(acpi_physical_address address, >>> + struct acpi_table_header *header) >>> +{ >>> + struct acpi_table_header local_header; >>> + >>> + if ((ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_FACS)) || >>> + (ACPI_VALIDATE_RSDP_SIG(header->signature))) { >>> + return; >>> + } >>> + /* Standard ACPI table with full common header */ >>> + >>> + memcpy(&local_header, header, sizeof(struct acpi_table_header)); >>> + >>> + memblock_reserve(address, PAGE_ALIGN(local_header.length)); >>> +} >>> + >>> /******************************************************************************* >>> >>> * >>> * FUNCTION: acpi_tb_install_table_with_override >>> @@ -58,6 +76,9 @@ >>> new_table_desc->flags, >>> new_table_desc->pointer); >>> >>> + acpi_tb_reserve_standard_table(new_table_desc->address, >>> + new_table_desc->pointer); >>> + >>> acpi_tb_print_table_header(new_table_desc->address, >>> new_table_desc->pointer); >>> >>> There should be no harm in doing the memblock_reserve() for all the >>> standard >>> tables, right? >> It should be ok to memblock_reserve() all the tables very early as >> long as >> we don't run out of static entries in memblock.reserved. >> >> We just need to make sure the tables are reserved before memblock >> allocations are possible, so we'd still need to move >> acpi_table_init() in >> x86::setup_arch() before e820__memblock_setup(). >> Not sure how early ACPI is initialized on arm64. > > Thanks Mike. Will try to move the memblock_reserves() before > e820__memblock_setup(). Hi Mike, Moved acpi_table_init() in x86::setup_arch() before e820__memblock_setup() as you suggested. Ran 10 boots with the following without error. diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 740f3bdb..3b1dd24 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1047,6 +1047,7 @@ void __init setup_arch(char **cmdline_p) cleanup_highmap(); memblock_set_current_limit(ISA_END_ADDRESS); + acpi_boot_table_init(); e820__memblock_setup(); /* @@ -1140,8 +1141,6 @@ void __init setup_arch(char **cmdline_p) /* * Parse the ACPI tables for possible boot-time SMP configuration. */ - acpi_boot_table_init(); - early_acpi_boot_init(); initmem_init(); diff --git a/drivers/acpi/acpica/tbinstal.c b/drivers/acpi/acpica/tbinstal.c index 0bb15ad..7830109 100644 --- a/drivers/acpi/acpica/tbinstal.c +++ b/drivers/acpi/acpica/tbinstal.c @@ -7,6 +7,7 @@ * *****************************************************************************/ +#include <linux/memblock.h> #include <acpi/acpi.h> #include "accommon.h" #include "actables.h" @@ -16,6 +17,33 @@ /******************************************************************************* * + * FUNCTION: acpi_tb_reserve_standard_table + * + * PARAMETERS: address - Table physical address + * header - Table header + * + * RETURN: None + * + * DESCRIPTION: To avoid an acpi table page from being "stolen" by the buddy + * allocator run memblock_reserve() on all the standard acpi tables. + * + ******************************************************************************/ +void +acpi_tb_reserve_standard_table(acpi_physical_address address, + struct acpi_table_header *header) +{ + if ((ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_FACS)) || + (ACPI_VALIDATE_RSDP_SIG(header->signature))) + return; + + if (header->length > PAGE_SIZE) /* same check as in acpi_map() */ + return; + + memblock_reserve(address, PAGE_ALIGN(header->length)); +} + +/******************************************************************************* + * * FUNCTION: acpi_tb_install_table_with_override * * PARAMETERS: new_table_desc - New table descriptor to install @@ -58,6 +86,9 @@ new_table_desc->flags, new_table_desc->pointer); + acpi_tb_reserve_standard_table(new_table_desc->address, + new_table_desc->pointer); + acpi_tb_print_table_header(new_table_desc->address, new_table_desc->pointer); George > > George >>> Ran 10 boots with the above without failure. >>> >>> George >
Hi George, On Mon, Mar 01, 2021 at 08:20:45PM -0500, George Kennedy wrote: > > > > > > > > > There should be no harm in doing the memblock_reserve() for all > > > > the standard > > > > tables, right? > > > It should be ok to memblock_reserve() all the tables very early as > > > long as > > > we don't run out of static entries in memblock.reserved. > > > > > > We just need to make sure the tables are reserved before memblock > > > allocations are possible, so we'd still need to move > > > acpi_table_init() in > > > x86::setup_arch() before e820__memblock_setup(). > > > Not sure how early ACPI is initialized on arm64. > > > > Thanks Mike. Will try to move the memblock_reserves() before > > e820__memblock_setup(). > > Hi Mike, > > Moved acpi_table_init() in x86::setup_arch() before e820__memblock_setup() > as you suggested. > > Ran 10 boots with the following without error. I'd suggest to send it as a formal patch to see what x86 and ACPI folks have to say about this. > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index 740f3bdb..3b1dd24 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -1047,6 +1047,7 @@ void __init setup_arch(char **cmdline_p) > cleanup_highmap(); > > memblock_set_current_limit(ISA_END_ADDRESS); > + acpi_boot_table_init(); > e820__memblock_setup(); > > /* > @@ -1140,8 +1141,6 @@ void __init setup_arch(char **cmdline_p) > /* > * Parse the ACPI tables for possible boot-time SMP configuration. > */ > - acpi_boot_table_init(); > - > early_acpi_boot_init(); > > initmem_init(); > diff --git a/drivers/acpi/acpica/tbinstal.c b/drivers/acpi/acpica/tbinstal.c > index 0bb15ad..7830109 100644 > --- a/drivers/acpi/acpica/tbinstal.c > +++ b/drivers/acpi/acpica/tbinstal.c > @@ -7,6 +7,7 @@ > * > *****************************************************************************/ > > +#include <linux/memblock.h> > #include <acpi/acpi.h> > #include "accommon.h" > #include "actables.h" > @@ -16,6 +17,33 @@ > > /******************************************************************************* > * > + * FUNCTION: acpi_tb_reserve_standard_table > + * > + * PARAMETERS: address - Table physical address > + * header - Table header > + * > + * RETURN: None > + * > + * DESCRIPTION: To avoid an acpi table page from being "stolen" by the > buddy > + * allocator run memblock_reserve() on all the standard acpi > tables. > + * > + ******************************************************************************/ > +void > +acpi_tb_reserve_standard_table(acpi_physical_address address, > + struct acpi_table_header *header) > +{ > + if ((ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_FACS)) || > + (ACPI_VALIDATE_RSDP_SIG(header->signature))) > + return; > + Why these should be excluded? > + if (header->length > PAGE_SIZE) /* same check as in acpi_map() */ > + return; I don't think this is required, I believe acpi_map() has this check because kmap() cannot handle multiple pages. > + > + memblock_reserve(address, PAGE_ALIGN(header->length)); > +} > + > +/******************************************************************************* > + * > * FUNCTION: acpi_tb_install_table_with_override > * > * PARAMETERS: new_table_desc - New table descriptor to install > @@ -58,6 +86,9 @@ > new_table_desc->flags, > new_table_desc->pointer); > > + acpi_tb_reserve_standard_table(new_table_desc->address, > + new_table_desc->pointer); > + > acpi_tb_print_table_header(new_table_desc->address, > new_table_desc->pointer); > > George
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0b55c9c95364..f10966e3b4a5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -108,6 +108,17 @@ typedef int __bitwise fpi_t; */ #define FPI_TO_TAIL ((__force fpi_t)BIT(1)) +/* + * Don't poison memory with KASAN. + * During boot, all non-reserved memblock memory is exposed to the buddy + * allocator. Poisoning all that memory lengthens boot time, especially on + * systems with large amount of RAM. This flag is used to skip that poisoning. + * Assuming that there are no references to those newly exposed pages before + * they are ever allocated, this has little effect on KASAN memory tracking. + * All memory allocated normally after boot gets poisoned as usual. + */ +#define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2)) + /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); #define MIN_PERCPU_PAGELIST_FRACTION (8) @@ -384,10 +395,14 @@ static DEFINE_STATIC_KEY_TRUE(deferred_pages); * on-demand allocation and then freed again before the deferred pages * initialization is done, but this is not likely to happen. */ -static inline void kasan_free_nondeferred_pages(struct page *page, int order) +static inline void kasan_free_nondeferred_pages(struct page *page, int order, + fpi_t fpi_flags) { - if (!static_branch_unlikely(&deferred_pages)) - kasan_free_pages(page, order); + if (static_branch_unlikely(&deferred_pages)) + return; + if (fpi_flags & FPI_SKIP_KASAN_POISON) + return; + kasan_free_pages(page, order); } /* Returns true if the struct page for the pfn is uninitialised */ @@ -438,7 +453,13 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn) return false; } #else -#define kasan_free_nondeferred_pages(p, o) kasan_free_pages(p, o) +static inline void kasan_free_nondeferred_pages(struct page *page, int order, + fpi_t fpi_flags) +{ + if (fpi_flags & FPI_SKIP_KASAN_POISON) + return; + kasan_free_pages(page, order); +} static inline bool early_page_uninitialised(unsigned long pfn) { @@ -1216,7 +1237,7 @@ static void kernel_init_free_pages(struct page *page, int numpages) } static __always_inline bool free_pages_prepare(struct page *page, - unsigned int order, bool check_free) + unsigned int order, bool check_free, fpi_t fpi_flags) { int bad = 0; @@ -1290,7 +1311,7 @@ static __always_inline bool free_pages_prepare(struct page *page, debug_pagealloc_unmap_pages(page, 1 << order); - kasan_free_nondeferred_pages(page, order); + kasan_free_nondeferred_pages(page, order, fpi_flags); return true; } @@ -1303,7 +1324,7 @@ static __always_inline bool free_pages_prepare(struct page *page, */ static bool free_pcp_prepare(struct page *page) { - return free_pages_prepare(page, 0, true); + return free_pages_prepare(page, 0, true, FPI_NONE); } static bool bulkfree_pcp_prepare(struct page *page) @@ -1323,9 +1344,9 @@ static bool bulkfree_pcp_prepare(struct page *page) static bool free_pcp_prepare(struct page *page) { if (debug_pagealloc_enabled_static()) - return free_pages_prepare(page, 0, true); + return free_pages_prepare(page, 0, true, FPI_NONE); else - return free_pages_prepare(page, 0, false); + return free_pages_prepare(page, 0, false, FPI_NONE); } static bool bulkfree_pcp_prepare(struct page *page) @@ -1533,7 +1554,7 @@ static void __free_pages_ok(struct page *page, unsigned int order, int migratetype; unsigned long pfn = page_to_pfn(page); - if (!free_pages_prepare(page, order, true)) + if (!free_pages_prepare(page, order, true, fpi_flags)) return; migratetype = get_pfnblock_migratetype(page, pfn); @@ -1570,7 +1591,7 @@ void __free_pages_core(struct page *page, unsigned int order) * Bypass PCP and place fresh pages right to the tail, primarily * relevant for memory onlining. */ - __free_pages_ok(page, order, FPI_TO_TAIL); + __free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON); } #ifdef CONFIG_NEED_MULTIPLE_NODES
During boot, all non-reserved memblock memory is exposed to the buddy allocator. Poisoning all that memory with KASAN lengthens boot time, especially on systems with large amount of RAM. This patch makes page_alloc to not call kasan_free_pages() on all new memory. __free_pages_core() is used when exposing fresh memory during system boot and when onlining memory during hotplug. This patch adds a new FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through free_pages_prepare() from __free_pages_core(). This has little impact on KASAN memory tracking. Assuming that there are no references to newly exposed pages before they are ever allocated, there won't be any intended (but buggy) accesses to that memory that KASAN would normally detect. However, with this patch, KASAN stops detecting wild and large out-of-bounds accesses that happen to land on a fresh memory page that was never allocated. This is taken as an acceptable trade-off. All memory allocated normally when the boot is over keeps getting poisoned as usual. Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d --- mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++++----------- 1 file changed, 32 insertions(+), 11 deletions(-)