Message ID | 20170201124630.6016-4-takahiro.akashi@linaro.org (mailing list archive)
---|---
State | New, archived
On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> are meant to be called around kexec_load() in order to protect
> the memory allocated for the crash dump kernel once it's loaded.
>
> The protection is implemented here by unmapping the region rather than
> making it read-only.
> To make things work correctly, we also have to
> - put the region in an isolated, page-level mapping initially, and
> - move the copying of kexec's control_code_page to machine_kexec_prepare()
>
> Note that page-level mapping is also required to allow for shrinking
> the size of memory, through /sys/kernel/kexec_crash_size, by any
> number of pages.

Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.

Surely we need an arch-specific callback to handle that? e.g.

arch_crash_release_region(unsigned long base, unsigned long size)
{
        /*
         * Ensure the region is part of the linear map before we return
         * it to the OS. We won't unmap this again, so we can use block
         * mappings.
         */
        create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
                           size, PAGE_KERNEL, false);
}

... which we'd call from crash_shrink_memory() before we freed the
reserved pages.

[...]

> +void arch_kexec_unprotect_crashkres(void)
> +{
> +	/*
> +	 * We don't have to make page-level mappings here because
> +	 * the crash dump kernel memory is not allowed to be shrunk
> +	 * once the kernel is loaded.
> +	 */
> +	create_pgd_mapping(&init_mm, crashk_res.start,
> +			__phys_to_virt(crashk_res.start),
> +			resource_size(&crashk_res), PAGE_KERNEL,
> +			debug_pagealloc_enabled());
> +
> +	flush_tlb_all();
> +}

We can lose the flush_tlb_all() here; TLBs aren't allowed to cache an
invalid entry, so there's nothing to remove from the TLBs.

[...]

> @@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
>  		if (memblock_is_nomap(reg))
>  			continue;
>
> +#ifdef CONFIG_KEXEC_CORE
> +		/*
> +		 * While crash dump kernel memory is contained in a single
> +		 * memblock for now, it should appear in an isolated mapping
> +		 * so that we can independently unmap the region later.
> +		 */
> +		if (crashk_res.end &&
> +		    (start <= crashk_res.start) &&
> +		    ((crashk_res.end + 1) < end)) {
> +			if (crashk_res.start != start)
> +				__map_memblock(pgd, start, crashk_res.start);
> +
> +			if ((crashk_res.end + 1) < end)
> +				__map_memblock(pgd, crashk_res.end + 1, end);
> +
> +			continue;
> +		}
> +#endif

This wasn't quite what I had in mind. I had expected that here we would
isolate the ranges we wanted to avoid mapping (with a comment as to why
we couldn't move the memblock_isolate_range() calls earlier). In
__map_memblock(), we'd skip those ranges entirely.

I believe the above isn't correct if we have a single memblock.memory
region covering both the crashkernel and kernel regions. In that case,
we'd erroneously map the portion which overlaps the kernel.

It seems there are a number of subtle problems here. :/

Thanks,
Mark.
On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > are meant to be called around kexec_load() in order to protect
> > the memory allocated for the crash dump kernel once it's loaded.
> >
> > The protection is implemented here by unmapping the region rather than
> > making it read-only.
> > To make things work correctly, we also have to
> > - put the region in an isolated, page-level mapping initially, and
> > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> >
> > Note that page-level mapping is also required to allow for shrinking
> > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > number of pages.
>
> Looking at kexec_crash_size_store(), I don't see where memory returned
> to the OS is mapped. AFAICT, if the region is protected when the user
> shrinks the region, the memory will not be mapped, yet handed over to
> the kernel for general allocation.
>
> Surely we need an arch-specific callback to handle that? e.g.
>
> arch_crash_release_region(unsigned long base, unsigned long size)
> {
>         /*
>          * Ensure the region is part of the linear map before we return
>          * it to the OS. We won't unmap this again, so we can use block
>          * mappings.
>          */
>         create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
>                            size, PAGE_KERNEL, false);
> }
>
> ... which we'd call from crash_shrink_memory() before we freed the
> reserved pages.

Another question is (how) does hyp map this region?

Thanks,
Mark.
On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > are meant to be called around kexec_load() in order to protect
> > the memory allocated for the crash dump kernel once it's loaded.
> >
> > The protection is implemented here by unmapping the region rather than
> > making it read-only.
> > To make things work correctly, we also have to
> > - put the region in an isolated, page-level mapping initially, and
> > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> >
> > Note that page-level mapping is also required to allow for shrinking
> > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > number of pages.
>
> Looking at kexec_crash_size_store(), I don't see where memory returned
> to the OS is mapped. AFAICT, if the region is protected when the user
> shrinks the region, the memory will not be mapped, yet handed over to
> the kernel for general allocation.

The region is protected only when the crash dump kernel is loaded,
and after that, we are no longer able to shrink the region.

> Surely we need an arch-specific callback to handle that? e.g.
>
> arch_crash_release_region(unsigned long base, unsigned long size)
> {
>         /*
>          * Ensure the region is part of the linear map before we return
>          * it to the OS. We won't unmap this again, so we can use block
>          * mappings.
>          */
>         create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
>                            size, PAGE_KERNEL, false);
> }
>
> ... which we'd call from crash_shrink_memory() before we freed the
> reserved pages.

All the memory is mapped by my map_crashkernel() at boot time.

> [...]
>
> > +void arch_kexec_unprotect_crashkres(void)
> > +{
> > +	/*
> > +	 * We don't have to make page-level mappings here because
> > +	 * the crash dump kernel memory is not allowed to be shrunk
> > +	 * once the kernel is loaded.
> > +	 */
> > +	create_pgd_mapping(&init_mm, crashk_res.start,
> > +			__phys_to_virt(crashk_res.start),
> > +			resource_size(&crashk_res), PAGE_KERNEL,
> > +			debug_pagealloc_enabled());
> > +
> > +	flush_tlb_all();
> > +}
>
> We can lose the flush_tlb_all() here; TLBs aren't allowed to cache an
> invalid entry, so there's nothing to remove from the TLBs.

Ah, yes!

> [...]
>
> > @@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
> >  		if (memblock_is_nomap(reg))
> >  			continue;
> >
> > +#ifdef CONFIG_KEXEC_CORE
> > +		/*
> > +		 * While crash dump kernel memory is contained in a single
> > +		 * memblock for now, it should appear in an isolated mapping
> > +		 * so that we can independently unmap the region later.
> > +		 */
> > +		if (crashk_res.end &&
> > +		    (start <= crashk_res.start) &&
> > +		    ((crashk_res.end + 1) < end)) {
> > +			if (crashk_res.start != start)
> > +				__map_memblock(pgd, start, crashk_res.start);
> > +
> > +			if ((crashk_res.end + 1) < end)
> > +				__map_memblock(pgd, crashk_res.end + 1, end);
> > +
> > +			continue;
> > +		}
> > +#endif
>
> This wasn't quite what I had in mind. I had expected that here we would
> isolate the ranges we wanted to avoid mapping (with a comment as to why
> we couldn't move the memblock_isolate_range() calls earlier). In
> __map_memblock(), we'd skip those ranges entirely.
>
> I believe the above isn't correct if we have a single memblock.memory
> region covering both the crashkernel and kernel regions. In that case,
> we'd erroneously map the portion which overlaps the kernel.
>
> It seems there are a number of subtle problems here. :/

I didn't see any problems, but I will go back to memblock_isolate_range()
here in map_mem().

Thanks,
-Takahiro AKASHI

> Thanks,
> Mark.
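For reference, a minimal sketch of what the memblock_isolate_range() approach in map_mem() might look like is below. This is an illustration of the idea rather than the eventual patch: memblock_isolate_range() is currently static to mm/memblock.c and would have to be exposed first, and the start_rgn/end_rgn outputs are unused here.

#ifdef CONFIG_KEXEC_CORE
        /*
         * Split the crashkernel region out into its own memblock
         * regions, so that the mapping loop below maps it in isolation
         * and we can unmap it independently later. This is done here,
         * immediately before the loop, because later memblock
         * operations may merge adjacent regions and undo an earlier
         * isolation.
         */
        if (crashk_res.end) {
                int start_rgn, end_rgn;

                memblock_isolate_range(&memblock.memory, crashk_res.start,
                                       resource_size(&crashk_res),
                                       &start_rgn, &end_rgn);
        }
#endif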
Mark,

On Wed, Feb 01, 2017 at 06:25:06PM +0000, Mark Rutland wrote:
> On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > > are meant to be called around kexec_load() in order to protect
> > > the memory allocated for the crash dump kernel once it's loaded.
> > >
> > > The protection is implemented here by unmapping the region rather than
> > > making it read-only.
> > > To make things work correctly, we also have to
> > > - put the region in an isolated, page-level mapping initially, and
> > > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> > >
> > > Note that page-level mapping is also required to allow for shrinking
> > > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > > number of pages.
> >
> > Looking at kexec_crash_size_store(), I don't see where memory returned
> > to the OS is mapped. AFAICT, if the region is protected when the user
> > shrinks the region, the memory will not be mapped, yet handed over to
> > the kernel for general allocation.
> >
> > Surely we need an arch-specific callback to handle that? e.g.
> >
> > arch_crash_release_region(unsigned long base, unsigned long size)
> > {
> >         /*
> >          * Ensure the region is part of the linear map before we return
> >          * it to the OS. We won't unmap this again, so we can use block
> >          * mappings.
> >          */
> >         create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
> >                            size, PAGE_KERNEL, false);
> > }
> >
> > ... which we'd call from crash_shrink_memory() before we freed the
> > reserved pages.
>
> Another question is (how) does hyp map this region?

I don't get your point here.
Hyp mode cares only about physical memory at the intermediate physical
address level, doesn't it?

If this is not an issue for now, I will post v32 tomorrow :)

-Takahiro AKASHI

> Thanks,
> Mark.
Hi Akashi, Mark,

On 01/02/17 18:25, Mark Rutland wrote:
> On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
>> On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
>>> arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
>>> are meant to be called around kexec_load() in order to protect
>>> the memory allocated for the crash dump kernel once it's loaded.
>>>
>>> The protection is implemented here by unmapping the region rather than
>>> making it read-only.
>>> To make things work correctly, we also have to
>>> - put the region in an isolated, page-level mapping initially, and
>>> - move the copying of kexec's control_code_page to machine_kexec_prepare()
>>>
>>> Note that page-level mapping is also required to allow for shrinking
>>> the size of memory, through /sys/kernel/kexec_crash_size, by any
>>> number of pages.

>> Looking at kexec_crash_size_store(), I don't see where memory returned
>> to the OS is mapped. AFAICT, if the region is protected when the user
>> shrinks the region, the memory will not be mapped, yet handed over to
>> the kernel for general allocation.

kernel/kexec_core.c:crash_shrink_memory() will bail out:
> 	if (kexec_crash_image) {
> 		ret = -ENOENT;
> 		goto unlock;
> 	}

So it should only be possible to return memory to the allocators when there is
no crash image loaded, so the area is mapped.

What happens when we unload the crash image? It looks like an unload is a call
to do_kexec_load with nr_segments == 0, and do_kexec_load() has:
> 	if (flags & KEXEC_ON_CRASH) {
> 		dest_image = &kexec_crash_image;
> 		if (kexec_crash_image)
> 			arch_kexec_unprotect_crashkres();

So we unprotect the region when we unload the kernel, causing it to be remapped.
Provided the load/protect and {load,unload}/unprotect are kept in sync, I think
this is safe.


Given the core code can unload a crash image, this hunk has me worried:
+void arch_kexec_unprotect_crashkres(void)
+{
+	/*
+	 * We don't have to make page-level mappings here because
+	 * the crash dump kernel memory is not allowed to be shrunk
+	 * once the kernel is loaded.
+	 */
+	create_pgd_mapping(&init_mm, crashk_res.start,
+			__phys_to_virt(crashk_res.start),
+			resource_size(&crashk_res), PAGE_KERNEL,
+			debug_pagealloc_enabled());


I don't think this is true if the order is: load -> protect, unload ->
unprotect, shrink. The shrink will happen with potentially non-page-size
mappings.


Thanks,

James
Hi,

On Thu, Feb 02, 2017 at 07:31:30PM +0900, AKASHI Takahiro wrote:
> On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > > are meant to be called around kexec_load() in order to protect
> > > the memory allocated for the crash dump kernel once it's loaded.
> > >
> > > The protection is implemented here by unmapping the region rather than
> > > making it read-only.
> > > To make things work correctly, we also have to
> > > - put the region in an isolated, page-level mapping initially, and
> > > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> > >
> > > Note that page-level mapping is also required to allow for shrinking
> > > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > > number of pages.
> >
> > Looking at kexec_crash_size_store(), I don't see where memory returned
> > to the OS is mapped. AFAICT, if the region is protected when the user
> > shrinks the region, the memory will not be mapped, yet handed over to
> > the kernel for general allocation.
>
> The region is protected only when the crash dump kernel is loaded,
> and after that, we are no longer able to shrink the region.

Ah, sorry. My misunderstanding strikes again. That should be fine; sorry
for the noise, and thanks for explaining.

> > > @@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
> > >  		if (memblock_is_nomap(reg))
> > >  			continue;
> > >
> > > +#ifdef CONFIG_KEXEC_CORE
> > > +		/*
> > > +		 * While crash dump kernel memory is contained in a single
> > > +		 * memblock for now, it should appear in an isolated mapping
> > > +		 * so that we can independently unmap the region later.
> > > +		 */
> > > +		if (crashk_res.end &&
> > > +		    (start <= crashk_res.start) &&
> > > +		    ((crashk_res.end + 1) < end)) {
> > > +			if (crashk_res.start != start)
> > > +				__map_memblock(pgd, start, crashk_res.start);
> > > +
> > > +			if ((crashk_res.end + 1) < end)
> > > +				__map_memblock(pgd, crashk_res.end + 1, end);
> > > +
> > > +			continue;
> > > +		}
> > > +#endif
> >
> > This wasn't quite what I had in mind. I had expected that here we would
> > isolate the ranges we wanted to avoid mapping (with a comment as to why
> > we couldn't move the memblock_isolate_range() calls earlier). In
> > __map_memblock(), we'd skip those ranges entirely.
> >
> > I believe the above isn't correct if we have a single memblock.memory
> > region covering both the crashkernel and kernel regions. In that case,
> > we'd erroneously map the portion which overlaps the kernel.
> >
> > It seems there are a number of subtle problems here. :/
>
> I didn't see any problems, but I will go back to memblock_isolate_range()
> here in map_mem().

Imagine we have physical memory:

single RAM bank:   |---------------------------------------------------|
kernel image:              |---|
crashkernel:                                      |------|

... we reserve the image and crashkernel region, but these would still
remain part of the memory memblock, and we'd have a memblock layout
like:

memblock.memory:   |---------------------------------------------------|
memblock.reserved:         |---|                  |------|

... in map_mem() we iterate over memblock.memory, so we only have a
single entry to handle in this case. With the code above, we'd find that
it overlaps the crashk_res, and we'd map the parts which don't overlap,
e.g.

memblock.memory:   |---------------------------------------------------|
crashkernel:                                      |------|
mapped regions:    |-----------------------------|        |------------|

... however, this means we've mapped the portion which overlaps with the
kernel's linear alias (i.e. the case that we try to handle in
__map_memblock()). What we actually wanted was:

memblock.memory:   |---------------------------------------------------|
kernel image:              |---|
crashkernel:                                      |------|
mapped regions:    |------|     |----------------|        |------------|

To handle all cases I think we have to isolate *both* the image and
crashkernel in map_mem(). That would leave us with:

memblock.memory:   |------||---||----------------||------||------------|
memblock.reserved:         |---|                  |------|

... so then we can check for overlap with either the kernel or
crashkernel in __map_memblock(), and return early, e.g.

__map_memblock(...)
{
	if (overlaps_with_kernel(...))
		return;
	if (overlaps_with_crashkernel(...))
		return;

	__create_pgd_mapping(...);
}

We can pull the kernel alias mapping out of __map_memblock() and put it
at the end of map_mem().

Does that make sense?

Thanks,
Mark.
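To make the shape of that suggestion concrete, here is one possible fleshed-out form of the skip-based __map_memblock(). This is a sketch, not code from the series: range_overlaps() is an illustrative helper (the overlaps_with_kernel()/overlaps_with_crashkernel() names above are likewise placeholders, not existing functions).

/* Illustrative helper: does [start, end) intersect [r_start, r_end)? */
static bool __init range_overlaps(phys_addr_t start, phys_addr_t end,
                                  phys_addr_t r_start, phys_addr_t r_end)
{
        return start < r_end && r_start < end;
}

static void __init __map_memblock(pgd_t *pgd, phys_addr_t start,
                                  phys_addr_t end)
{
        /*
         * The kernel image and crashkernel regions have been isolated
         * into their own memblock regions, so any region overlapping
         * either of them covers it exactly and can be skipped whole.
         */
        if (range_overlaps(start, end, __pa(_text), __pa(__init_begin)))
                return;

        if (crashk_res.end &&
            range_overlaps(start, end, crashk_res.start, crashk_res.end + 1))
                return;

        __create_pgd_mapping(pgd, start, __phys_to_virt(start), end - start,
                             PAGE_KERNEL, early_pgtable_alloc,
                             debug_pagealloc_enabled());
}

The read-only linear alias of the kernel image and the page-granular crashkernel mapping would then be created explicitly at the end of map_mem(), as the email suggests.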
James,

On Thu, Feb 02, 2017 at 10:45:58AM +0000, James Morse wrote:
> Hi Akashi, Mark,
>
> On 01/02/17 18:25, Mark Rutland wrote:
> > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> >> On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> >>> arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> >>> are meant to be called around kexec_load() in order to protect
> >>> the memory allocated for the crash dump kernel once it's loaded.
> >>>
> >>> The protection is implemented here by unmapping the region rather than
> >>> making it read-only.
> >>> To make things work correctly, we also have to
> >>> - put the region in an isolated, page-level mapping initially, and
> >>> - move the copying of kexec's control_code_page to machine_kexec_prepare()
> >>>
> >>> Note that page-level mapping is also required to allow for shrinking
> >>> the size of memory, through /sys/kernel/kexec_crash_size, by any
> >>> number of pages.
>
> >> Looking at kexec_crash_size_store(), I don't see where memory returned
> >> to the OS is mapped. AFAICT, if the region is protected when the user
> >> shrinks the region, the memory will not be mapped, yet handed over to
> >> the kernel for general allocation.
>
> kernel/kexec_core.c:crash_shrink_memory() will bail out:
> > 	if (kexec_crash_image) {
> > 		ret = -ENOENT;
> > 		goto unlock;
> > 	}
>
> So it should only be possible to return memory to the allocators when there is
> no crash image loaded, so the area is mapped.
>
> What happens when we unload the crash image? It looks like an unload is a call

Thank you for this heads-up! I'd almost forgotten this feature.

> to do_kexec_load with nr_segments == 0, and do_kexec_load() has:
> > 	if (flags & KEXEC_ON_CRASH) {
> > 		dest_image = &kexec_crash_image;
> > 		if (kexec_crash_image)
> > 			arch_kexec_unprotect_crashkres();
>
> So we unprotect the region when we unload the kernel, causing it to be remapped.
> Provided the load/protect and {load,unload}/unprotect are kept in sync, I think
> this is safe.
>
>
> Given the core code can unload a crash image, this hunk has me worried:
> +void arch_kexec_unprotect_crashkres(void)
> +{
> +	/*
> +	 * We don't have to make page-level mappings here because
> +	 * the crash dump kernel memory is not allowed to be shrunk
> +	 * once the kernel is loaded.
> +	 */
> +	create_pgd_mapping(&init_mm, crashk_res.start,
> +			__phys_to_virt(crashk_res.start),
> +			resource_size(&crashk_res), PAGE_KERNEL,
> +			debug_pagealloc_enabled());
>
>
> I don't think this is true if the order is: load -> protect, unload ->
> unprotect, shrink. The shrink will happen with potentially non-page-size
> mappings.

So we have to always use page mappings here.

-Takahiro AKASHI

> Thanks,
>
> James
Hi,

On Thu, Feb 02, 2017 at 10:45:58AM +0000, James Morse wrote:
> On 01/02/17 18:25, Mark Rutland wrote:
> > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> >> On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> >>> arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> >>> are meant to be called around kexec_load() in order to protect
> >>> the memory allocated for the crash dump kernel once it's loaded.
> >>>
> >>> The protection is implemented here by unmapping the region rather than
> >>> making it read-only.
> >>> To make things work correctly, we also have to
> >>> - put the region in an isolated, page-level mapping initially, and
> >>> - move the copying of kexec's control_code_page to machine_kexec_prepare()
> >>>
> >>> Note that page-level mapping is also required to allow for shrinking
> >>> the size of memory, through /sys/kernel/kexec_crash_size, by any
> >>> number of pages.
>
> >> Looking at kexec_crash_size_store(), I don't see where memory returned
> >> to the OS is mapped. AFAICT, if the region is protected when the user
> >> shrinks the region, the memory will not be mapped, yet handed over to
> >> the kernel for general allocation.
>
> kernel/kexec_core.c:crash_shrink_memory() will bail out:
> > 	if (kexec_crash_image) {
> > 		ret = -ENOENT;
> > 		goto unlock;
> > 	}
>
> So it should only be possible to return memory to the allocators when there is
> no crash image loaded, so the area is mapped.

Ah. That being the case, there should be no problem. Any part given
back to the linear map will be at page granularity, but that won't
result in any functional problem.

> Given the core code can unload a crash image, this hunk has me worried:
> +void arch_kexec_unprotect_crashkres(void)
> +{
> +	/*
> +	 * We don't have to make page-level mappings here because
> +	 * the crash dump kernel memory is not allowed to be shrunk
> +	 * once the kernel is loaded.
> +	 */
> +	create_pgd_mapping(&init_mm, crashk_res.start,
> +			__phys_to_virt(crashk_res.start),
> +			resource_size(&crashk_res), PAGE_KERNEL,
> +			debug_pagealloc_enabled());
>
>
> I don't think this is true if the order is: load -> protect, unload ->
> unprotect, shrink. The shrink will happen with potentially non-page-size
> mappings.

Let's consistently use page mappings for the crashkernel region.

Thanks,
Mark.
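Concretely, that fix would mean the unprotect path always requests page-granular mappings instead of keying off debug_pagealloc_enabled(). A sketch of how the hunk might then look (per the earlier review comment, the flush_tlb_all() is also dropped, since only valid entries are being installed); note the final upstream code took a different route, so this is illustrative only:

void arch_kexec_unprotect_crashkres(void)
{
        /*
         * Always use page-level mappings: after an unload (which calls
         * arch_kexec_unprotect_crashkres()), the region may be shrunk
         * via /sys/kernel/kexec_crash_size by a number of pages, which
         * is only safe if no block mappings span the region.
         */
        create_pgd_mapping(&init_mm, crashk_res.start,
                           __phys_to_virt(crashk_res.start),
                           resource_size(&crashk_res), PAGE_KERNEL,
                           true);
}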
On Thu, Feb 02, 2017 at 07:39:06PM +0900, AKASHI Takahiro wrote:
> Mark,
>
> On Wed, Feb 01, 2017 at 06:25:06PM +0000, Mark Rutland wrote:
> > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > > On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > > > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > > > are meant to be called around kexec_load() in order to protect
> > > > the memory allocated for the crash dump kernel once it's loaded.
> > > >
> > > > The protection is implemented here by unmapping the region rather than
> > > > making it read-only.
> > > > To make things work correctly, we also have to
> > > > - put the region in an isolated, page-level mapping initially, and
> > > > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> > > >
> > > > Note that page-level mapping is also required to allow for shrinking
> > > > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > > > number of pages.
> > >
> > > Looking at kexec_crash_size_store(), I don't see where memory returned
> > > to the OS is mapped. AFAICT, if the region is protected when the user
> > > shrinks the region, the memory will not be mapped, yet handed over to
> > > the kernel for general allocation.
> > >
> > > Surely we need an arch-specific callback to handle that? e.g.
> > >
> > > arch_crash_release_region(unsigned long base, unsigned long size)
> > > {
> > >         /*
> > >          * Ensure the region is part of the linear map before we return
> > >          * it to the OS. We won't unmap this again, so we can use block
> > >          * mappings.
> > >          */
> > >         create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
> > >                            size, PAGE_KERNEL, false);
> > > }
> > >
> > > ... which we'd call from crash_shrink_memory() before we freed the
> > > reserved pages.
> >
> > Another question is (how) does hyp map this region?
>
> I don't get your point here.
> Hyp mode cares only about physical memory at the intermediate physical
> address level, doesn't it?

My concern was that hyp may map the region; and thus buggy code at hyp
can corrupt the region (and/or hyp may conflict w.r.t. attributes).

We might have to ensure hyp doesn't map the crashkernel region, and to
ease our pain, disallow freeing of any part of the region.

I'll dig into this.

Thanks,
Mark.

> If this is not an issue for now, I will post v32 tomorrow :)
>
> -Takahiro AKASHI
>
> > Thanks,
> > Mark.
On Thu, Feb 02, 2017 at 11:16:37AM +0000, Mark Rutland wrote:
> Hi,
>
> On Thu, Feb 02, 2017 at 07:31:30PM +0900, AKASHI Takahiro wrote:
> > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > > On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > > > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > > > are meant to be called around kexec_load() in order to protect
> > > > the memory allocated for the crash dump kernel once it's loaded.
> > > >
> > > > The protection is implemented here by unmapping the region rather than
> > > > making it read-only.
> > > > To make things work correctly, we also have to
> > > > - put the region in an isolated, page-level mapping initially, and
> > > > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> > > >
> > > > Note that page-level mapping is also required to allow for shrinking
> > > > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > > > number of pages.
> > >
> > > Looking at kexec_crash_size_store(), I don't see where memory returned
> > > to the OS is mapped. AFAICT, if the region is protected when the user
> > > shrinks the region, the memory will not be mapped, yet handed over to
> > > the kernel for general allocation.
> >
> > The region is protected only when the crash dump kernel is loaded,
> > and after that, we are no longer able to shrink the region.
>
> Ah, sorry. My misunderstanding strikes again. That should be fine; sorry
> for the noise, and thanks for explaining.
>
> > > > > @@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
> > > > >  		if (memblock_is_nomap(reg))
> > > > >  			continue;
> > > > >
> > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > +		/*
> > > > > +		 * While crash dump kernel memory is contained in a single
> > > > > +		 * memblock for now, it should appear in an isolated mapping
> > > > > +		 * so that we can independently unmap the region later.
> > > > > +		 */
> > > > > +		if (crashk_res.end &&
> > > > > +		    (start <= crashk_res.start) &&
> > > > > +		    ((crashk_res.end + 1) < end)) {
> > > > > +			if (crashk_res.start != start)
> > > > > +				__map_memblock(pgd, start, crashk_res.start);
> > > > > +
> > > > > +			if ((crashk_res.end + 1) < end)
> > > > > +				__map_memblock(pgd, crashk_res.end + 1, end);
> > > > > +
> > > > > +			continue;
> > > > > +		}
> > > > > +#endif
> > >
> > > This wasn't quite what I had in mind. I had expected that here we would
> > > isolate the ranges we wanted to avoid mapping (with a comment as to why
> > > we couldn't move the memblock_isolate_range() calls earlier). In
> > > __map_memblock(), we'd skip those ranges entirely.
> > >
> > > I believe the above isn't correct if we have a single memblock.memory
> > > region covering both the crashkernel and kernel regions. In that case,
> > > we'd erroneously map the portion which overlaps the kernel.
> > >
> > > It seems there are a number of subtle problems here. :/
> >
> > I didn't see any problems, but I will go back to memblock_isolate_range()
> > here in map_mem().
>
> Imagine we have physical memory:
>
> single RAM bank:   |---------------------------------------------------|
> kernel image:              |---|
> crashkernel:                                      |------|
>
> ... we reserve the image and crashkernel region, but these would still
> remain part of the memory memblock, and we'd have a memblock layout
> like:
>
> memblock.memory:   |---------------------------------------------------|
> memblock.reserved:         |---|                  |------|
>
> ... in map_mem() we iterate over memblock.memory, so we only have a
> single entry to handle in this case. With the code above, we'd find that
> it overlaps the crashk_res, and we'd map the parts which don't overlap,
> e.g.
>
> memblock.memory:   |---------------------------------------------------|
> crashkernel:                                      |------|
> mapped regions:    |-----------------------------|        |------------|

I'm afraid that you might be talking about my v30. The code in v31 was
modified a bit, and now

> ... however, this means we've mapped the portion which overlaps with the
> kernel's linear alias (i.e. the case that we try to handle in
> __map_memblock()). What we actually wanted was:
>
> memblock.memory:   |---------------------------------------------------|
> kernel image:              |---|
> crashkernel:                                      |------|

                   |-----------(A)---------------|        |----(B)-----|

__map_memblock() is called against each of (A) and (B), so I think we
will get

> mapped regions:    |------|     |----------------|        |------------|

this mapping.

> To handle all cases I think we have to isolate *both* the image and
> crashkernel in map_mem(). That would leave us with:
>
> memblock.memory:   |------||---||----------------||------||------------|
> memblock.reserved:         |---|                  |------|
>
> ... so then we can check for overlap with either the kernel or
> crashkernel in __map_memblock(), and return early, e.g.
>
> __map_memblock(...)
> {
> 	if (overlaps_with_kernel(...))
> 		return;
> 	if (overlaps_with_crashkernel(...))
> 		return;
>
> 	__create_pgd_mapping(...);
> }
>
> We can pull the kernel alias mapping out of __map_memblock() and put it
> at the end of map_mem().
>
> Does that make sense?

OK, I now understand what you were expecting.

Thanks,
-Takahiro AKASHI

> Thanks,
> Mark.
On Thu, Feb 02, 2017 at 11:36:04PM +0900, AKASHI Takahiro wrote:
> On Thu, Feb 02, 2017 at 11:16:37AM +0000, Mark Rutland wrote:
> > On Thu, Feb 02, 2017 at 07:31:30PM +0900, AKASHI Takahiro wrote:
> > > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > > > On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > Imagine we have physical memory:
> >
> > single RAM bank:   |---------------------------------------------------|
> > kernel image:              |---|
> > crashkernel:                                      |------|
> >
> > ... we reserve the image and crashkernel region, but these would still
> > remain part of the memory memblock, and we'd have a memblock layout
> > like:
> >
> > memblock.memory:   |---------------------------------------------------|
> > memblock.reserved:         |---|                  |------|
> >
> > ... in map_mem() we iterate over memblock.memory, so we only have a
> > single entry to handle in this case. With the code above, we'd find that
> > it overlaps the crashk_res, and we'd map the parts which don't overlap,
> > e.g.
> >
> > memblock.memory:   |---------------------------------------------------|
> > crashkernel:                                      |------|
> > mapped regions:    |-----------------------------|        |------------|
>
> I'm afraid that you might be talking about my v30. The code in v31 was
> modified a bit, and now
>
> > ... however, this means we've mapped the portion which overlaps with the
> > kernel's linear alias (i.e. the case that we try to handle in
> > __map_memblock()). What we actually wanted was:
> >
> > memblock.memory:   |---------------------------------------------------|
> > kernel image:              |---|
> > crashkernel:                                      |------|
>
>                    |-----------(A)---------------|        |----(B)-----|
>
> __map_memblock() is called against each of (A) and (B), so I think we
> will get
>
> > mapped regions:    |------|     |----------------|        |------------|
>
> this mapping.

Ah, I see. You are correct; this will work. Clearly I needed more
coffee before considering this. :/

> > To handle all cases I think we have to isolate *both* the image and
> > crashkernel in map_mem(). That would leave us with:
> >
> > memblock.memory:   |------||---||----------------||------||------------|
> > memblock.reserved:         |---|                  |------|
> >
> > ... so then we can check for overlap with either the kernel or
> > crashkernel in __map_memblock(), and return early, e.g.
> >
> > __map_memblock(...)
> > {
> > 	if (overlaps_with_kernel(...))
> > 		return;
> > 	if (overlaps_with_crashkernel(...))
> > 		return;
> >
> > 	__create_pgd_mapping(...);
> > }
> >
> > We can pull the kernel alias mapping out of __map_memblock() and put it
> > at the end of map_mem().
> >
> > Does that make sense?
>
> OK, I now understand what you were expecting.

Thinking about it, we can do this consistently in one place if we use
nomap (temporarily), which would allow us to simplify the existing
code. We'd have to introduce memblock_clear_nomap(), but that's less of
an internal detail than memblock_isolate_range(), so that seems
reasonable.

I think we can handle this like:

/*
 * Create a linear mapping for all memblock memory regions which are
 * mappable.
 */
static void __init __map_mem(pgd_t *pgd)
{
	struct memblock_region *reg;

	/* map all the memory banks */
	for_each_memblock(memory, reg) {
		phys_addr_t start = reg->base;
		phys_addr_t end = start + reg->size;

		if (start >= end)
			break;
		if (memblock_is_nomap(reg))
			continue;

		__create_pgd_mapping(pgd, start, __phys_to_virt(start),
				     end - start, PAGE_KERNEL,
				     early_pgtable_alloc,
				     debug_pagealloc_enabled());
	}
}

/*
 * Create a linear mapping for mappable memory, handling regions which
 * have special mapping requirements.
 */
static void __init map_mem(pgd_t *pgd)
{
	unsigned long kernel_start = __pa(_text);
	unsigned long kernel_end = __pa(__init_begin);

	memblock_mark_nomap(kernel_start, kernel_end);

	// mark crashkernel nomap here

	__map_mem(pgd);

	/*
	 * Map the linear alias of the [_text, __init_begin) interval as
	 * read-only/non-executable. This makes the contents of the
	 * region accessible to subsystems such as hibernate, but
	 * protects it from inadvertent modification or execution.
	 */
	__create_pgd_mapping(pgd, kernel_start, __phys_to_virt(kernel_start),
			     kernel_end - kernel_start, PAGE_KERNEL_RO,
			     early_pgtable_alloc, debug_pagealloc_enabled());
	memblock_clear_nomap(kernel_start, kernel_end);

	// map crashkernel here
	// clear nomap for crashkernel
}

Thoughts?

Thanks,
Mark.
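As a side note on the proposed helper: memblock_mark_nomap() already exists, and a memblock_clear_nomap() would plausibly be its one-line mirror image. A sketch, assuming memblock's internal memblock_setclr_flag() helper (used by mark_nomap today) keeps its current form; it would live in mm/memblock.c:

/**
 * memblock_clear_nomap - Clear flag MEMBLOCK_NOMAP for a specified region.
 * @base: the base phys addr of the region
 * @size: the size of the region
 *
 * Return 0 on success, -errno on failure.
 */
int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
{
        return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
}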
Mark,

On Thu, Feb 02, 2017 at 11:54:25AM +0000, Mark Rutland wrote:
> On Thu, Feb 02, 2017 at 07:39:06PM +0900, AKASHI Takahiro wrote:
> > Mark,
> >
> > On Wed, Feb 01, 2017 at 06:25:06PM +0000, Mark Rutland wrote:
> > > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > > > On Wed, Feb 01, 2017 at 09:46:24PM +0900, AKASHI Takahiro wrote:
> > > > > arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
> > > > > are meant to be called around kexec_load() in order to protect
> > > > > the memory allocated for the crash dump kernel once it's loaded.
> > > > >
> > > > > The protection is implemented here by unmapping the region rather than
> > > > > making it read-only.
> > > > > To make things work correctly, we also have to
> > > > > - put the region in an isolated, page-level mapping initially, and
> > > > > - move the copying of kexec's control_code_page to machine_kexec_prepare()
> > > > >
> > > > > Note that page-level mapping is also required to allow for shrinking
> > > > > the size of memory, through /sys/kernel/kexec_crash_size, by any
> > > > > number of pages.
> > > >
> > > > Looking at kexec_crash_size_store(), I don't see where memory returned
> > > > to the OS is mapped. AFAICT, if the region is protected when the user
> > > > shrinks the region, the memory will not be mapped, yet handed over to
> > > > the kernel for general allocation.
> > > >
> > > > Surely we need an arch-specific callback to handle that? e.g.
> > > >
> > > > arch_crash_release_region(unsigned long base, unsigned long size)
> > > > {
> > > >         /*
> > > >          * Ensure the region is part of the linear map before we return
> > > >          * it to the OS. We won't unmap this again, so we can use block
> > > >          * mappings.
> > > >          */
> > > >         create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
> > > >                            size, PAGE_KERNEL, false);
> > > > }
> > > >
> > > > ... which we'd call from crash_shrink_memory() before we freed the
> > > > reserved pages.
> > >
> > > Another question is (how) does hyp map this region?
> >
> > I don't get your point here.
> > Hyp mode cares only about physical memory at the intermediate physical
> > address level, doesn't it?
>
> My concern was that hyp may map the region; and thus buggy code at hyp
> can corrupt the region (and/or hyp may conflict w.r.t. attributes).

Grep'ing create_hyp_mappings() under arch/arm(64)/kvm shows that
we have only a few small regions of memory mapped in hyp mode.
I also confirmed that there is no active mapping for crash dump kernel
memory by checking the MMU tables with the DS-5 debugger.

> We might have to ensure hyp doesn't map the crashkernel region, and to
> ease our pain, disallow freeing of any part of the region.

So I don't believe we need to worry about such a case.

Thanks,
-Takahiro AKASHI

> I'll dig into this.
>
> Thanks,
> Mark.
>
> > If this is not an issue for now, I will post v32 tomorrow :)
> >
> > -Takahiro AKASHI
> >
> > > Thanks,
> > > Mark.
Hi,

On Fri, Feb 03, 2017 at 10:45:39AM +0900, AKASHI Takahiro wrote:
> On Thu, Feb 02, 2017 at 11:54:25AM +0000, Mark Rutland wrote:
> > On Thu, Feb 02, 2017 at 07:39:06PM +0900, AKASHI Takahiro wrote:
> > > On Wed, Feb 01, 2017 at 06:25:06PM +0000, Mark Rutland wrote:
> > > > On Wed, Feb 01, 2017 at 06:00:08PM +0000, Mark Rutland wrote:
> > > > Another question is (how) does hyp map this region?
> > >
> > > I don't get your point here.
> > > Hyp mode cares only about physical memory at the intermediate physical
> > > address level, doesn't it?
> >
> > My concern was that hyp may map the region; and thus buggy code at hyp
> > can corrupt the region (and/or hyp may conflict w.r.t. attributes).
>
> Grep'ing create_hyp_mappings() under arch/arm(64)/kvm shows that
> we have only a few small regions of memory mapped in hyp mode.
> I also confirmed that there is no active mapping for crash dump kernel
> memory by checking the MMU tables with the DS-5 debugger.

Ok, that sounds fine.

I was under the mistaken impression that hyp had a copy of the whole
linear mapping, but it seems it lazily maps that in (at page
granularity) as required. So there should be no alias for the
crashkernel region.

> > We might have to ensure hyp doesn't map the crashkernel region, and to
> > ease our pain, disallow freeing of any part of the region.
>
> So I don't believe we need to worry about such a case.

I agree.

Thanks,
Mark.
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..016f2dd693aa 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -14,6 +14,7 @@
 
 #include <asm/cacheflush.h>
 #include <asm/cpu_ops.h>
+#include <asm/mmu.h>
 #include <asm/mmu_context.h>
 
 #include "cpu-reset.h"
@@ -22,8 +23,6 @@
 extern const unsigned char arm64_relocate_new_kernel[];
 extern const unsigned long arm64_relocate_new_kernel_size;
 
-static unsigned long kimage_start;
-
 /**
  * kexec_image_info - For debugging output.
  */
@@ -64,7 +63,7 @@ void machine_kexec_cleanup(struct kimage *kimage)
  */
 int machine_kexec_prepare(struct kimage *kimage)
 {
-	kimage_start = kimage->start;
+	void *reboot_code_buffer;
 
 	kexec_image_info(kimage);
 
@@ -73,6 +72,21 @@ int machine_kexec_prepare(struct kimage *kimage)
 		return -EBUSY;
 	}
 
+	reboot_code_buffer =
+			phys_to_virt(page_to_phys(kimage->control_code_page));
+
+	/*
+	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
+	 * after the kernel is shut down.
+	 */
+	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
+		arm64_relocate_new_kernel_size);
+
+	/* Flush the reboot_code_buffer in preparation for its execution. */
+	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+	flush_icache_range((uintptr_t)reboot_code_buffer,
+		arm64_relocate_new_kernel_size);
+
 	return 0;
 }
 
@@ -143,7 +157,6 @@ static void kexec_segment_flush(const struct kimage *kimage)
 void machine_kexec(struct kimage *kimage)
 {
 	phys_addr_t reboot_code_buffer_phys;
-	void *reboot_code_buffer;
 
 	/*
 	 * New cpus may have become stuck_in_kernel after we loaded the image.
@@ -151,7 +164,6 @@ void machine_kexec(struct kimage *kimage)
 	BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));
 
 	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
-	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
 
 	kexec_image_info(kimage);
 
@@ -159,31 +171,17 @@ void machine_kexec(struct kimage *kimage)
 		kimage->control_code_page);
 	pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
 		&reboot_code_buffer_phys);
-	pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
-		reboot_code_buffer);
 	pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
 		arm64_relocate_new_kernel);
 	pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
 		__func__, __LINE__, arm64_relocate_new_kernel_size,
 		arm64_relocate_new_kernel_size);
 
-	/*
-	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
-	 * after the kernel is shut down.
-	 */
-	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
-		arm64_relocate_new_kernel_size);
-
-	/* Flush the reboot_code_buffer in preparation for its execution. */
-	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
-	flush_icache_range((uintptr_t)reboot_code_buffer,
-		arm64_relocate_new_kernel_size);
-
 	/* Flush the kimage list and its buffers. */
 	kexec_list_flush(kimage);
 
 	/* Flush the new image if already in place. */
-	if (kimage->head & IND_DONE)
+	if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
 		kexec_segment_flush(kimage);
 
 	pr_info("Bye!\n");
@@ -201,7 +199,7 @@ void machine_kexec(struct kimage *kimage)
 	 */
 
 	cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
-		kimage_start, 0);
+		kimage->start, 0);
 
 	BUG(); /* Should never get here. */
 }
@@ -210,3 +208,28 @@ void machine_crash_shutdown(struct pt_regs *regs)
 {
 	/* Empty routine needed to avoid build errors. */
 }
+
+void arch_kexec_protect_crashkres(void)
+{
+	kexec_segment_flush(kexec_crash_image);
+
+	remove_pgd_mapping(&init_mm, __phys_to_virt(crashk_res.start),
+			resource_size(&crashk_res));
+
+	flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+	/*
+	 * We don't have to make page-level mappings here because
+	 * the crash dump kernel memory is not allowed to be shrunk
+	 * once the kernel is loaded.
+	 */
+	create_pgd_mapping(&init_mm, crashk_res.start,
+			__phys_to_virt(crashk_res.start),
+			resource_size(&crashk_res), PAGE_KERNEL,
+			debug_pagealloc_enabled());
+
+	flush_tlb_all();
+}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9d3cea1db3b4..87861e62316a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,8 @@
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
 #include <linux/libfdt.h>
 #include <linux/mman.h>
 #include <linux/nodemask.h>
@@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
 		if (memblock_is_nomap(reg))
 			continue;
 
+#ifdef CONFIG_KEXEC_CORE
+		/*
+		 * While crash dump kernel memory is contained in a single
+		 * memblock for now, it should appear in an isolated mapping
+		 * so that we can independently unmap the region later.
+		 */
+		if (crashk_res.end &&
+		    (start <= crashk_res.start) &&
+		    ((crashk_res.end + 1) < end)) {
+			if (crashk_res.start != start)
+				__map_memblock(pgd, start, crashk_res.start);
+
+			if ((crashk_res.end + 1) < end)
+				__map_memblock(pgd, crashk_res.end + 1, end);
+
+			continue;
+		}
+#endif
 		__map_memblock(pgd, start, end);
 	}
 }
@@ -623,6 +643,21 @@ static void __init map_kernel(pgd_t *pgd)
 	kasan_copy_shadow(pgd);
 }
 
+#ifdef CONFIG_KEXEC_CORE
+static int __init map_crashkernel(void)
+{
+	/* page-level mapping only to allow for shrinking */
+	if (crashk_res.end)
+		create_pgd_mapping(&init_mm, crashk_res.start,
+				__phys_to_virt(crashk_res.start),
+				resource_size(&crashk_res), PAGE_KERNEL,
+				true);
+
+	return 0;
+}
+subsys_initcall(map_crashkernel);
+#endif
+
 /*
  * paging_init() sets up the page tables, initialises the zone memory
  * maps and sets up the zero page.
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.

The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move the copying of kexec's control_code_page to machine_kexec_prepare()

Note that page-level mapping is also required to allow for shrinking
the size of memory, through /sys/kernel/kexec_crash_size, by any
number of pages.

Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
---
 arch/arm64/kernel/machine_kexec.c | 65 ++++++++++++++++++++++++++-------------
 arch/arm64/mm/mmu.c               | 35 +++++++++++++++++++++
 2 files changed, 79 insertions(+), 21 deletions(-)