Message ID | 20210531122959.23499-3-rppt@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | consolidate "System RAM" resources setup |
On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> +	code_resource.start = __pa_symbol(_text);
> +	code_resource.end = __pa_symbol(_etext)-1;
> +	rodata_resource.start = __pa_symbol(__start_rodata);
> +	rodata_resource.end = __pa_symbol(__end_rodata)-1;
> +	data_resource.start = __pa_symbol(_sdata);
> +	data_resource.end = __pa_symbol(_edata)-1;
> +	bss_resource.start = __pa_symbol(__bss_start);
> +	bss_resource.end = __pa_symbol(__bss_stop)-1;

This falls short on 32-bit ARM. The old code was:

-	kernel_code.start = virt_to_phys(_text);
-	kernel_code.end = virt_to_phys(__init_begin - 1);
-	kernel_data.start = virt_to_phys(_sdata);
-	kernel_data.end = virt_to_phys(_end - 1);

If I look at one of my kernels:

c0008000 T _text
c0b5b000 R __end_rodata
... exception and unwind tables live here ...
c0c00000 T __init_begin
c0e00000 D _sdata
c0e68870 D _edata
c0e68870 B __bss_start
c0e995d4 B __bss_stop
c0e995d4 B _end

So the original covers _text..__init_begin-1 which includes the
exception and unwind tables. Your version above omits these, which
leaves them exposed.
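The coverage argument in the message above can be checked with plain arithmetic on the listed symbol addresses. The following is a minimal userspace sketch, not kernel code: `struct span`, `covers()` and the helper names are hypothetical, and it compares virtual addresses on the assumption that the virtual-to-physical offset is constant across the kernel image. Because `_etext` and `__start_rodata` do not appear in the listing, the new code and rodata resources are modelled generously as one contiguous range ending at `__end_rodata - 1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Virtual addresses copied from the symbol listing in the message above. */
#define SYM_TEXT        UINT32_C(0xc0008000)
#define SYM_END_RODATA  UINT32_C(0xc0b5b000)
#define SYM_INIT_BEGIN  UINT32_C(0xc0c00000)

/* Inclusive range, like struct resource: end is the last byte. */
struct span {
	uint32_t start;
	uint32_t end;
};

/* True if every byte of [first, last] lies inside r. */
static bool covers(struct span r, uint32_t first, uint32_t last)
{
	assert(first <= last);
	return r.start <= first && last <= r.end;
}

/* Old 32-bit ARM "Kernel code": _text .. __init_begin - 1. */
static struct span old_kernel_code(void)
{
	struct span r = { SYM_TEXT, SYM_INIT_BEGIN - 1 };
	return r;
}

/*
 * Assumption: the most the new code + rodata resources can cover is
 * _text .. __end_rodata - 1 (treating the two as contiguous).
 */
static struct span new_code_and_rodata(void)
{
	struct span r = { SYM_TEXT, SYM_END_RODATA - 1 };
	return r;
}

/* Bytes between __end_rodata and __init_begin, where the tables live. */
static uint32_t tables_gap_bytes(void)
{
	return SYM_INIT_BEGIN - SYM_END_RODATA;
}
```

With these addresses the region between `__end_rodata` and `__init_begin` is 0xa5000 bytes (660 KiB); it falls inside the old "Kernel code" range and outside the modelled new ranges.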
On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> > +	code_resource.start = __pa_symbol(_text);
> > +	code_resource.end = __pa_symbol(_etext)-1;
> > +	rodata_resource.start = __pa_symbol(__start_rodata);
> > +	rodata_resource.end = __pa_symbol(__end_rodata)-1;
> > +	data_resource.start = __pa_symbol(_sdata);
> > +	data_resource.end = __pa_symbol(_edata)-1;
> > +	bss_resource.start = __pa_symbol(__bss_start);
> > +	bss_resource.end = __pa_symbol(__bss_stop)-1;
>
> This falls short on 32-bit ARM. The old code was:
>
> -	kernel_code.start = virt_to_phys(_text);
> -	kernel_code.end = virt_to_phys(__init_begin - 1);
> -	kernel_data.start = virt_to_phys(_sdata);
> -	kernel_data.end = virt_to_phys(_end - 1);
>
> If I look at one of my kernels:
>
> c0008000 T _text
> c0b5b000 R __end_rodata
> ... exception and unwind tables live here ...
> c0c00000 T __init_begin
> c0e00000 D _sdata
> c0e68870 D _edata
> c0e68870 B __bss_start
> c0e995d4 B __bss_stop
> c0e995d4 B _end
>
> So the original covers _text..__init_begin-1 which includes the
> exception and unwind tables. Your version above omits these, which
> leaves them exposed.

Right, this needs to be fixed. Is there any reason the exception and unwind
tables cannot be placed between _sdata and _edata?

It seems to me that they were left outside for purely historical reasons.
Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
moved the exception tables out of .data section before _sdata existed.
Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
_etext before the unwind tables and didn't bother to put them into data or
rodata areas.
On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> > > +	code_resource.start = __pa_symbol(_text);
> > > +	code_resource.end = __pa_symbol(_etext)-1;
> > > +	rodata_resource.start = __pa_symbol(__start_rodata);
> > > +	rodata_resource.end = __pa_symbol(__end_rodata)-1;
> > > +	data_resource.start = __pa_symbol(_sdata);
> > > +	data_resource.end = __pa_symbol(_edata)-1;
> > > +	bss_resource.start = __pa_symbol(__bss_start);
> > > +	bss_resource.end = __pa_symbol(__bss_stop)-1;
> >
> > This falls short on 32-bit ARM. The old code was:
> >
> > -	kernel_code.start = virt_to_phys(_text);
> > -	kernel_code.end = virt_to_phys(__init_begin - 1);
> > -	kernel_data.start = virt_to_phys(_sdata);
> > -	kernel_data.end = virt_to_phys(_end - 1);
> >
> > If I look at one of my kernels:
> >
> > c0008000 T _text
> > c0b5b000 R __end_rodata
> > ... exception and unwind tables live here ...
> > c0c00000 T __init_begin
> > c0e00000 D _sdata
> > c0e68870 D _edata
> > c0e68870 B __bss_start
> > c0e995d4 B __bss_stop
> > c0e995d4 B _end
> >
> > So the original covers _text..__init_begin-1 which includes the
> > exception and unwind tables. Your version above omits these, which
> > leaves them exposed.
>
> Right, this needs to be fixed. Is there any reason the exception and unwind
> tables cannot be placed between _sdata and _edata?
>
> It seems to me that they were left outside for purely historical reasons.
> Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> moved the exception tables out of .data section before _sdata existed.
> Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> _etext before the unwind tables and didn't bother to put them into data or
> rodata areas.

You can not assume that all sections will be between these symbols. This
isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
with many other undiscarded sections before __bss_start.

So it seems your assumptions in trying to clean this up are somewhat
false.
On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> > > > +	code_resource.start = __pa_symbol(_text);
> > > > +	code_resource.end = __pa_symbol(_etext)-1;
> > > > +	rodata_resource.start = __pa_symbol(__start_rodata);
> > > > +	rodata_resource.end = __pa_symbol(__end_rodata)-1;
> > > > +	data_resource.start = __pa_symbol(_sdata);
> > > > +	data_resource.end = __pa_symbol(_edata)-1;
> > > > +	bss_resource.start = __pa_symbol(__bss_start);
> > > > +	bss_resource.end = __pa_symbol(__bss_stop)-1;
> > >
> > > This falls short on 32-bit ARM. The old code was:
> > >
> > > -	kernel_code.start = virt_to_phys(_text);
> > > -	kernel_code.end = virt_to_phys(__init_begin - 1);
> > > -	kernel_data.start = virt_to_phys(_sdata);
> > > -	kernel_data.end = virt_to_phys(_end - 1);
> > >
> > > If I look at one of my kernels:
> > >
> > > c0008000 T _text
> > > c0b5b000 R __end_rodata
> > > ... exception and unwind tables live here ...
> > > c0c00000 T __init_begin
> > > c0e00000 D _sdata
> > > c0e68870 D _edata
> > > c0e68870 B __bss_start
> > > c0e995d4 B __bss_stop
> > > c0e995d4 B _end
> > >
> > > So the original covers _text..__init_begin-1 which includes the
> > > exception and unwind tables. Your version above omits these, which
> > > leaves them exposed.
> >
> > Right, this needs to be fixed. Is there any reason the exception and unwind
> > tables cannot be placed between _sdata and _edata?
> >
> > It seems to me that they were left outside for purely historical reasons.
> > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > moved the exception tables out of .data section before _sdata existed.
> > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > _etext before the unwind tables and didn't bother to put them into data or
> > rodata areas.
>
> You can not assume that all sections will be between these symbols. This
> isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> with many other undiscarded sections before __bss_start.

But if you look at x86's setup_arch() all these never make it to the
resource tree. So there are holes in /proc/iomem between the kernel
resources.

> So it seems your assumptions in trying to clean this up are somewhat
> false.

My assumption was that there is complete lack of consistency between what
is reserved memory and how it is reported in /proc/iomem or
/sys/firmware/memmap for that matter. I'm not trying to clean this up, I'm
trying to make different views of the physical memory consistent.
Consolidating several similar per-arch implementations is the first step in
this direction.
On Wed, Jun 02, 2021 at 04:54:17PM +0300, Mike Rapoport wrote:
> On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> > On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > > If I look at one of my kernels:
> > > >
> > > > c0008000 T _text
> > > > c0b5b000 R __end_rodata
> > > > ... exception and unwind tables live here ...
> > > > c0c00000 T __init_begin
> > > > c0e00000 D _sdata
> > > > c0e68870 D _edata
> > > > c0e68870 B __bss_start
> > > > c0e995d4 B __bss_stop
> > > > c0e995d4 B _end
> > > >
> > > > So the original covers _text..__init_begin-1 which includes the
> > > > exception and unwind tables. Your version above omits these, which
> > > > leaves them exposed.
> > >
> > > Right, this needs to be fixed. Is there any reason the exception and unwind
> > > tables cannot be placed between _sdata and _edata?
> > >
> > > It seems to me that they were left outside for purely historical reasons.
> > > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > > moved the exception tables out of .data section before _sdata existed.
> > > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > > _etext before the unwind tables and didn't bother to put them into data or
> > > rodata areas.
> >
> > You can not assume that all sections will be between these symbols. This
> > isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> > will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> > with many other undiscarded sections before __bss_start.
>
> But if you look at x86's setup_arch() all these never make it to the
> resource tree. So there are holes in /proc/iomem between the kernel
> resources.

Also true. However, my point was to counter your claim that these
sections should be part of the .text/.data/.rodata etc sections in the
output vmlinux.

There is, however, a more important point. The __ex_table section
must exist and be separate from the .text/.data/.rodata sections in
the output ELF file, as sorttable (the exception table sorter) relies
on this to be able to find the table and sort it.

So, it isn't entirely "for historical reasons" as you said two messages
ago.

> > So it seems your assumptions in trying to clean this up are somewhat
> > false.
>
> My assumption was that there is complete lack of consistency between what
> is reserved memory and how it is reported in /proc/iomem or
> /sys/firmware/memmap for that matter. I'm not trying to clean this up, I'm
> trying to make different views of the physical memory consistent.
> Consolidating several similar per-arch implementations is the first step in
> this direction.

It looks to me that there is quite a number of things that need fixing.

One glaring thing is the kernel's init memory - should that be counted
as reserved memory? It's marked as such in memblock and /proc/iomem, yet
we free these pages into the page allocator after boot meaning they are
just like any other page in the memory allocator - they are most
certainly not "reserved" at that point.

So, what is reported as reserved in firmware maps will be different
from memblock. Memblock includes kernel boot-time allocations, which
count as "reserved" but are not part of the firmware maps - these will
be for things like early page tables and the struct page array.

So, you're never going to get consistency between memblock and
firmware.

Memblock and /proc/iomem should be fairly consistent - areas marked
as reserved in memblock seem to be propagated into /proc/iomem,
including areas around the kernel image (the resources that you're
changing in your patch.) Here's an example:

/sys/kernel/debug/memblock/reserved:
   1: 0x0000000081210000..0x0000000082d6efff
   2: 0x0000000082d71000..0x0000000082d7ffff

81210000-821cffff : Kernel code
821d0000-8246ffff : reserved
82470000-82d7ffff : Kernel data

This is aarch64, which isn't as accurate as 32-bit ARM in /proc/iomem:

/sys/kernel/debug/memblock/reserved:
   1: 0x0000000040200000..0x0000000040ea1c17
/proc/iomem:
40008000-40bfffff : Kernel code
40e00000-40ea1c17 : Kernel data

32-bit ARM doesn't forward the memblock reserved areas into /proc/iomem
because they are kernel allocations. In the example I show above for
32-bit ARM, there are no firmware reserved regions, yet there are 19
memblock "reserved" regions.

I think part of the problem here is understanding what "reserved" means
in these cases.

For something passed to the kernel from firmware, it's an area that
firmware doesn't want the OS to use.

For memblock, it is those areas plus allocations made early on during
kernel boot before the page allocator is up and running, and includes
areas of memory that these allocations must avoid (e.g. due to
initramfs or device tree temporarily residing there.)

Then there's differences in what should be placed in /proc/iomem.

Now, bear in mind that /proc/iomem is a user API, one which userspace
depends on. If we start going around making /proc/iomem report stuff
like kernel boot time reservations as "reserved" memory, we will end up
breaking the kexec tooling on some platforms. For example, kexec
tooling for 32-bit ARM parses /proc/iomem, looking for "System RAM",
"System RAM (boot alias)" and "reserved" regions.

So, I think changes to make this "more consistent" come with high
risk.
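To make the dependency on the /proc/iomem text format concrete, here is a minimal sketch of how a userspace tool could classify one line by the three names mentioned above. It is not the actual kexec-tools code: `parse_iomem_line()`, `struct mem_range` and `enum mem_type` are hypothetical names, and the sketch assumes the "start-end : name" layout shown in the examples.

```c
#include <stdio.h>
#include <string.h>

enum mem_type { MEM_OTHER, MEM_RAM, MEM_RESERVED };

struct mem_range {
	unsigned long long start;
	unsigned long long end;	/* inclusive, as printed in /proc/iomem */
	enum mem_type type;
};

/*
 * Parse one line of the form "start-end : name" (hex addresses, optional
 * leading indentation). Returns 0 on success, -1 if the line does not match.
 */
static int parse_iomem_line(const char *line, struct mem_range *out)
{
	char name[64];

	if (sscanf(line, " %llx-%llx : %63[^\n]",
		   &out->start, &out->end, name) != 3)
		return -1;

	if (strcmp(name, "System RAM") == 0 ||
	    strcmp(name, "System RAM (boot alias)") == 0)
		out->type = MEM_RAM;
	else if (strcmp(name, "reserved") == 0)
		out->type = MEM_RESERVED;
	else
		out->type = MEM_OTHER;	/* e.g. "Kernel code", "Kernel data" */

	return 0;
}
```

A parser of this shape treats every line named "reserved" the same way, whether it came from firmware or from a kernel boot-time allocation, which is why a change in what the kernel reports under that name is visible to such tools.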
On Wed, Jun 02, 2021 at 04:51:41PM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 04:54:17PM +0300, Mike Rapoport wrote:
> > On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > > > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > > > If I look at one of my kernels:
> > > > >
> > > > > c0008000 T _text
> > > > > c0b5b000 R __end_rodata
> > > > > ... exception and unwind tables live here ...
> > > > > c0c00000 T __init_begin
> > > > > c0e00000 D _sdata
> > > > > c0e68870 D _edata
> > > > > c0e68870 B __bss_start
> > > > > c0e995d4 B __bss_stop
> > > > > c0e995d4 B _end
> > > > >
> > > > > So the original covers _text..__init_begin-1 which includes the
> > > > > exception and unwind tables. Your version above omits these, which
> > > > > leaves them exposed.
> > > >
> > > > Right, this needs to be fixed. Is there any reason the exception and unwind
> > > > tables cannot be placed between _sdata and _edata?
> > > >
> > > > It seems to me that they were left outside for purely historical reasons.
> > > > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > > > moved the exception tables out of .data section before _sdata existed.
> > > > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > > > _etext before the unwind tables and didn't bother to put them into data or
> > > > rodata areas.
> > >
> > > You can not assume that all sections will be between these symbols. This
> > > isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> > > will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> > > with many other undiscarded sections before __bss_start.
> >
> > But if you look at x86's setup_arch() all these never make it to the
> > resource tree. So there are holes in /proc/iomem between the kernel
> > resources.
>
> Also true. However, my point was to counter your claim that these
> sections should be part of the .text/.data/.rodata etc sections in the
> output vmlinux.
>
> There is, however, a more important point. The __ex_table section
> must exist and be separate from the .text/.data/.rodata sections in
> the output ELF file, as sorttable (the exception table sorter) relies
> on this to be able to find the table and sort it.
>
> So, it isn't entirely "for historical reasons" as you said two messages
> ago.

Back then when __ex_table was moved from .data section, _sdata and _edata
were part of the .data section. Today they are not. So something like the
patch below will ensure for instance that __ex_table would be a part of
"Kernel data" in /proc/iomem without moving it to the .data section:

diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index f7f4620d59c3..2991feceab31 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -72,13 +72,6 @@ SECTIONS

 	RO_DATA(PAGE_SIZE)

-	. = ALIGN(4);
-	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
-		__start___ex_table = .;
-		ARM_MMU_KEEP(*(__ex_table))
-		__stop___ex_table = .;
-	}
-
 #ifdef CONFIG_ARM_UNWIND
 	ARM_UNWIND_SECTIONS
 #endif
@@ -143,6 +136,14 @@ SECTIONS
 	__init_end = .;

 	_sdata = .;
+
+	. = ALIGN(4);
+	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
+		__start___ex_table = .;
+		ARM_MMU_KEEP(*(__ex_table))
+		__stop___ex_table = .;
+	}
+
 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
 	_edata = .;

> Now, bear in mind that /proc/iomem is a user API, one which userspace
> depends on. If we start going around making /proc/iomem report stuff
> like kernel boot time reservations as "reserved" memory, we will end up
> breaking the kexec tooling on some platforms. For example, kexec
> tooling for 32-bit ARM parses /proc/iomem, looking for "System RAM",
> "System RAM (boot alias)" and "reserved" regions.
>
> So, I think changes to make this "more consistent" come with high
> risk.

I agree there is a risk but I don't think it's high. It does not look like
the minor changes in "reserved" reporting in /proc/iomem will break kexec
tooling. Anyway the amount of reserved and free memory depends on a
particular system, kernel version, configuration and command line.
I have no intention to report kernel boot time reservations
to /proc/iomem on architectures that do not report them there today,
although this also does not seem like a significant factor.

On the other hand, making /proc/iomem reporting consistent among
architectures will make it possible to reduce the complexity of both the
kernel and kexec tools in the long run.
On Wed, Jun 02, 2021 at 09:43:32PM +0300, Mike Rapoport wrote:
> Back then when __ex_table was moved from .data section, _sdata and _edata
> were part of the .data section. Today they are not. So something like the
> patch below will ensure for instance that __ex_table would be a part of
> "Kernel data" in /proc/iomem without moving it to the .data section:
>
> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
> index f7f4620d59c3..2991feceab31 100644
> --- a/arch/arm/kernel/vmlinux.lds.S
> +++ b/arch/arm/kernel/vmlinux.lds.S
> @@ -72,13 +72,6 @@ SECTIONS
>
>  	RO_DATA(PAGE_SIZE)
>
> -	. = ALIGN(4);
> -	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
> -		__start___ex_table = .;
> -		ARM_MMU_KEEP(*(__ex_table))
> -		__stop___ex_table = .;
> -	}
> -
>  #ifdef CONFIG_ARM_UNWIND
>  	ARM_UNWIND_SECTIONS
>  #endif
> @@ -143,6 +136,14 @@ SECTIONS
>  	__init_end = .;
>
>  	_sdata = .;
> +
> +	. = ALIGN(4);
> +	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
> +		__start___ex_table = .;
> +		ARM_MMU_KEEP(*(__ex_table))
> +		__stop___ex_table = .;
> +	}
> +
>  	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
>  	_edata = .;

This example has undesirable security implications. It moves the
exception table out of the read-only mappings into the read-write
mappings, thereby providing a way for an attacker to bypass the
read-only protection on the kernel and manipulate code pointers at
potentially known addresses for distro built kernels.

> I agree there is a risk but I don't think it's high. It does not look like
> the minor changes in "reserved" reporting in /proc/iomem will break kexec
> tooling.

What makes you come to that conclusion? The kexec tools architecture
backends get to decide what they do when parsing /proc/iomem.

Currently, only firmware areas are marked reserved in /proc/iomem on
32-bit ARM. This is read by kexec, and entered into its memory_range[]
table as either RAM, or RESERVED.

kexec uses this to search for a suitable hole in the memory map to
place the kernel in physical memory. The addition of what I will call
fictitious "reserved" areas by the host kernel because the host kernel
happened to use them _will_ have an impact on this.

They _are_ fictitious, because they are purely an artifact of the host
kernel being run, and are of no consequence to tooling such as kexec.
What such tooling is interested in is which areas it needs to avoid
because of firmware.

I think what isn't helping here is that you haven't adequately described
what your overall objective actually is. Framing it in terms of wanting
the reserved memory to be consistent between the various kernel
"interfaces" such as /proc/iomem, the memblock debugfs and firmware is
very ambiguous and open to different interpretations, which I think is
what the problem is here.

> Anyway the amount of reserved and free memory depends on a
> particular system, kernel version, configuration and command line.
> I have no intention to report kernel boot time reservations
> to /proc/iomem on architectures that do not report them there today,
> although this also does not seem like a significant factor.

You seem to be missing the point I've tried to make. The areas in
memblock that are marked "reserved" are the areas of reserved memory
from the firmware _plus_ the areas that the kernel has made during
boot which are of no consequence to userspace.

Wanting /proc/iomem, memblock and firmware to all agree on the values
that they mark as "reserved" is IMHO unrealistic.
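The effect of extra "reserved" entries on a hole search can be illustrated with a small first-fit sketch. This is only a model of the idea described in the message above, not how kexec-tools actually implements its search: `find_hole()`, `struct span` and `NO_HOLE` are hypothetical, the reserved list is assumed to be sorted and non-overlapping, and the RAM bank and firmware region used in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_HOLE UINT64_MAX

/* Inclusive physical range: end is the last byte. */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * First-fit search: return the lowest address inside ram where `size`
 * bytes fit without touching any reserved span, or NO_HOLE.
 * reserved[] must be sorted by start address and must not overlap.
 */
static uint64_t find_hole(struct span ram, const struct span *reserved,
			  size_t n, uint64_t size)
{
	uint64_t cursor = ram.start;
	size_t i;

	assert(size > 0 && ram.start <= ram.end);

	for (i = 0; i < n; i++) {
		/* Ignore spans entirely below the cursor or above the bank. */
		if (reserved[i].end < cursor || reserved[i].start > ram.end)
			continue;
		/* Is the gap in front of this reserved span large enough? */
		if (reserved[i].start > cursor &&
		    reserved[i].start - cursor >= size)
			return cursor;
		/* Reserved span reaches the end of the bank: nothing left. */
		if (reserved[i].end >= ram.end)
			return NO_HOLE;
		cursor = reserved[i].end + 1;
	}

	/* Check the tail of the bank after the last reserved span. */
	if (ram.end - cursor >= size - 1)
		return cursor;
	return NO_HOLE;
}
```

For a made-up bank 0x80000000-0x8fffffff with a made-up firmware region 0x80000000-0x800fffff, a 32 MiB request lands at 0x80100000. Adding a second entry for 0x81210000-0x82d7ffff (the extent of the kernel image in the aarch64 example earlier in the thread) pushes the same request to 0x82d80000, so the chosen placement changes even though the firmware map did not.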
On Wed, Jun 02, 2021 at 09:15:02PM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 09:43:32PM +0300, Mike Rapoport wrote:
> > Back then when __ex_table was moved from .data section, _sdata and _edata
> > were part of the .data section. Today they are not. So something like the
> > patch below will ensure for instance that __ex_table would be a part of
> > "Kernel data" in /proc/iomem without moving it to the .data section:
> >
> This example has undesirable security implications. It moves the
> exception table out of the read-only mappings into the read-write
> mappings, thereby providing a way for an attacker to bypass the
> read-only protection on the kernel and manipulate code pointers at
> potentially known addresses for distro built kernels.

My point was that __ex_table can be in "Kernel data" or "Kernel rodata"
without losing the ability to sort it.

> You seem to be missing the point I've tried to make. The areas in
> memblock that are marked "reserved" are the areas of reserved memory
> from the firmware _plus_ the areas that the kernel has made during
> boot which are of no consequence to userspace.

I know what areas are marked "reserved" in memblock. I never suggested
reporting "fictitious" reserved areas in /proc/iomem unless an
architecture already reports them there, like arm64 for example.

You are right, I should have described the overall objective better, but
still I feel that we keep missing each other's points.

I'll update the descriptions for the next repost, hopefully it'll help.
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 30430e7c1b03..6f3c82cc0b0d 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -481,80 +481,9 @@ static void __init setup_lowcore_dat_on(void)
 	__ctl_set_bit(0, 28);
 }

-static struct resource code_resource = {
-	.name = "Kernel code",
-	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
-};
-
-static struct resource data_resource = {
-	.name = "Kernel data",
-	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
-};
-
-static struct resource bss_resource = {
-	.name = "Kernel bss",
-	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
-};
-
-static struct resource __initdata *standard_resources[] = {
-	&code_resource,
-	&data_resource,
-	&bss_resource,
-#ifdef CONFIG_CRASH_DUMP
-	&crashk_res,
-#endif
-};
-
 static void __init setup_resources(void)
 {
-	struct resource *res, *std_res, *sub_res;
-	phys_addr_t start, end;
-	int j;
-	u64 i;
-
-	code_resource.start = (unsigned long) _text;
-	code_resource.end = (unsigned long) _etext - 1;
-	data_resource.start = (unsigned long) _etext;
-	data_resource.end = (unsigned long) _edata - 1;
-	bss_resource.start = (unsigned long) __bss_start;
-	bss_resource.end = (unsigned long) __bss_stop - 1;
-
-	for_each_mem_range(i, &start, &end) {
-		res = memblock_alloc(sizeof(*res), 8);
-		if (!res)
-			panic("%s: Failed to allocate %zu bytes align=0x%x\n",
-			      __func__, sizeof(*res), 8);
-		res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
-
-		res->name = "System RAM";
-		res->start = start;
-		/*
-		 * In memblock, end points to the first byte after the
-		 * range while in resourses, end points to the last byte in
-		 * the range.
-		 */
-		res->end = end - 1;
-		request_resource(&iomem_resource, res);
-
-		for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
-			std_res = standard_resources[j];
-			if (!std_res->end || std_res->start < res->start ||
-			    std_res->start > res->end)
-				continue;
-			if (std_res->end > res->end) {
-				sub_res = memblock_alloc(sizeof(*sub_res), 8);
-				if (!sub_res)
-					panic("%s: Failed to allocate %zu bytes align=0x%x\n",
-					      __func__, sizeof(*sub_res), 8);
-				*sub_res = *std_res;
-				sub_res->end = res->end;
-				std_res->start = res->end + 1;
-				request_resource(res, sub_res);
-			} else {
-				request_resource(res, std_res);
-			}
-		}
-	}
+	memblock_setup_resources();
 }

 static void __init setup_ident_map_size(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5984fff3f175..44c29ebae842 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -121,6 +121,8 @@ void memblock_free_all(void);
 void reset_node_managed_pages(pg_data_t *pgdat);
 void reset_all_zones_managed_pages(void);

+void memblock_setup_resources(void);
+
 /* Low level functions */
 void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
 		      struct memblock_type *type_a,
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..504435753259 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,8 @@
 #include <linux/kmemleak.h>
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>

 #include <asm/sections.h>
 #include <linux/io.h>
@@ -2062,6 +2064,91 @@ void __init memblock_free_all(void)
 	totalram_pages_add(pages);
 }

+static struct resource code_resource = {
+	.name = "Kernel code",
+	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource rodata_resource = {
+	.name = "Kernel rodata",
+	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource data_resource = {
+	.name = "Kernel data",
+	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource bss_resource = {
+	.name = "Kernel bss",
+	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource __initdata *standard_resources[] = {
+	&code_resource,
+	&rodata_resource,
+	&data_resource,
+	&bss_resource,
+#ifdef CONFIG_KEXEC_CORE
+	&crashk_res,
+#endif
+};
+
+void __init memblock_setup_resources(void)
+{
+	struct resource *res, *kres, *sub_res;
+	phys_addr_t start, end;
+	int j;
+	u64 i;
+
+	code_resource.start = __pa_symbol(_text);
+	code_resource.end = __pa_symbol(_etext)-1;
+	rodata_resource.start = __pa_symbol(__start_rodata);
+	rodata_resource.end = __pa_symbol(__end_rodata)-1;
+	data_resource.start = __pa_symbol(_sdata);
+	data_resource.end = __pa_symbol(_edata)-1;
+	bss_resource.start = __pa_symbol(__bss_start);
+	bss_resource.end = __pa_symbol(__bss_stop)-1;
+
+	for_each_mem_range(i, &start, &end) {
+		res = memblock_alloc(sizeof(*res), 8);
+		if (!res)
+			panic("%s: Failed to allocate %zu bytes align=0x%x\n",
+			      __func__, sizeof(*res), 8);
+		res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
+
+		res->name = "System RAM";
+		res->start = start;
+
+		/*
+		 * In memblock, end points to the first byte after the
+		 * range while in resourses, end points to the last byte in
+		 * the range.
+		 */
+		res->end = end - 1;
+		request_resource(&iomem_resource, res);
+
+		for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
+			kres = standard_resources[j];
+			if (!kres->end || kres->start < res->start ||
+			    kres->start > res->end)
+				continue;
+			if (kres->end > res->end) {
+				sub_res = memblock_alloc(sizeof(*sub_res), 8);
+				if (!sub_res)
+					panic("%s: Failed to allocate %zu bytes align=0x%x\n",
+					      __func__, sizeof(*sub_res), 8);
+				*sub_res = *kres;
+				sub_res->end = res->end;
+				kres->start = res->end + 1;
+				request_resource(res, sub_res);
+			} else {
+				request_resource(res, kres);
+			}
+		}
+	}
+}
+
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)

 static int memblock_debug_show(struct seq_file *m, void *private)
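The inner loop of memblock_setup_resources() in the patch above clips a kernel resource when it extends past the end of a "System RAM" bank and carries the remainder over to the next bank. The following userspace model of that logic is a sketch for illustration only: `struct span`, `from_memblock()` and `place_in_bank()` are hypothetical stand-ins for struct resource and the request_resource() calls, and no resource tree is built.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive range, like struct resource: end is the last byte. */
struct span {
	uint64_t start;
	uint64_t end;
};

/* memblock ranges are [start, end); resources are [start, end]. */
static struct span from_memblock(uint64_t start, uint64_t end_exclusive)
{
	struct span r;

	assert(end_exclusive > start);
	r.start = start;
	r.end = end_exclusive - 1;
	return r;
}

/*
 * Mirror of the per-bank check in the patch. Returns 1 and fills *piece
 * with the part of *kres that would be registered under this bank, or
 * returns 0 if kres does not start inside the bank. When kres runs past
 * the end of the bank, its start is advanced so that the remainder can
 * be placed in a later bank.
 */
static int place_in_bank(struct span bank, struct span *kres,
			 struct span *piece)
{
	if (!kres->end || kres->start < bank.start || kres->start > bank.end)
		return 0;

	if (kres->end > bank.end) {
		piece->start = kres->start;
		piece->end = bank.end;
		kres->start = bank.end + 1;
	} else {
		*piece = *kres;
	}
	return 1;
}
```

For two adjacent banks [0x1000, 0x2000) and [0x2000, 0x3000) and a kernel range 0x1800-0x27ff, the first call yields 0x1800-0x1fff and the second yields 0x2000-0x27ff. Note that the check only looks at where the kernel range starts, so the model, like the patch, relies on banks being visited in ascending order.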