Message ID | 20221228143831.396245-1-Jason@zx2c4.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [qemu] x86: don't let decompressed kernel image clobber setup_data | expand |
Hi Jason, On 28/12/22 15:38, Jason A. Donenfeld wrote: > The setup_data links are appended to the compressed kernel image. Since > the kernel image is typically loaded at 0x100000, setup_data lives at > `0x100000 + compressed_size`, which does not get relocated during the > kernel's boot process. > > The kernel typically decompresses the image starting at address > 0x1000000 (note: there's one more zero there than the decompressed image > above). This usually is fine for most kernels. > > However, if the compressed image is actually quite large, then > setup_data will live at a `0x100000 + compressed_size` that extends into > the decompressed zone at 0x1000000. In other words, if compressed_size > is larger than `0x1000000 - 0x100000`, then the decompression step will > clobber setup_data, resulting in crashes. > > Fix this by detecting that possibility, and if it occurs, put setup_data > *after* the end of the decompressed kernel, so that it doesn't get > clobbered. > > One caveat is that this only works for images less than around 64 > megabytes, so just bail out in that case. This is unfortunate, but I > don't currently have a way of fixing it. > > Cc: x86@kernel.org > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > --- > hw/i386/x86.c | 30 ++++++++++++++++++++++++++++++ > 1 file changed, 30 insertions(+) > > diff --git a/hw/i386/x86.c b/hw/i386/x86.c > index 78cc131926..628fd2b2e9 100644 > --- a/hw/i386/x86.c > +++ b/hw/i386/x86.c > @@ -1077,6 +1077,36 @@ void x86_load_linux(X86MachineState *x86ms, > } > fclose(f); > > + /* If a setup_data is going to be used, make sure that the bootloader won't > + * decompress into it and clobber those bytes. */ > + if (dtb_filename || !legacy_no_rng_seed) { > + uint32_t payload_offset = ldl_p(setup + 0x248); > + uint32_t payload_length = ldl_p(setup + 0x24c); > + uint32_t target_address = ldl_p(setup + 0x258); Nitpicking, can the Linux kernel add these magic values in arch/x86/include/uapi/asm/bootparam.h? Or can we use offsetof(setup_header) to get them? > + uint32_t decompressed_length = ldl_p(kernel + payload_offset + payload_length - 4); > + > + uint32_t estimated_setup_data_length = 4096 * 16; > + uint32_t start_setup_data = prot_addr + kernel_size; > + uint32_t end_setup_data = start_setup_data + estimated_setup_data_length; > + uint32_t start_target = target_address; > + uint32_t end_target = target_address + decompressed_length; Maybe we can simply use 'unsigned' type. > + if ((start_setup_data >= start_target && start_setup_data < end_target) || > + (end_setup_data >= start_target && end_setup_data < end_target)) { > + uint32_t padded_size = target_address + decompressed_length - prot_addr; > + > + /* The early stage can't address past around 64 MB from the original > + * mapping, so just give up in that case. */ > + if (padded_size < 62 * 1024 * 1024) You mention 64 but check for 62, is that expected? You can use the MiB definitions to ease code review: 64 * MiB. > + kernel_size = padded_size; > + else { > + fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n"); > + dtb_filename = NULL; > + legacy_no_rng_seed = true; > + } > + } > + } Fix looks good, glad you figured out the problem. Regards, Phil.
On Wed, Dec 28, 2022 at 05:02:22PM +0100, Philippe Mathieu-Daudé wrote: > Hi Jason, > > On 28/12/22 15:38, Jason A. Donenfeld wrote: > > The setup_data links are appended to the compressed kernel image. Since > > the kernel image is typically loaded at 0x100000, setup_data lives at > > `0x100000 + compressed_size`, which does not get relocated during the > > kernel's boot process. > > > > The kernel typically decompresses the image starting at address > > 0x1000000 (note: there's one more zero there than the decompressed image *compressed image > > + uint32_t target_address = ldl_p(setup + 0x258); > > Nitpicking, can the Linux kernel add these magic values in > arch/x86/include/uapi/asm/bootparam.h? Or can we use > offsetof(setup_header) to get them? I suspect the reason that x86.c has always had those hardcoded offsets is because this is how it's documented in Documentation/x86/boot.rst? > > + if ((start_setup_data >= start_target && start_setup_data < end_target) || > > + (end_setup_data >= start_target && end_setup_data < end_target)) { > > + uint32_t padded_size = target_address + decompressed_length - prot_addr; > > + > > + /* The early stage can't address past around 64 MB from the original > > + * mapping, so just give up in that case. */ > > + if (padded_size < 62 * 1024 * 1024) > > You mention 64 but check for 62, is that expected? You can use the MiB > definitions to ease code review: 64 * MiB. 62 is intentional. But I'm still not really sure what's up. 63 doesn't work. I haven't totally worked out why this is, or why the 64 MiB limit exists in the first place. Maybe because this is a very early mapping set up by real mode? Or because another mapping is placed over it that's executable? There's that 2MiB*4096 gdt entry, but that'd cover all 4 gigs. So I really don't know yet. I'll continue to poke at it, but on the off chance somebody here understands what's up, that'd save me a bunch of head scratching. > Fix looks good, glad you figured out the problem. I mean, kind of. The solution here sucks, especially given that in the worst case, setup_data just gets dropped. I'm half inclined to consider this a kernel bug instead, and add some code to relocate setup_data prior to decompression, and then fix up all the links. It seems like this would be a lot more robust. I just wish the people who wrote this stuff would chime in. I've had x86@kernel.org CC'd but so far, no input from them. Jason
HELLO H. PETER ANVIN, E L L O On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote: > > Fix looks good, glad you figured out the problem. > > I mean, kind of. The solution here sucks, especially given that in the > worst case, setup_data just gets dropped. I'm half inclined to consider > this a kernel bug instead, and add some code to relocate setup_data > prior to decompression, and then fix up all the links. It seems like > this would be a lot more robust. > > I just wish the people who wrote this stuff would chime in. I've had > x86@kernel.org CC'd but so far, no input from them. Apparently you are the x86 boot guru. What do you want to happen here? Your input would be very instrumental. Jason
On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: >HELLO H. PETER ANVIN, >E >L >L >O > >On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote: >> > Fix looks good, glad you figured out the problem. >> >> I mean, kind of. The solution here sucks, especially given that in the >> worst case, setup_data just gets dropped. I'm half inclined to consider >> this a kernel bug instead, and add some code to relocate setup_data >> prior to decompression, and then fix up all the links. It seems like >> this would be a lot more robust. >> >> I just wish the people who wrote this stuff would chime in. I've had >> x86@kernel.org CC'd but so far, no input from them. > >Apparently you are the x86 boot guru. What do you want to happen here? >Your input would be very instrumental. > >Jason Hi! Glad you asked. So the kernel load addresses are parameterized in the kernel image setup header. One of the things that are so parameterized are the size and possible realignment of the kernel image in memory. I'm very confused where you are getting the 64 MB number from. There should not be any such limitation. In general, setup_data should be able to go anywhere the initrd can go, and so is subject to the same address cap (896 MB for old kernels, 4 GB on newer ones; this address too is enumerated in the header.) If you want to put setup_data above 4 GB, it *should* be ok if and only if the kernel supports loading the initrd high, too (again, enumerated in the header. TL;DR: put setup_data where you put the initrd (before or after doesn't matter.) To be maximally conservative, link the setup_data list in order from lowest to highest address; currently there is no such item of relevance, but in the future there may be setup_data items needed by the BIOS part of the bootstrap in which case they would have to be < 1 MB and precede any items > 1 MB for obvious reasons. That being said, with BIOS dying it is not all that likely that such entries will ever be needed.
On 12/28/22 15:58, H. Peter Anvin wrote: > On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: >> HELLO H. PETER ANVIN, >> E >> L >> L >> O >> >> On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote: >>>> Fix looks good, glad you figured out the problem. >>> >>> I mean, kind of. The solution here sucks, especially given that in the >>> worst case, setup_data just gets dropped. I'm half inclined to consider >>> this a kernel bug instead, and add some code to relocate setup_data >>> prior to decompression, and then fix up all the links. It seems like >>> this would be a lot more robust. >>> >>> I just wish the people who wrote this stuff would chime in. I've had >>> x86@kernel.org CC'd but so far, no input from them. >> >> Apparently you are the x86 boot guru. What do you want to happen here? >> Your input would be very instrumental. >> >> Jason > > Hi! > > Glad you asked. > > So the kernel load addresses are parameterized in the kernel image > setup header. One of the things that are so parameterized are the > size and possible realignment of the kernel image in memory. > > I'm very confused where you are getting the 64 MB number from. There > should not be any such limitation. > > In general, setup_data should be able to go anywhere the initrd can > go, and so is subject to the same address cap (896 MB for old > kernels, 4 GB on newer ones; this address too is enumerated in the > header.) > > If you want to put setup_data above 4 GB, it *should* be ok if and > only if the kernel supports loading the initrd high, too (again, > enumerated in the header. > > TL;DR: put setup_data where you put the initrd (before or after > doesn't matter.) > > To be maximally conservative, link the setup_data list in order from > lowest to highest address; currently there is no such item of > relevance, but in the future there may be setup_data items needed by > the BIOS part of the bootstrap in which case they would have to be < > 1 MB and precede any items > 1 MB for obvious reasons. That being > said, with BIOS dying it is not all that likely that such entries > will ever be needed. > So let me try for an algorithm. Attached as a text file to avoid line break damage. -hpa Here is an attempted description with pseudo-C code: First of all, take a 4K page of memory and *initialize it to zero*. { #include <asm/bootparam.h> /* From the uapi kernel sources */ /* Allocated somewhere in your code... */ extern unsigned char *kernel_image; /* Kernel file */ extern struct boot_params *boot_params; /* 4K buffer */ extern uint32_t kernel_image_size; /* Size of kernel file */ /* Callbacks into your code */ extern bool is_bios_boot(void); extern uint32_t end_of_low_memory(void); /* For BIOS boot */ /* * This MUST return an alignment address between start_address * and max_address... */ extern uint64_t maybe_relocate_kernel(uint64_t start_address, uint64_t max_address, uint32_t alignment); /* * Convenience pointer into the kernel image; modifications * done here should be reflected in the loaded kernel image */ struct setup_header * const kernel_setup_header = (struct setup_header *)(kernel_image + 0x1f1); /* Initialize boot_params to zero!!! */ memset(boot_params, 0, sizeof *boot_params); } Copy the setup header starting at file offset 0x1f1 to offset 0x1f1 into that page: { int setup_length = kernel_setup_header->header == 0x53726448 ? (kernel_setup_header->jump >> 8) + 17 : 15; memcpy(&boot_params->hdr, kernel_setup_header, setup_length); } Now you can compute values including ones are omitted by older kernels: { /* * Split between the part of the kernel to be loaded into * low memory (for 16-bit boot, otherwise it can be safely * omitted) and the part to be loaded into high memory. */ if (!boot_params->hdr.setup_sects) boot_param->hdr.setup_sects = 4; int high_kernel_start = (boot_param->hdr.setup_sects+1) << 9; /* * Highest permitted address for the high part of the kernel image, * initrd, command line (*except for 16-bit boot*), and setup_data * * max_initrd_addr here is exclusive */ uint64_t max_initrd_addr = (uint64_t)boot_params->hdr.initrd_addr_max + 1; if (boot_params->hdr.version < 0x0200) max_initrd_addr = 0; /* No initrd supported */ else if (boot_params->hdr.version < 0x0203) max_initrd_addr = 0x38000000; else if (boot_params->hdr.version >= 0x020c && (boot_params->hdr.xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) max_initrd_addr = (uint64_t)1 << 52; /* Architecture-imposed limit */ /* * Maximum command line size *including terminating null* */ unsigned int cmdline_size; if (boot_params->hdr.version < 0x0200) cmdline_size = 0; /* No command line supported */ else if (boot_params->hdr.version < 0x0206) boot_params->hdr.cmdline_size = 256; else boot_params->hdr.cmdline_size + 1; /* Command line size including terminating null */ /* * Load addresses for the low and high kernels, respectively */ uint32_t low_kernel_address; uint64_t cmdline_addr; /* Address to load the command line */ if (is_bios_boot()) { if (!(boot_params->hdr.loadflags & LOADED_HIGH)) { low_kernel_address = 0x90000; } else { /* * Recommended to be the lowest available address between * 0x10000 and 0x90000 */ low_kernel_address = preferred_low_kernel_address(); } uint32_t lowkernel_max; lowkernel_max = low_kernel_address + 0x10000; if (boot_params.hdr.version >= 0x0202) lowkernel_max += (cmdline_size + 15) & ~15; /* * end_of_low_memory() is usually given by *(uint8_t *)0x413 << 10 */ if (lowkernel_max > end_of_low_memory()) lowkernel_max = end_of_low_memory(); cmdline_addr = (lowkernel_max - cmdline_size) & ~15; if (boot_params->hdr.version >= 0x0202) kernel_setup_header->cmd_line_ptr = cmdline_addr; else if (boot_params->hdr.version >= 0x0200) kernel_setup_header->setup_move_size = lowkernel_max - low_kernel_address; if (boot_params.hdr.version >= 0x0201) { kernel_setup_header->heap_end_ptr = cmdline_addr - low_kernel_address - 0x0200; kernel_setup_header->loadflags |= CAN_USE_HEAP; } } else { low_kernel_address = 0; /* Not used for non-BIOS boot */ cmdline_addr = 0; /* Not assigned yet */ } /* * Default load address for the high kernel, and if it can be relocated */ uint64_t high_kernel_address; uint32_t high_kernel_size; /* The amount of memory the high kernel needs */ bool relocatable_kernel = false; uint32_t high_kernel_alignment = 0x400000; /* Kernel runtime alignment */ if (!(boot_params->hdr.loadflags & LOADED_HIGH)) { high_kernel_address = 0x10000; } else { if (boot_params->hdr.version >= 0x020a) high_kernel_address = boot_params->hdr.pref_address; else high_kernel_address = 0x100000; if (boot_params->hdr.version >= 0x0205 && boot_params->hdr.relocatable_kernel) { relocatable_kernel = true; high_kernel_alignment = boot_params->hdr.kernel_alignment; } } /* * Linear memory area needed by the kernel */ uint32_t kernel_mem_size; if (boot_params->hdr.version >= 0x020a) kernel_mem_size = boot_params->hdr.init_size; else kernel_mem_size = kernel_image_size << 2; /* Pure guesswork... */ /* Relocate the kernel load address if desired */ if (relocatable_kernel) { high_kernel_address = maybe_relocate_kernel(high_kernel_address, max_initrd_addr - kernel_mem_size, high_kernel_aligment); } /* Adjust for possible internal kernel realigment */ kernel_mem_size += (-high_kernel_address) & (high_kernel_alignment - 1); /* * Determine the minimum safe address for loading initrd, setup_data, * and, if cmdline_addr == 0 (i.e. !is_bios_boot()), the command line. */ uint64_t min_initrd_addr = high_kernel_address + kernel_mem_size; }
Hi, Read this message in a fixed width text editor with a lot of columns. On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: > Glad you asked. > > So the kernel load addresses are parameterized in the kernel image > setup header. One of the things that are so parameterized are the size > and possible realignment of the kernel image in memory. > > I'm very confused where you are getting the 64 MB number from. There > should not be any such limitation. Currently, QEMU appends it to the kernel image, not to the initramfs as you suggest below. So, that winds up looking, currently, like: kernel image setup_data |--------------------------||----------------| 0x100000 0x100000+l1 0x100000+l1+l2 The problem is that this decompresses to 0x1000000 (one more zero). So if l1 is > (0x1000000-0x100000), then this winds up looking like: kernel image setup_data |--------------------------||----------------| 0x100000 0x100000+l1 0x100000+l1+l2 d e c o m p r e s s e d k e r n e l |-------------------------------------------------------------| 0x1000000 0x1000000+l3 The decompressed kernel seemingly overwriting the compressed kernel image isn't a problem, because that gets relocated to a higher address early on in the boot process. setup_data, however, stays in the same place, since those links are self referential and nothing fixes them up. So the decompressed kernel clobbers it. The solution in this commit adds a bunch of padding between the kernel image and setup_data to avoid this. That looks like this: kernel image padding setup_data |--------------------------||---------------------------------------------------||----------------| 0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2 d e c o m p r e s s e d k e r n e l |-------------------------------------------------------------| 0x1000000 0x1000000+l3 This way, the decompressed kernel doesn't clobber setup_data. The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes, then the bootloader crashes when trying to dereference setup_data's ->len param at the end of initialize_identity_maps() in ident_map_64.c. I don't know why it does this. If I could remove the 62 megabyte restriction, then I could keep with this technique and all would be well. > In general, setup_data should be able to go anywhere the initrd can > go, and so is subject to the same address cap (896 MB for old kernels, > 4 GB on newer ones; this address too is enumerated in the header.) It would be theoretically possible to attach it to the initrd image instead of to the kernel image. As a last resort, I guess I can look into doing that. However, that's going to require some serious rework and plumbing of a lot of different components. So if I can make it work as is, that'd be ideal. However, I need to figure out this weird 62 meg limitation. Any ideas on that? Jason
On 29/12/22 03:31, Jason A. Donenfeld wrote: > Hi, > > Read this message in a fixed width text editor with a lot of columns. > > On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> Glad you asked. >> >> So the kernel load addresses are parameterized in the kernel image >> setup header. One of the things that are so parameterized are the size >> and possible realignment of the kernel image in memory. >> >> I'm very confused where you are getting the 64 MB number from. There >> should not be any such limitation. [...] Thanks for the diagrams. Feel free to include them in the commit description ;) >> In general, setup_data should be able to go anywhere the initrd can >> go, and so is subject to the same address cap (896 MB for old kernels, >> 4 GB on newer ones; this address too is enumerated in the header.) > > It would be theoretically possible to attach it to the initrd image > instead of to the kernel image. As a last resort, I guess I can look > into doing that. However, that's going to require some serious rework > and plumbing of a lot of different components. So if I can make it work > as is, that'd be ideal. However, I need to figure out this weird 62 meg > limitation. > > Any ideas on that? Could it be a limitation (internal buffer) of the decompressor?
On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: >Hi, > >Read this message in a fixed width text editor with a lot of columns. > >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> Glad you asked. >> >> So the kernel load addresses are parameterized in the kernel image >> setup header. One of the things that are so parameterized are the size >> and possible realignment of the kernel image in memory. >> >> I'm very confused where you are getting the 64 MB number from. There >> should not be any such limitation. > >Currently, QEMU appends it to the kernel image, not to the initramfs as >you suggest below. So, that winds up looking, currently, like: > > kernel image setup_data > |--------------------------||----------------| >0x100000 0x100000+l1 0x100000+l1+l2 > >The problem is that this decompresses to 0x1000000 (one more zero). So >if l1 is > (0x1000000-0x100000), then this winds up looking like: > > kernel image setup_data > |--------------------------||----------------| >0x100000 0x100000+l1 0x100000+l1+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > >The decompressed kernel seemingly overwriting the compressed kernel >image isn't a problem, because that gets relocated to a higher address >early on in the boot process. setup_data, however, stays in the same >place, since those links are self referential and nothing fixes them up. >So the decompressed kernel clobbers it. > >The solution in this commit adds a bunch of padding between the kernel >image and setup_data to avoid this. That looks like this: > > kernel image padding setup_data > |--------------------------||---------------------------------------------------||----------------| >0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > >This way, the decompressed kernel doesn't clobber setup_data. > >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes, >then the bootloader crashes when trying to dereference setup_data's >->len param at the end of initialize_identity_maps() in ident_map_64.c. >I don't know why it does this. If I could remove the 62 megabyte >restriction, then I could keep with this technique and all would be >well. > >> In general, setup_data should be able to go anywhere the initrd can >> go, and so is subject to the same address cap (896 MB for old kernels, >> 4 GB on newer ones; this address too is enumerated in the header.) > >It would be theoretically possible to attach it to the initrd image >instead of to the kernel image. As a last resort, I guess I can look >into doing that. However, that's going to require some serious rework >and plumbing of a lot of different components. So if I can make it work >as is, that'd be ideal. However, I need to figure out this weird 62 meg >limitation. > >Any ideas on that? > >Jason Ok, the code I sent will figure out the minimum amount of padding that you need (min_initrd_addr) as well.
On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: >Hi, > >Read this message in a fixed width text editor with a lot of columns. > >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> Glad you asked. >> >> So the kernel load addresses are parameterized in the kernel image >> setup header. One of the things that are so parameterized are the size >> and possible realignment of the kernel image in memory. >> >> I'm very confused where you are getting the 64 MB number from. There >> should not be any such limitation. > >Currently, QEMU appends it to the kernel image, not to the initramfs as >you suggest below. So, that winds up looking, currently, like: > > kernel image setup_data > |--------------------------||----------------| >0x100000 0x100000+l1 0x100000+l1+l2 > >The problem is that this decompresses to 0x1000000 (one more zero). So >if l1 is > (0x1000000-0x100000), then this winds up looking like: > > kernel image setup_data > |--------------------------||----------------| >0x100000 0x100000+l1 0x100000+l1+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > >The decompressed kernel seemingly overwriting the compressed kernel >image isn't a problem, because that gets relocated to a higher address >early on in the boot process. setup_data, however, stays in the same >place, since those links are self referential and nothing fixes them up. >So the decompressed kernel clobbers it. > >The solution in this commit adds a bunch of padding between the kernel >image and setup_data to avoid this. That looks like this: > > kernel image padding setup_data > |--------------------------||---------------------------------------------------||----------------| >0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > >This way, the decompressed kernel doesn't clobber setup_data. > >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes, >then the bootloader crashes when trying to dereference setup_data's >->len param at the end of initialize_identity_maps() in ident_map_64.c. >I don't know why it does this. If I could remove the 62 megabyte >restriction, then I could keep with this technique and all would be >well. > >> In general, setup_data should be able to go anywhere the initrd can >> go, and so is subject to the same address cap (896 MB for old kernels, >> 4 GB on newer ones; this address too is enumerated in the header.) > >It would be theoretically possible to attach it to the initrd image >instead of to the kernel image. As a last resort, I guess I can look >into doing that. However, that's going to require some serious rework >and plumbing of a lot of different components. So if I can make it work >as is, that'd be ideal. However, I need to figure out this weird 62 meg >limitation. > >Any ideas on that? > >Jason As far as a crash... that sounds like a big and a pretty serious one at that. Could you let me know what kernel you are using and how *exactly* you are booting it?
On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote: > As far as a crash... that sounds like a big and a pretty serious one at that. > > Could you let me know what kernel you are using and how *exactly* you are booting it? Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says: early console in extract_kernel input_data: 0x000000000be073a8 input_len: 0x00000000008cfc43 output: 0x0000000001000000 output_len: 0x000000000b600a98 kernel_total_size: 0x000000000ac26000 needed_size: 0x000000000b800000 trampoline_32bit: 0x000000000009d000 so that's a ~9M kernel which gets decompressed at 0x1000000 and the output len is, what, ~180M which looks like plenty to me...
On Thu, Dec 29, 2022 at 01:47:49PM +0100, Borislav Petkov wrote: > On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote: > > As far as a crash... that sounds like a big and a pretty serious one at that. > > > > Could you let me know what kernel you are using and how *exactly* you are booting it? > > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says: > > early console in extract_kernel > input_data: 0x000000000be073a8 > input_len: 0x00000000008cfc43 > output: 0x0000000001000000 > output_len: 0x000000000b600a98 > kernel_total_size: 0x000000000ac26000 > needed_size: 0x000000000b800000 > trampoline_32bit: 0x000000000009d000 > > so that's a ~9M kernel which gets decompressed at 0x1000000 and the > output len is, what, ~180M which looks like plenty to me... I think you might have misunderstood the thread. First, to reproduce the bug that this patch fixes, you need a kernel with a compressed size of around 16 megs, not 9. Secondly, that crash is well understood and doesn't need to be reproduced; this patch fixes it. Rather, the question now is how to improve this patch to remove the 62 meg limit. I'll follow up with hpa's request for reproduction info. Jason
Hi, On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote: > On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: > >Hi, > > > >Read this message in a fixed width text editor with a lot of columns. > > > >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: > >> Glad you asked. > >> > >> So the kernel load addresses are parameterized in the kernel image > >> setup header. One of the things that are so parameterized are the size > >> and possible realignment of the kernel image in memory. > >> > >> I'm very confused where you are getting the 64 MB number from. There > >> should not be any such limitation. > > > >Currently, QEMU appends it to the kernel image, not to the initramfs as > >you suggest below. So, that winds up looking, currently, like: > > > > kernel image setup_data > > |--------------------------||----------------| > >0x100000 0x100000+l1 0x100000+l1+l2 > > > >The problem is that this decompresses to 0x1000000 (one more zero). So > >if l1 is > (0x1000000-0x100000), then this winds up looking like: > > > > kernel image setup_data > > |--------------------------||----------------| > >0x100000 0x100000+l1 0x100000+l1+l2 > > > > d e c o m p r e s s e d k e r n e l > > |-------------------------------------------------------------| > > 0x1000000 0x1000000+l3 > > > >The decompressed kernel seemingly overwriting the compressed kernel > >image isn't a problem, because that gets relocated to a higher address > >early on in the boot process. setup_data, however, stays in the same > >place, since those links are self referential and nothing fixes them up. > >So the decompressed kernel clobbers it. > > > >The solution in this commit adds a bunch of padding between the kernel > >image and setup_data to avoid this. That looks like this: > > > > kernel image padding setup_data > > |--------------------------||---------------------------------------------------||----------------| > >0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2 > > > > d e c o m p r e s s e d k e r n e l > > |-------------------------------------------------------------| > > 0x1000000 0x1000000+l3 > > > >This way, the decompressed kernel doesn't clobber setup_data. > > > >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes, > >then the bootloader crashes when trying to dereference setup_data's > >->len param at the end of initialize_identity_maps() in ident_map_64.c. > >I don't know why it does this. If I could remove the 62 megabyte > >restriction, then I could keep with this technique and all would be > >well. > > > >> In general, setup_data should be able to go anywhere the initrd can > >> go, and so is subject to the same address cap (896 MB for old kernels, > >> 4 GB on newer ones; this address too is enumerated in the header.) > > > >It would be theoretically possible to attach it to the initrd image > >instead of to the kernel image. As a last resort, I guess I can look > >into doing that. However, that's going to require some serious rework > >and plumbing of a lot of different components. So if I can make it work > >as is, that'd be ideal. However, I need to figure out this weird 62 meg > >limitation. > > > >Any ideas on that? > > > >Jason > > As far as a crash... that sounds like a big and a pretty serious one at that. > > Could you let me know what kernel you are using and how *exactly* you are booting it? I'll attach a .config file. Apply the patch at the top of this thread to qemu, except make one modification: diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 628fd2b2e9..a61ee23e13 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms, /* The early stage can't address past around 64 MB from the original * mapping, so just give up in that case. */ - if (padded_size < 62 * 1024 * 1024) + if (true || padded_size < 62 * 1024 * 1024) kernel_size = padded_size; else { fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n"); Then build qemu. Run it with `-kernel bzImage`, based on the kernel built with the .config I attached. You'll see that the CPU triple faults when hitting this line: sd = (struct setup_data *)boot_params->hdr.setup_data; while (sd) { unsigned long sd_addr = (unsigned long)sd; kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <---- sd = (struct setup_data *)sd->next; } , because it dereferences *sd. This does not happen if the decompressed size of the kernel is < 62 megs. So that's the "big and pretty serious" bug that might be worthy of investigation. Jason
Er, .config attached now.
On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote: > > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says: > > > > early console in extract_kernel > > input_data: 0x000000000be073a8 > > input_len: 0x00000000008cfc43 > > output: 0x0000000001000000 > > output_len: 0x000000000b600a98 > > kernel_total_size: 0x000000000ac26000 > > needed_size: 0x000000000b800000 > > trampoline_32bit: 0x000000000009d000 > > > > so that's a ~9M kernel which gets decompressed at 0x1000000 and the > > output len is, what, ~180M which looks like plenty to me... > > I think you might have misunderstood the thread. Maybe... I've been trying to make sense of all the decompression dancing we're doing and the diagrams you've drawn and there's a comment over extract_kernel() which basically says that the compressed image is at the back of the memory buffer input_data: 0x000000000be073a8 and we decompress to a smaller address output: 0x0000000001000000 And, it *looks* like setup_data is at an address somewhere >= 0x1000000 so when we start decompressing, we overwrite it. I guess extract_kernel() could also dump the setup_data addresses so that we can verify that... > First, to reproduce the bug that this patch fixes, you need a kernel with a > compressed size of around 16 megs, not 9. Not only that - you need setup_data to be placed somewhere just so that it gets overwritten during decompression. Also, see below. > Secondly, that crash is well understood and doesn't need to be reproduced; Is it? Where do you get the 0x100000 as the starting address of the kernel image? Because input_data above says that the input address is somewhere higher... Btw, here's an allyesconfig: early console in setup code No EFI environment detected. early console in extract_kernel input_data: 0x000000001bd003a8 input_len: 0x000000000cde2963 output: 0x0000000001000000 output_len: 0x0000000027849d74 kernel_total_size: 0x0000000025830000 needed_size: 0x0000000027a00000 trampoline_32bit: 0x000000000009d000 Physical KASLR using RDRAND RDTSC... Virtual KASLR using RDRAND RDTSC... That kernel has compressed size of ~205M and the output buffer is of size ~632M. > this patch fixes it. Rather, the question now is how to improve this patch > to remove the 62 meg limit. Yeah, I'm wondering what that 62M limit is too, actually. But maybe I'm misunderstanding...
On Fri, Dec 30, 2022 at 6:01 PM Borislav Petkov <bp@alien8.de> wrote: > > On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote: > > > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says: > > > > > > early console in extract_kernel > > > input_data: 0x000000000be073a8 > > > input_len: 0x00000000008cfc43 > > > output: 0x0000000001000000 > > > output_len: 0x000000000b600a98 > > > kernel_total_size: 0x000000000ac26000 > > > needed_size: 0x000000000b800000 > > > trampoline_32bit: 0x000000000009d000 > > > > > > so that's a ~9M kernel which gets decompressed at 0x1000000 and the > > > output len is, what, ~180M which looks like plenty to me... > > > > I think you might have misunderstood the thread. > > Maybe... > > I've been trying to make sense of all the decompression dancing we're doing and > the diagrams you've drawn and there's a comment over extract_kernel() which > basically says that the compressed image is at the back of the memory buffer > > input_data: 0x000000000be073a8 > > and we decompress to a smaller address > > output: 0x0000000001000000 > > And, it *looks* like setup_data is at an address somewhere >= 0x1000000 so when > we start decompressing, we overwrite it. > > I guess extract_kernel() could also dump the setup_data addresses so that we can > verify that... > > > First, to reproduce the bug that this patch fixes, you need a kernel with a > > compressed size of around 16 megs, not 9. > > Not only that - you need setup_data to be placed somewhere just so that it gets > overwritten during decompression. Also, see below. > > > Secondly, that crash is well understood and doesn't need to be reproduced; > > Is it? > > Where do you get the 0x100000 as the starting address of the kernel image? > > Because input_data above says that the input address is somewhere higher... Look closer at the boot process. The compressed image is initially at 0x100000, but it gets relocated to a safer area at the end of startup_64: /* * Copy the compressed kernel to the end of our buffer * where decompression in place becomes safe. */ pushq %rsi leaq (_bss-8)(%rip), %rsi leaq rva(_bss-8)(%rbx), %rdi movl $(_bss - startup_32), %ecx shrl $3, %ecx std rep movsq cld popq %rsi /* * The GDT may get overwritten either during the copy we just did or * during extract_kernel below. To avoid any issues, repoint the GDTR * to the new copy of the GDT. */ leaq rva(gdt64)(%rbx), %rax leaq rva(gdt)(%rbx), %rdx movq %rdx, 2(%rax) lgdt (%rax) /* * Jump to the relocated address. */ leaq rva(.Lrelocated)(%rbx), %rax jmp *%rax And that address winds up being calculated with a combination of the image size and the init_size param, so it's safe. Decompression won't overwrite the compressed kernel. HOWEVER, qemu currently appends setup_data to the end of the compressed kernel image, and this part isn't moved, and setup_data links aren't walked/relocated. So that means the original address remains, of 0x100000. Jason
On Wed, Dec 28, 2022 at 03:38:30PM +0100, Jason A. Donenfeld wrote: > The setup_data links are appended to the compressed kernel image. Since > the kernel image is typically loaded at 0x100000, setup_data lives at > `0x100000 + compressed_size`, which does not get relocated during the > kernel's boot process. > > The kernel typically decompresses the image starting at address > 0x1000000 (note: there's one more zero there than the decompressed image > above). This usually is fine for most kernels. > > However, if the compressed image is actually quite large, then > setup_data will live at a `0x100000 + compressed_size` that extends into > the decompressed zone at 0x1000000. In other words, if compressed_size > is larger than `0x1000000 - 0x100000`, then the decompression step will > clobber setup_data, resulting in crashes. > > Fix this by detecting that possibility, and if it occurs, put setup_data > *after* the end of the decompressed kernel, so that it doesn't get > clobbered. > > One caveat is that this only works for images less than around 64 > megabytes, so just bail out in that case. This is unfortunate, but I > don't currently have a way of fixing it. I've got a different solution now that tries to piggy back on cmdline. I'll send a v2. It avoids the 62MiB crash situation of this one and seems to work fine.
On December 30, 2022 7:59:30 AM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: >Hi, > >On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote: >> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote: >> >Hi, >> > >> >Read this message in a fixed width text editor with a lot of columns. >> > >> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> >> Glad you asked. >> >> >> >> So the kernel load addresses are parameterized in the kernel image >> >> setup header. One of the things that are so parameterized are the size >> >> and possible realignment of the kernel image in memory. >> >> >> >> I'm very confused where you are getting the 64 MB number from. There >> >> should not be any such limitation. >> > >> >Currently, QEMU appends it to the kernel image, not to the initramfs as >> >you suggest below. So, that winds up looking, currently, like: >> > >> > kernel image setup_data >> > |--------------------------||----------------| >> >0x100000 0x100000+l1 0x100000+l1+l2 >> > >> >The problem is that this decompresses to 0x1000000 (one more zero). So >> >if l1 is > (0x1000000-0x100000), then this winds up looking like: >> > >> > kernel image setup_data >> > |--------------------------||----------------| >> >0x100000 0x100000+l1 0x100000+l1+l2 >> > >> > d e c o m p r e s s e d k e r n e l >> > |-------------------------------------------------------------| >> > 0x1000000 0x1000000+l3 >> > >> >The decompressed kernel seemingly overwriting the compressed kernel >> >image isn't a problem, because that gets relocated to a higher address >> >early on in the boot process. setup_data, however, stays in the same >> >place, since those links are self referential and nothing fixes them up. >> >So the decompressed kernel clobbers it. >> > >> >The solution in this commit adds a bunch of padding between the kernel >> >image and setup_data to avoid this. That looks like this: >> > >> > kernel image padding setup_data >> > |--------------------------||---------------------------------------------------||----------------| >> >0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2 >> > >> > d e c o m p r e s s e d k e r n e l >> > |-------------------------------------------------------------| >> > 0x1000000 0x1000000+l3 >> > >> >This way, the decompressed kernel doesn't clobber setup_data. >> > >> >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes, >> >then the bootloader crashes when trying to dereference setup_data's >> >->len param at the end of initialize_identity_maps() in ident_map_64.c. >> >I don't know why it does this. If I could remove the 62 megabyte >> >restriction, then I could keep with this technique and all would be >> >well. >> > >> >> In general, setup_data should be able to go anywhere the initrd can >> >> go, and so is subject to the same address cap (896 MB for old kernels, >> >> 4 GB on newer ones; this address too is enumerated in the header.) >> > >> >It would be theoretically possible to attach it to the initrd image >> >instead of to the kernel image. As a last resort, I guess I can look >> >into doing that. However, that's going to require some serious rework >> >and plumbing of a lot of different components. So if I can make it work >> >as is, that'd be ideal. However, I need to figure out this weird 62 meg >> >limitation. >> > >> >Any ideas on that? >> > >> >Jason >> >> As far as a crash... that sounds like a big and a pretty serious one at that. >> >> Could you let me know what kernel you are using and how *exactly* you are booting it? > >I'll attach a .config file. Apply the patch at the top of this thread to >qemu, except make one modification: > >diff --git a/hw/i386/x86.c b/hw/i386/x86.c >index 628fd2b2e9..a61ee23e13 100644 >--- a/hw/i386/x86.c >+++ b/hw/i386/x86.c >@@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms, > > /* The early stage can't address past around 64 MB from the original > * mapping, so just give up in that case. */ >- if (padded_size < 62 * 1024 * 1024) >+ if (true || padded_size < 62 * 1024 * 1024) > kernel_size = padded_size; > else { > fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n"); > >Then build qemu. Run it with `-kernel bzImage`, based on the kernel >built with the .config I attached. > >You'll see that the CPU triple faults when hitting this line: > > sd = (struct setup_data *)boot_params->hdr.setup_data; > while (sd) { > unsigned long sd_addr = (unsigned long)sd; > > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <---- > sd = (struct setup_data *)sd->next; > } > >, because it dereferences *sd. This does not happen if the decompressed >size of the kernel is < 62 megs. > >So that's the "big and pretty serious" bug that might be worthy of >investigation. > >Jason No kidding. Dereferencing data *before you map it* is generally frowned upon. This needs to be split into to making calls. *Facepalm*
On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote: > Look closer at the boot process. The compressed image is initially at > 0x100000, but it gets relocated to a safer area at the end of > startup_64: That is the address we're executing here from, rip here looks like 0x100xxx. > /* > * Copy the compressed kernel to the end of our buffer > * where decompression in place becomes safe. > */ > pushq %rsi > leaq (_bss-8)(%rip), %rsi > leaq rva(_bss-8)(%rbx), %rdi when you get to here, it looks something like this: leaq (_bss-8)(%rip), %rsi # 0x9e7ff8 leaq rva(_bss-8)(%rbx), %rdi # 0xc6eeff8 so the source address is that _bss thing and we copy... > movl $(_bss - startup_32), %ecx > shrl $3, %ecx > std ... backwards since DF=1. Up to: # rsi = 0xffff8 # rdi = 0xbe06ff8 Ok, so the source address is 0x100000. Good. > HOWEVER, qemu currently appends setup_data to the end of the > compressed kernel image, Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part which is compressed/head_64.S and which does the above and the relocation etc. > and this part isn't moved, and setup_data links aren't walked/relocated. So > that means the original address remains, of 0x100000. See above: when it starts copying the kernel image backwards to a higher address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data *after* that address. And that doesn't get copied ofc. So far, so good. Now later, we extract the compressed kernel created with the mkpiggy magic: input_data: .incbin "arch/x86/boot/compressed/vmlinux.bin.gz" input_data_end: by doing /* * Do the extraction, and jump to the new kernel.. */ pushq %rsi /* Save the real mode argument */ 0x13d00 movq %rsi, %rdi /* real mode address */ 0x13d00 leaq boot_heap(%rip), %rsi /* malloc area for uncompression */ 0xc6ef000 leaq input_data(%rip), %rdx /* input_data */ 0xbe073a8 movl input_len(%rip), %ecx /* input_len */ 0x8cfe13 movq %rbp, %r8 /* output target address */ 0x1000000 movl output_len(%rip), %r9d /* decompressed length, end of relocs */ call extract_kernel /* returns kernel location in %rax */ popq %rsi (actual addresses at the end.) Now, when you say you triplefault somewhere in initialize_identity_maps() when trying to access setup_data, then if you look a couple of lines before that call we do call load_stage2_idt which sets up a boottime #PF handler do_boot_page_fault() and it actually does call kernel_add_identity_map() so *actually* it should map any unmapped setup_data addresses. So why doesn't it do that and why do you triplefault? Hmmm.
On December 30, 2022 11:54:11 AM PST, Borislav Petkov <bp@alien8.de> wrote: >On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote: >> Look closer at the boot process. The compressed image is initially at >> 0x100000, but it gets relocated to a safer area at the end of >> startup_64: > >That is the address we're executing here from, rip here looks like 0x100xxx. > >> /* >> * Copy the compressed kernel to the end of our buffer >> * where decompression in place becomes safe. >> */ >> pushq %rsi >> leaq (_bss-8)(%rip), %rsi >> leaq rva(_bss-8)(%rbx), %rdi > >when you get to here, it looks something like this: > > leaq (_bss-8)(%rip), %rsi # 0x9e7ff8 > leaq rva(_bss-8)(%rbx), %rdi # 0xc6eeff8 > >so the source address is that _bss thing and we copy... > >> movl $(_bss - startup_32), %ecx >> shrl $3, %ecx >> std > >... backwards since DF=1. > >Up to: > ># rsi = 0xffff8 ># rdi = 0xbe06ff8 > >Ok, so the source address is 0x100000. Good. > >> HOWEVER, qemu currently appends setup_data to the end of the >> compressed kernel image, > >Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part >which is compressed/head_64.S and which does the above and the relocation etc. > >> and this part isn't moved, and setup_data links aren't walked/relocated. So >> that means the original address remains, of 0x100000. > >See above: when it starts copying the kernel image backwards to a higher >address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data >*after* that address. And that doesn't get copied ofc. > >So far, so good. > >Now later, we extract the compressed kernel created with the mkpiggy magic: > >input_data: >.incbin "arch/x86/boot/compressed/vmlinux.bin.gz" >input_data_end: > >by doing > >/* > * Do the extraction, and jump to the new kernel.. > */ > > pushq %rsi /* Save the real mode argument */ 0x13d00 > movq %rsi, %rdi /* real mode address */ 0x13d00 > leaq boot_heap(%rip), %rsi /* malloc area for uncompression */ 0xc6ef000 > leaq input_data(%rip), %rdx /* input_data */ 0xbe073a8 > movl input_len(%rip), %ecx /* input_len */ 0x8cfe13 > movq %rbp, %r8 /* output target address */ 0x1000000 > movl output_len(%rip), %r9d /* decompressed length, end of relocs */ > call extract_kernel /* returns kernel location in %rax */ > popq %rsi > >(actual addresses at the end.) > >Now, when you say you triplefault somewhere in initialize_identity_maps() when >trying to access setup_data, then if you look a couple of lines before that call >we do > > call load_stage2_idt > >which sets up a boottime #PF handler do_boot_page_fault() and it actually does >call kernel_add_identity_map() so *actually* it should map any unmapped >setup_data addresses. > >So why doesn't it do that and why do you triplefault? > >Hmmm. > See the other thread fork. They have identified the problem already.
On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
> See the other thread fork. They have identified the problem already.
Not sure I follow. Is there another thread where somebody worked out why
this 62meg limit was happening?
Note that I sent v2/v3, to fix the original problem in a different way,
and if that looks good to the QEMU maintainers, then we can all be happy
with that. But I *haven't* addressed and still don't fully understand
why the 62meg limit applied to my v1 in the way it does. Did you find a
bug there to fix? If so, please do CC me.
Jason
On 12/30/22 14:10, Jason A. Donenfeld wrote: > On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote: >> See the other thread fork. They have identified the problem already. > > Not sure I follow. Is there another thread where somebody worked out why > this 62meg limit was happening? > > Note that I sent v2/v3, to fix the original problem in a different way, > and if that looks good to the QEMU maintainers, then we can all be happy > with that. But I *haven't* addressed and still don't fully understand > why the 62meg limit applied to my v1 in the way it does. Did you find a > bug there to fix? If so, please do CC me. > Yes, you yourself posted the problem: > Then build qemu. Run it with `-kernel bzImage`, based on the kernel > built with the .config I attached. > > You'll see that the CPU triple faults when hitting this line: > > sd = (struct setup_data *)boot_params->hdr.setup_data; > while (sd) { > unsigned long sd_addr = (unsigned long)sd; > > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <---- > sd = (struct setup_data *)sd->next; > } > > , because it dereferences *sd. This does not happen if the decompressed > size of the kernel is < 62 megs. > > So that's the "big and pretty serious" bug that might be worthy of > investigation. This needs to be something like: kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd)); kernel_add_identity_map(sd_addr + sizeof(*sd), sd_addr + sizeof(*sd) + sd->len); TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to very, very old kernels that used INT 15h, AH=88h to probe memory. -hpa
On 12/30/22 17:06, H. Peter Anvin wrote > > TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to > very, very old kernels that used INT 15h, AH=88h to probe memory. > I am 88% sure this was fixed long before setup_data was created, as it was created originally to carry e820 info for more than 128(!) memory segments. However, as we see here, it is never certain that bugs didn't creep in in the meantime... -hpa
On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote: > I'll attach a .config file. Apply the patch at the top of this thread to > qemu, Hmm, so the patch at the top of this thread is fixing the clobbering of setup_data. But I tried latest qemu with your .config and it boots ok here. So how do I repro the original issue you're trying to fix? Thx.
On Sat, Dec 31, 2022 at 10:48:16AM +0100, Borislav Petkov wrote: > On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote: > > I'll attach a .config file. Apply the patch at the top of this thread to > > qemu, > > Hmm, so the patch at the top of this thread is fixing the clobbering of > setup_data. > > But I tried latest qemu with your .config and it boots ok here. So how do I > repro the original issue you're trying to fix? Nothing special... `-kernel bzImage` should be enough to do it. Eric reported it, and then I was able to repro trivially. Sure you got the right version? Jason
On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote: > > > On 12/30/22 14:10, Jason A. Donenfeld wrote: > > On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote: > >> See the other thread fork. They have identified the problem already. > > > > Not sure I follow. Is there another thread where somebody worked out why > > this 62meg limit was happening? > > > > Note that I sent v2/v3, to fix the original problem in a different way, > > and if that looks good to the QEMU maintainers, then we can all be happy > > with that. But I *haven't* addressed and still don't fully understand > > why the 62meg limit applied to my v1 in the way it does. Did you find a > > bug there to fix? If so, please do CC me. > > > > Yes, you yourself posted the problem: > > > Then build qemu. Run it with `-kernel bzImage`, based on the kernel > > built with the .config I attached. > > > > You'll see that the CPU triple faults when hitting this line: > > > > sd = (struct setup_data *)boot_params->hdr.setup_data; > > while (sd) { > > unsigned long sd_addr = (unsigned long)sd; > > > > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <---- > > sd = (struct setup_data *)sd->next; > > } > > > > , because it dereferences *sd. This does not happen if the decompressed > > size of the kernel is < 62 megs. > > > > So that's the "big and pretty serious" bug that might be worthy of > > investigation. > > This needs to be something like: > > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd)); > kernel_add_identity_map(sd_addr + sizeof(*sd), > sd_addr + sizeof(*sd) + sd->len); > Oh, right, duh. Thanks for spelling it out. Jason
On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote: > Nothing special... `-kernel bzImage` should be enough to do it. Eric > reported it, and then I was able to repro trivially. Sure you got the > right version? Yeah, qemu executables confusion here - had wrongly something older of the version 7.1... Now made sure I'm actually booting with the latest qemu: QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf) With that the kernel with your config hangs early during boot and the stack trace is below. Seeing how it says trapnr 14, then that looks like something you are seeing. But lemme poke at it more. #0 0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57 #1 halt () at ./arch/x86/include/asm/irqflags.h:98 #2 early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340 #3 0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424 #4 0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483 #5 0x2404c74100000cd0 in ?? () #6 0xffffffffff20073c in ?? () #7 0x0000000000000010 in fixed_percpu_data () #8 0xdffffc0000000000 in ?? () #9 0xffffffff84007ea8 in init_thread_union () #10 0xffffffffff200cd0 in ?? () #11 0x0000000000000000 in ?? ()
On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote: > This needs to be something like: > > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd)); > kernel_add_identity_map(sd_addr + sizeof(*sd), > sd_addr + sizeof(*sd) + sd->len); It still #PFs with that: (gdb) bt #0 0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57 #1 halt () at ./arch/x86/include/asm/irqflags.h:98 #2 early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340 #3 0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424 #4 0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483 #5 0xc149f9894908788d in ?? () #6 0xffffffffff2003fc in ?? () #7 0x0000000000000010 in fixed_percpu_data () #8 0xdffffc0000000000 in ?? () #9 0xffffffff84007ea8 in init_thread_union () #10 0xffffffffff20088d in ?? () #11 0x0000000000000000 in ?? () /me goes to dig more.
On Sat, Dec 31, 2022 at 02:35:45PM +0100, Borislav Petkov wrote: > On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote: > > Nothing special... `-kernel bzImage` should be enough to do it. Eric > > reported it, and then I was able to repro trivially. Sure you got the > > right version? > > Yeah, qemu executables confusion here - had wrongly something older of the > version 7.1... > > Now made sure I'm actually booting with the latest qemu: > > QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf) > > With that the kernel with your config hangs early during boot and the stack > trace is below. > > Seeing how it says trapnr 14, then that looks like something you are seeing. > > But lemme poke at it more. Yes. The cause is what I've described in the commit message. There are two proposed fixes, the v1, which has the 62 MiB limitation due to a kernel bug, and the v3, which seems to work fine, and is simpler, and I suspect that's the one QEMU upstream should take. https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
On Sat, Dec 31, 2022 at 02:40:59PM +0100, Borislav Petkov wrote: > On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote: > > This needs to be something like: > > > > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd)); > > kernel_add_identity_map(sd_addr + sizeof(*sd), > > sd_addr + sizeof(*sd) + sd->len); > > It still #PFs with that: > > (gdb) bt > #0 0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57 > #1 halt () at ./arch/x86/include/asm/irqflags.h:98 > #2 early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340 > #3 0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424 > #4 0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483 > #5 0xc149f9894908788d in ?? () > #6 0xffffffffff2003fc in ?? () > #7 0x0000000000000010 in fixed_percpu_data () > #8 0xdffffc0000000000 in ?? () > #9 0xffffffff84007ea8 in init_thread_union () > #10 0xffffffffff20088d in ?? () > #11 0x0000000000000000 in ?? () > > /me goes to dig more. Are you using patch v1 minus the 62 MiB thing? If you haven't applied patch v1 and then removed the 62 MiB limitation in it, then you've misunderstood the conversation again. Please see my reproduction steps to Peter: https://lore.kernel.org/lkml/Y68K4mPuz6edQkCX@zx2c4.com/ Jason
On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote:
> Are you using patch v1 minus the 62 MiB thing?
No, I want to see the original failure - the one that prompted you to send
https://lore.kernel.org/r/20221228143831.396245-1-Jason@zx2c4.com
On Sat, Dec 31, 2022 at 02:48:12PM +0100, Borislav Petkov wrote: > On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote: > > Are you using patch v1 minus the 62 MiB thing? > > No, I want to see the original failure - the one that prompted you to send > > https://lore.kernel.org/r/20221228143831.396245-1-Jason@zx2c4.com That failure is unrelated to the ident mapping issue Peter and I discussed. The original failure is described in the commit message: decompression clobbers the data, so sd->next points to garbage.
On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote: > That failure is unrelated to the ident mapping issue Peter and > I discussed. The original failure is described in the commit message: > decompression clobbers the data, so sd->next points to garbage. Right, and the fact that the kernel overwrites it still feels kinda wrong: the kernel knows where setup_data is - the address is in the setup header so *actually*, it should take care of not to clobber it. But hpa would correct me if I'm missing an aspect about whose responsibility it is to do what here... Thx.
On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote: > On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote: > > That failure is unrelated to the ident mapping issue Peter and > > I discussed. The original failure is described in the commit message: > > decompression clobbers the data, so sd->next points to garbage. > > Right So with that understanding confirmed, I'm confused at your surprise that hpa's unrelated fix to the different issue didn't fix this issue. > and the fact that the kernel overwrites it still feels kinda wrong: the > kernel knows where setup_data is - the address is in the setup header so > *actually*, it should take care of not to clobber it. Yea, technically the bootloader could relocate all the setup_data links by copying them and updating ->next. This wouldn't be so hard to do. (Special care would have to be taken, though, to zero out SETUP_RNG_SEED, though, for forward secrecy and such.) But since the kernel doesn't do this now, and the 62MiB bug also seems to apply to existing kernels, for the purposes of QEMU for now, I think the v3 patch is probably best, since it'll handle existing kernels. Alternatively, setup_data could be relocated, the boot param protocol could be bumped, and then QEMU could conditionalized it's use of setup_data based on that protocol version. That'd work, but seems a bit more involved. So maybe for now, v3 works? Hopefully that looks like a correct approach to hpa, anyhow: https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ I think it should fit with what he described would work. Jason
On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote: > So with that understanding confirmed, I'm confused at your surprise that > hpa's unrelated fix to the different issue didn't fix this issue. No surprise there - I used a qemu variant without your patch to prevent the setup_data clobbering so hpa's fix can't help there. > But since the kernel doesn't do this now, and the 62MiB bug also seems > to apply to existing kernels, for the purposes of QEMU for now, I think > the v3 patch is probably best, since it'll handle existing kernels. Right, we can't fix all those guests which are out there. > Alternatively, setup_data could be relocated, the boot param protocol > could be bumped, and then QEMU could conditionalized it's use of > setup_data based on that protocol version. That'd work, but seems a bit > more involved. I think this is at least worth considering because the kernel overwriting setup_data after having been told where that setup_data is located is not really nice. Well not explicitly at least - it gets the pointer to the first element and something needs to traverse that list to know which addresses are live. But still, that info is there so perhaps we need to take setup_data into consideration too before decompressing...
On 12/31/22 11:00, Borislav Petkov wrote: > On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote: >> So with that understanding confirmed, I'm confused at your surprise that >> hpa's unrelated fix to the different issue didn't fix this issue. > > No surprise there - I used a qemu variant without your patch to prevent the > setup_data clobbering so hpa's fix can't help there. > >> But since the kernel doesn't do this now, and the 62MiB bug also seems >> to apply to existing kernels, for the purposes of QEMU for now, I think >> the v3 patch is probably best, since it'll handle existing kernels. > > Right, we can't fix all those guests which are out there. > >> Alternatively, setup_data could be relocated, the boot param protocol >> could be bumped, and then QEMU could conditionalized it's use of >> setup_data based on that protocol version. That'd work, but seems a bit >> more involved. > > I think this is at least worth considering because the kernel overwriting > setup_data after having been told where that setup_data is located is not really > nice. > > Well not explicitly at least - it gets the pointer to the first element and > something needs to traverse that list to know which addresses are live. But > still, that info is there so perhaps we need to take setup_data into > consideration too before decompressing... > As far as the decompression itself goes, it should only a problem if we are using physical KASLR since otherwise the kernel has a guaranteed safe zone already allocated by the boot loader. However, if physical KASLR is in use, then the decompressor needs to know everything there is to know about the memory map. However, there also seems to be some kind of interaction with AMD SEV-SNP. The bug appears to have been introduced by: b57feed2cc2622ae14b2fa62f19e973e5e0a60cf x86/compressed/64: Add identity mappings for setup_data entries https://lore.kernel.org/r/TYCPR01MB694815CD815E98945F63C99183B49@TYCPR01MB6948.jpnprd01.prod.outlook.com ... which was included in version 5.19, so it is relatively recent. For a small amount of setup_data, the solution of just putting it next to the command line makes a lot of sense, and should be safe indefinitely. -hpa
On 12/31/22 19:21, H. Peter Anvin wrote: >> >>> Alternatively, setup_data could be relocated, the boot param protocol >>> could be bumped, and then QEMU could conditionalized it's use of >>> setup_data based on that protocol version. That'd work, but seems a bit >>> more involved. >> >> I think this is at least worth considering because the kernel overwriting >> setup_data after having been told where that setup_data is located is >> not really >> nice. >> >> Well not explicitly at least - it gets the pointer to the first >> element and >> something needs to traverse that list to know which addresses are >> live. But >> still, that info is there so perhaps we need to take setup_data into >> consideration too before decompressing... >> It would probably be a good idea to add a "maximum physical address for initrd/setup_data/cmdline" field to struct kernel_info, though. It appears right now that those fields are being identity-mapped in the decompressor, and that means that if 48-bit addressing is used, physical memory may extend past the addressable range. -hpa
On 12/31/22 10:22, Jason A. Donenfeld wrote: > On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote: >> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote: >>> That failure is unrelated to the ident mapping issue Peter and >>> I discussed. The original failure is described in the commit message: >>> decompression clobbers the data, so sd->next points to garbage. >> >> Right > > So with that understanding confirmed, I'm confused at your surprise that > hpa's unrelated fix to the different issue didn't fix this issue. > If decompression does clobber the data, then we *also* need to figure out why that is. There are basically three possibilities: 1. If physical KASLR is NOT used: a. The boot loader doesn't honor the kernel safe area properly; b. Somewhere in the process a bug in the calculation of the kernel safe area has crept in. 2. If physical KASLR IS used: The decompressor doesn't correctly keep track of nor relocate all the keep-out zones before picking a target address. One is a bootloader bug, two is a kernel bug. My guess is (2) is the culprit, but (1b) should be checked, too. -hpa
On 1.1.2023 6.33, H. Peter Anvin wrote: > > > On 12/31/22 10:22, Jason A. Donenfeld wrote: >> On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote: >>> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote: >>>> That failure is unrelated to the ident mapping issue Peter and >>>> I discussed. The original failure is described in the commit message: >>>> decompression clobbers the data, so sd->next points to garbage. >>> >>> Right >> >> So with that understanding confirmed, I'm confused at your surprise that >> hpa's unrelated fix to the different issue didn't fix this issue. >> > > If decompression does clobber the data, then we *also* need to figure > out why that is. There are basically three possibilities: > > 1. If physical KASLR is NOT used: > > a. The boot loader doesn't honor the kernel safe area properly; > b. Somewhere in the process a bug in the calculation of the > kernel safe area has crept in. > > 2. If physical KASLR IS used: > > The decompressor doesn't correctly keep track of nor relocate > all the keep-out zones before picking a target address. Seems setup_data is not included in those mem_avoid regions. > > One is a bootloader bug, two is a kernel bug. My guess is (2) is the > culprit, but (1b) should be checked, too. > > -hpa > --Mika
On 12/31/22 20:55, Mika Penttilä wrote: >> >> If decompression does clobber the data, then we *also* need to figure >> out why that is. There are basically three possibilities: >> >> 1. If physical KASLR is NOT used: >> >> a. The boot loader doesn't honor the kernel safe area properly; >> b. Somewhere in the process a bug in the calculation of the >> kernel safe area has crept in. >> >> 2. If physical KASLR IS used: >> >> The decompressor doesn't correctly keep track of nor relocate >> all the keep-out zones before picking a target address. > > Seems setup_data is not included in those mem_avoid regions. > [facepalm] >> >> One is a bootloader bug, two is a kernel bugs. My guess is (2) is the >> culprit, but (1b) should be checked, too. >> Correction: two are kernel bugs, i.e. (1b) and (2) are both kernel bugs. -hpa
On Sat, Dec 31, 2022 at 07:21:06PM -0800, H. Peter Anvin wrote: > As far as the decompression itself goes, it should only a problem if we are > using physical KASLR since otherwise the kernel has a guaranteed safe zone > already allocated by the boot loader. However, if physical KASLR is in use, No KASLR in Jason's config AFAICT: $ grep RANDOMIZE .config CONFIG_ARCH_HAS_ELF_RANDOMIZE=y CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y CONFIG_RANDOMIZE_KSTACK_OFFSET=y # CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set > then the decompressor needs to know everything there is to know about the > memory map. Yeah, we do have that but as you folks establish later in the thread, those setup_data regions would need to be avoided too. ;-\ > However, there also seems to be some kind of interaction with AMD SEV-SNP. > > > The bug appears to have been introduced by: > > b57feed2cc2622ae14b2fa62f19e973e5e0a60cf > x86/compressed/64: Add identity mappings for setup_data entries > https://lore.kernel.org/r/TYCPR01MB694815CD815E98945F63C99183B49@TYCPR01MB6948.jpnprd01.prod.outlook.com > > ... which was included in version 5.19, so it is relatively recent. Right. We need that for the CC blob: b190a043c49a ("x86/sev: Add SEV-SNP feature detection/setup") > For a small amount of setup_data, the solution of just putting it next to > the command line makes a lot of sense, and should be safe indefinitely. Ok. Thx.
On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote: > It would probably be a good idea to add a "maximum physical address for > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears > right now that those fields are being identity-mapped in the decompressor, > and that means that if 48-bit addressing is used, physical memory may extend > past the addressable range. Yeah, we will probably need that too. Btw, looka here - it can't get any more obvious than that after dumping setup_data too: early console in setup code early console in extract_kernel input_data: 0x00000000040f92bf input_len: 0x0000000000f1c325 output: 0x0000000001000000 output_len: 0x0000000003c5e7d8 kernel_total_size: 0x0000000004428000 needed_size: 0x0000000004600000 boot_params->hdr.setup_data: 0x00000000010203b0 trampoline_32bit: 0x000000000009d000 Decompressing Linux... Parsing ELF... done. Booting the kernel. <EOF> Aligning them vertically: output: 0x0000000001000000 output_len: 0x0000000003c5e7d8 kernel_total_size: 0x0000000004428000 needed_size: 0x0000000004600000 boot_params->hdr.setup_data: 0x00000000010203b0
On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote: > On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote: > > It would probably be a good idea to add a "maximum physical address for > > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears > > right now that those fields are being identity-mapped in the decompressor, > > and that means that if 48-bit addressing is used, physical memory may extend > > past the addressable range. > > Yeah, we will probably need that too. > > Btw, looka here - it can't get any more obvious than that after dumping > setup_data too: > > early console in setup code > early console in extract_kernel > input_data: 0x00000000040f92bf > input_len: 0x0000000000f1c325 > output: 0x0000000001000000 > output_len: 0x0000000003c5e7d8 > kernel_total_size: 0x0000000004428000 > needed_size: 0x0000000004600000 > boot_params->hdr.setup_data: 0x00000000010203b0 > trampoline_32bit: 0x000000000009d000 > > Decompressing Linux... Parsing ELF... done. > Booting the kernel. > <EOF> > > Aligning them vertically: > > output: 0x0000000001000000 > output_len: 0x0000000003c5e7d8 > kernel_total_size: 0x0000000004428000 > needed_size: 0x0000000004600000 > boot_params->hdr.setup_data: 0x00000000010203b0 Ok, waait a minute: ============ ============ Field name: pref_address Type: read (reloc) Offset/size: 0x258/8 Protocol: 2.10+ ============ ============ This field, if nonzero, represents a preferred load address for the kernel. A relocating bootloader should attempt to load at this address if possible. A non-relocatable kernel will unconditionally move itself and to run at this address. so a kernel loader (qemu in this case) already knows where the kernel goes: boot_params->hdr.setup_data: 0x0000000001020450 boot_params->hdr.pref_address: 0x0000000001000000 ^^^^^^^^^^^^^^^^^ now, considering that same kernel loader (qemu) knows how big that kernel is: kernel_total_size: 0x0000000004428000 should that loader *not* put anything that the kernel will use in the range pref_addr + kernel_total_size ?
On Mon, 2 Jan 2023 at 07:17, Borislav Petkov <bp@alien8.de> wrote: > > On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote: > > On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote: > > > It would probably be a good idea to add a "maximum physical address for > > > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears > > > right now that those fields are being identity-mapped in the decompressor, > > > and that means that if 48-bit addressing is used, physical memory may extend > > > past the addressable range. > > > > Yeah, we will probably need that too. > > > > Btw, looka here - it can't get any more obvious than that after dumping > > setup_data too: > > > > early console in setup code > > early console in extract_kernel > > input_data: 0x00000000040f92bf > > input_len: 0x0000000000f1c325 > > output: 0x0000000001000000 > > output_len: 0x0000000003c5e7d8 > > kernel_total_size: 0x0000000004428000 > > needed_size: 0x0000000004600000 > > boot_params->hdr.setup_data: 0x00000000010203b0 > > trampoline_32bit: 0x000000000009d000 > > > > Decompressing Linux... Parsing ELF... done. > > Booting the kernel. > > <EOF> > > > > Aligning them vertically: > > > > output: 0x0000000001000000 > > output_len: 0x0000000003c5e7d8 > > kernel_total_size: 0x0000000004428000 > > needed_size: 0x0000000004600000 > > boot_params->hdr.setup_data: 0x00000000010203b0 > > Ok, waait a minute: > > ============ ============ > Field name: pref_address > Type: read (reloc) > Offset/size: 0x258/8 > Protocol: 2.10+ > ============ ============ > > This field, if nonzero, represents a preferred load address for the > kernel. A relocating bootloader should attempt to load at this > address if possible. > > A non-relocatable kernel will unconditionally move itself and to run > at this address. > > so a kernel loader (qemu in this case) already knows where the kernel goes: > > boot_params->hdr.setup_data: 0x0000000001020450 > boot_params->hdr.pref_address: 0x0000000001000000 > ^^^^^^^^^^^^^^^^^ > > now, considering that same kernel loader (qemu) knows how big that kernel is: > > kernel_total_size: 0x0000000004428000 > > should that loader *not* put anything that the kernel will use in the range > > pref_addr + kernel_total_size > This seems to be related to another issue that was discussed in the context of this change, but affecting EFI boot not legacy BIOS boot [0]. So, in a nutshell, we have the following pieces: - QEMU, which manages a directory of files and other data blobs, and exposes them via its fw_cfg interface. - SeaBIOS, which invokes the fw_cfg interface to load the 'kernel' blob at its preferred address - The boot code in the kernel, which interprets the various fields in the setup header to figure out where the compressed image lives etc So the problem here, which applies to SETUP_DTB as well as SETUP_RNG_SEED, is that the internal file representation of the kernel blob (which does not have an absolute address at this point, it's just a file in the fw_cfg filesystem) is augmented with: 1) setup_data linked-list entries carrying absolute addresses that are assumed to be valid once SeaBIOS loads the file to memory 2) DTB and/or RNG seed blobs appended to the compressed 'kernel' blob, but without updating that file's internal metadata Issue 1) is what broke EFI boot, given that EFI interprets the kernel blob as a PE/COFF image and hands it to the Loadimage() boot service, which has no awareness of boot_params or setup_data and so just ignores it and loads the image at an arbitrary address, resulting in setup_data absolute address values pointing to bogus places. It seems that now, we have another issue 2), where the fw_cfg view of the file size goes out of sync with the compressed image's own view of its size. As a fix for issue 1), we explored another solution, which was to allocate fixed areas in memory for the RNG seed, so that the absolute address added to setup_data is guaranteed to be correct regardless of where the compressed image is loaded, but that was shot down for other reasons, and we ended up enabling this feature only for legacy BIOS boot. But apparently, this approach has other issues so perhaps it is better to revisit that solution again. So instead of appending data to the compressed image and assuming that it will stay in place, create or extend a memory reservation elsewhere, and refer to its absolute address in setup_data.
On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote: > So instead of appending data to the compressed image and assuming that > it will stay in place, create or extend a memory reservation > elsewhere, and refer to its absolute address in setup_data. From my limited experience with all those boot protocols, I'd say hardcoding stuff is always a bad idea. But, we already more or less hardcode, or rather codify through the setup header contract how stuff needs to get accessed. And yeah, maybe specifying an absolute address and size for a blob of data and putting that address and size in the setup header so that all the parties involved are where what is, is probably better. But WTH do I know...
On Mon, 2 Jan 2023 at 14:37, Borislav Petkov <bp@alien8.de> wrote: > > On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote: > > So instead of appending data to the compressed image and assuming that > > it will stay in place, create or extend a memory reservation > > elsewhere, and refer to its absolute address in setup_data. > > From my limited experience with all those boot protocols, I'd say hardcoding > stuff is always a bad idea. But, we already more or less hardcode, or rather > codify through the setup header contract how stuff needs to get accessed. > > And yeah, maybe specifying an absolute address and size for a blob of data and > putting that address and size in the setup header so that all the parties > involved are where what is, is probably better. > Exactly. In the EFI case, this was problematic because we would need to introduce a new way to pass memory reservations between QEMU and the firmware. But I don't think that issue should affect legacy BIOS boot, and we could just reserve the memory in the E820 table AFAIK.
diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 78cc131926..628fd2b2e9 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -1077,6 +1077,36 @@ void x86_load_linux(X86MachineState *x86ms, } fclose(f); + /* If a setup_data is going to be used, make sure that the bootloader won't + * decompress into it and clobber those bytes. */ + if (dtb_filename || !legacy_no_rng_seed) { + uint32_t payload_offset = ldl_p(setup + 0x248); + uint32_t payload_length = ldl_p(setup + 0x24c); + uint32_t target_address = ldl_p(setup + 0x258); + uint32_t decompressed_length = ldl_p(kernel + payload_offset + payload_length - 4); + + uint32_t estimated_setup_data_length = 4096 * 16; + uint32_t start_setup_data = prot_addr + kernel_size; + uint32_t end_setup_data = start_setup_data + estimated_setup_data_length; + uint32_t start_target = target_address; + uint32_t end_target = target_address + decompressed_length; + + if ((start_setup_data >= start_target && start_setup_data < end_target) || + (end_setup_data >= start_target && end_setup_data < end_target)) { + uint32_t padded_size = target_address + decompressed_length - prot_addr; + + /* The early stage can't address past around 64 MB from the original + * mapping, so just give up in that case. */ + if (padded_size < 62 * 1024 * 1024) + kernel_size = padded_size; + else { + fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n"); + dtb_filename = NULL; + legacy_no_rng_seed = true; + } + } + } + /* append dtb to kernel */ if (dtb_filename) { if (protocol < 0x209) {
The setup_data links are appended to the compressed kernel image. Since the kernel image is typically loaded at 0x100000, setup_data lives at `0x100000 + compressed_size`, which does not get relocated during the kernel's boot process. The kernel typically decompresses the image starting at address 0x1000000 (note: there's one more zero there than the decompressed image above). This usually is fine for most kernels. However, if the compressed image is actually quite large, then setup_data will live at a `0x100000 + compressed_size` that extends into the decompressed zone at 0x1000000. In other words, if compressed_size is larger than `0x1000000 - 0x100000`, then the decompression step will clobber setup_data, resulting in crashes. Fix this by detecting that possibility, and if it occurs, put setup_data *after* the end of the decompressed kernel, so that it doesn't get clobbered. One caveat is that this only works for images less than around 64 megabytes, so just bail out in that case. This is unfortunate, but I don't currently have a way of fixing it. Cc: x86@kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- hw/i386/x86.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+)