Message ID | 20240613155506.811013916@goodmis.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Hey Steve, On 13.06.24 17:55, Steven Rostedt wrote: > Reserve unspecified location of physical memory from kernel command line > > Background: > > In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract > dmesg output and some other information when a crash happens in the field. > (This is only done when the user selects "Allow Google to collect data for > improving the system"). But there are cases when there's a bug that > requires more data to be retrieved to figure out what is happening. We would > like to increase the pstore size, either temporarily, or maybe even > permanently. The pstore on these devices are at a fixed location in RAM (as > the RAM is not cleared on soft reboots nor crashes). The location is chosen > by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. > There's a driver that queries for this to initialize the pstore for > ChromeOS: > > See drivers/platform/chrome/chromeos_pstore.c > > Problem: > > The problem is that, even though there's a process to change the kernel on > these systems, and is done regularly to install updates, the firmware is > updated much less frequently. Choosing the place in RAM also takes special > care, and may be in a different address for different boards. Updating the > size via firmware is a large effort and not something that many are willing > to do for a temporary pstore size change. (sorry for not commenting on earlier versions, I didn't see v1-v5 in my inbox) Do you have a "real" pstore on these systems that you could store non-volatile variables in, such as persistent UEFI variables? If so, you could create an actually persistent mapping for your trace pstore even across kernel version updates as a general mechanism to create reserved memblocks at fixed offsets. > Requirement: > > Need a way to reserve memory that will be at a consistent location for > every boot, if the kernel and system are the same. Does not need to work > if rebooting to a different kernel, or if the system can change the > memory layout between boots. > > The reserved memory can not be an hard coded address, as the same kernel / > command line needs to run on several different machines. The picked memory > reservation just needs to be the same for a given machine, but may be With KASLR is enabled, doesn't this approach break too often to be reliable enough for the data you want to extract? Picking up the idea above, with a persistent variable we could even make KASLR avoid that reserved pstore region in its search for a viable KASLR offset. Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
On Thu, 13 Jun 2024 18:54:12 +0200 Alexander Graf <graf@amazon.com> wrote: > > Do you have a "real" pstore on these systems that you could store > non-volatile variables in, such as persistent UEFI variables? If so, you > could create an actually persistent mapping for your trace pstore even > across kernel version updates as a general mechanism to create reserved > memblocks at fixed offsets. After implementing all this, I don't think I can use pstore for my purpose. pstore is a generic interface for persistent storage, and requires an interface to access it. From what I understand, it's not the place to just ask for an area of RAM. For this, I have a single patch that allows the tracing instance to use an area reserved by reserve_mem. reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace I've already tested this on qemu and a couple of chromebooks. It works well. > > > > Requirement: > > > > Need a way to reserve memory that will be at a consistent location for > > every boot, if the kernel and system are the same. Does not need to work > > if rebooting to a different kernel, or if the system can change the > > memory layout between boots. > > > > The reserved memory can not be an hard coded address, as the same kernel / > > command line needs to run on several different machines. The picked memory > > reservation just needs to be the same for a given machine, but may be > > > With KASLR is enabled, doesn't this approach break too often to be > reliable enough for the data you want to extract? > > Picking up the idea above, with a persistent variable we could even make > KASLR avoid that reserved pstore region in its search for a viable KASLR > offset. I think I was hit by it once in all my testing. For our use case, the few times it fails to map is not going to affect what we need this for at all. -- Steve
Hey Steve, On 13.06.24 19:12, Steven Rostedt wrote: > On Thu, 13 Jun 2024 18:54:12 +0200 > Alexander Graf <graf@amazon.com> wrote: > >> Do you have a "real" pstore on these systems that you could store >> non-volatile variables in, such as persistent UEFI variables? If so, you >> could create an actually persistent mapping for your trace pstore even >> across kernel version updates as a general mechanism to create reserved >> memblocks at fixed offsets. > After implementing all this, I don't think I can use pstore for my > purpose. pstore is a generic interface for persistent storage, and > requires an interface to access it. From what I understand, it's not > the place to just ask for an area of RAM. > > For this, I have a single patch that allows the tracing instance to use > an area reserved by reserve_mem. > > reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace > > I've already tested this on qemu and a couple of chromebooks. It works > well. I believe we're talking about 2 different things :). Let me rephrase a bit and make a concrete example. Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command line option. The kernel now comes up and allocates a random chunk of memory that - by (admittedly good) chance - may be at the same physical location as before. Let's assume it deemed 0x1000000 as a good offset. Let's now assume you're running on a UEFI system. There, you always have non-volatile storage available to you even in the pre-boot phase. That means the kernel could create a UEFI variable that says "12M:4096:trace -> 0x1000000". The pre-boot phase takes all these UEFI variables and marks them as reserved. When you finally reach your command line parsing logic for reserve_mem=, you can flip all reservations that were not on the command line back to normal memory. That way you have pretty much guaranteed persistent memory regions, even with KASLR changing your memory layout across boots. The nice thing is that the above is an extension of what you've already built: Systems with UEFI simply get better guarantees that their regions persist. > >> >>> Requirement: >>> >>> Need a way to reserve memory that will be at a consistent location for >>> every boot, if the kernel and system are the same. Does not need to work >>> if rebooting to a different kernel, or if the system can change the >>> memory layout between boots. >>> >>> The reserved memory can not be an hard coded address, as the same kernel / >>> command line needs to run on several different machines. The picked memory >>> reservation just needs to be the same for a given machine, but may be >> >> With KASLR is enabled, doesn't this approach break too often to be >> reliable enough for the data you want to extract? >> >> Picking up the idea above, with a persistent variable we could even make >> KASLR avoid that reserved pstore region in its search for a viable KASLR >> offset. > I think I was hit by it once in all my testing. For our use case, the > few times it fails to map is not going to affect what we need this for > at all. Once is pretty good. Do you know why? Also once out of how many runs? Is the randomness source not as random as it should be or are the number of bits for KASLR maybe so few on your target architecture that the odds of hitting anything become low? Do these same constraints hold true outside of your testing environment? Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
On Mon, 17 Jun 2024 09:07:29 +0200 Alexander Graf <graf@amazon.com> wrote: > Hey Steve, > > > I believe we're talking about 2 different things :). Let me rephrase a > bit and make a concrete example. > > Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command > line option. The kernel now comes up and allocates a random chunk of > memory that - by (admittedly good) chance - may be at the same physical > location as before. Let's assume it deemed 0x1000000 as a good offset. Note, it's not random. It picks from the top of available memory every time. But things can mess with it (see below). > > Let's now assume you're running on a UEFI system. There, you always have > non-volatile storage available to you even in the pre-boot phase. That > means the kernel could create a UEFI variable that says "12M:4096:trace > -> 0x1000000". The pre-boot phase takes all these UEFI variables and > marks them as reserved. When you finally reach your command line parsing > logic for reserve_mem=, you can flip all reservations that were not on > the command line back to normal memory. > > That way you have pretty much guaranteed persistent memory regions, even > with KASLR changing your memory layout across boots. > > The nice thing is that the above is an extension of what you've already > built: Systems with UEFI simply get better guarantees that their regions > persist. This could be an added feature, but it is very architecture specific, and would likely need architecture specific updates. > > > > > >> > >>> Requirement: > >>> > >>> Need a way to reserve memory that will be at a consistent location for > >>> every boot, if the kernel and system are the same. Does not need to work > >>> if rebooting to a different kernel, or if the system can change the > >>> memory layout between boots. > >>> > >>> The reserved memory can not be an hard coded address, as the same kernel / > >>> command line needs to run on several different machines. The picked memory > >>> reservation just needs to be the same for a given machine, but may be > >> > >> With KASLR is enabled, doesn't this approach break too often to be > >> reliable enough for the data you want to extract? > >> > >> Picking up the idea above, with a persistent variable we could even make > >> KASLR avoid that reserved pstore region in its search for a viable KASLR > >> offset. > > I think I was hit by it once in all my testing. For our use case, the > > few times it fails to map is not going to affect what we need this for > > at all. > > > Once is pretty good. Do you know why? Also once out of how many runs? Is > the randomness source not as random as it should be or are the number of > bits for KASLR maybe so few on your target architecture that the odds of > hitting anything become low? Do these same constraints hold true outside > of your testing environment? So I just ran it a hundred times in a loop. I added a patch to print the location of "_text". The loop was this: for i in `seq 100`; do ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot" sleep 25 done It searches dmesg for my added printk as well as the print of were the ring buffer was loaded in physical memory. It takes about 15 seconds to reboot, so I waited 25. The results are attached. I found that it was consistent 76 times, which means 1 out of 4 it's not. Funny enough, it broke whenever it loaded the kernel below 0x100000000. And then it would be off by a little. It was consistently at: 0x27d000000 And when it failed, it was at 0x27ce00000. Note, when I used the e820 tables to do this, I never saw a failure. My assumption is that when it is below 0x100000000, something else gets allocated causing this to get pushed down. As this code relies on memblock_phys_alloc() being consistent, if something gets allocated before it differently depending on where the kernel is, it can also move the location. A plugin to UEFI would mean that it would need to reserve the memory, and the code here will need to know where it is. We could always make the function reserve_mem() global and weak so that architectures can override it. -- Steve
[resend because Thunderbird decided to send the previous version as HTML :(] On 17.06.24 22:40, Steven Rostedt wrote: > On Mon, 17 Jun 2024 09:07:29 +0200 > Alexander Graf<graf@amazon.com> wrote: > >> Hey Steve, >> >> >> I believe we're talking about 2 different things :). Let me rephrase a >> bit and make a concrete example. >> >> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command >> line option. The kernel now comes up and allocates a random chunk of >> memory that - by (admittedly good) chance - may be at the same physical >> location as before. Let's assume it deemed 0x1000000 as a good offset. > Note, it's not random. It picks from the top of available memory every > time. But things can mess with it (see below). > >> Let's now assume you're running on a UEFI system. There, you always have >> non-volatile storage available to you even in the pre-boot phase. That >> means the kernel could create a UEFI variable that says "12M:4096:trace >> -> 0x1000000". The pre-boot phase takes all these UEFI variables and >> marks them as reserved. When you finally reach your command line parsing >> logic for reserve_mem=, you can flip all reservations that were not on >> the command line back to normal memory. >> >> That way you have pretty much guaranteed persistent memory regions, even >> with KASLR changing your memory layout across boots. >> >> The nice thing is that the above is an extension of what you've already >> built: Systems with UEFI simply get better guarantees that their regions >> persist. > This could be an added feature, but it is very architecture specific, > and would likely need architecture specific updates. It definitely would be an added feature, yes. But one that allows you to ensure persistence a lot more safely :). >>>>> Requirement: >>>>> >>>>> Need a way to reserve memory that will be at a consistent location for >>>>> every boot, if the kernel and system are the same. Does not need to work >>>>> if rebooting to a different kernel, or if the system can change the >>>>> memory layout between boots. >>>>> >>>>> The reserved memory can not be an hard coded address, as the same kernel / >>>>> command line needs to run on several different machines. The picked memory >>>>> reservation just needs to be the same for a given machine, but may be >>>> With KASLR is enabled, doesn't this approach break too often to be >>>> reliable enough for the data you want to extract? >>>> >>>> Picking up the idea above, with a persistent variable we could even make >>>> KASLR avoid that reserved pstore region in its search for a viable KASLR >>>> offset. >>> I think I was hit by it once in all my testing. For our use case, the >>> few times it fails to map is not going to affect what we need this for >>> at all. >> Once is pretty good. Do you know why? Also once out of how many runs? Is >> the randomness source not as random as it should be or are the number of >> bits for KASLR maybe so few on your target architecture that the odds of >> hitting anything become low? Do these same constraints hold true outside >> of your testing environment? > So I just ran it a hundred times in a loop. I added a patch to print > the location of "_text". The loop was this: > > for i in `seq 100`; do > ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot" > sleep 25 > done > > It searches dmesg for my added printk as well as the print of were the > ring buffer was loaded in physical memory. > > It takes about 15 seconds to reboot, so I waited 25. The results are > attached. I found that it was consistent 76 times, which means 1 out of > 4 it's not. Funny enough, it broke whenever it loaded the kernel below > 0x100000000. And then it would be off by a little. > > It was consistently at: > > 0x27d000000 > > And when it failed, it was at 0x27ce00000. > > Note, when I used the e820 tables to do this, I never saw a failure. My > assumption is that when it is below 0x100000000, something else gets > allocated causing this to get pushed down. Thinking about it again: What if you run the allocation super early (see arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to allocating only from top, you're effectively kernel version independent for your allocations because none of the kernel code ran yet and definitely KASLR independent because you're running deterministically before KASLR even gets allocated. > As this code relies on memblock_phys_alloc() being consistent, if > something gets allocated before it differently depending on where the > kernel is, it can also move the location. A plugin to UEFI would mean > that it would need to reserve the memory, and the code here will need > to know where it is. We could always make the function reserve_mem() > global and weak so that architectures can override it. Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new type of memblock with the respective reservations and you later call memblock_find_in_range_node() instead of memblock_phys_alloc() to pass in flags that you want to allocate only from the new MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots through the handle_mem_options() path above. In fact, if the BIOS way works fine, we don't even need UEFI variables: The same way allocations will be identical during BIOS execution, they should stay identical across UEFI launches. As cherry on top, kexec also works seamlessly with the special memblock approach because kexec (at least on x86) hands memblocks as is to the next kernel. So the new kernel will also automatically use the same ranges for its allocations. Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
On Mon, 17 Jun 2024 23:01:12 +0200 Alexander Graf <graf@amazon.com> wrote: > > This could be an added feature, but it is very architecture specific, > > and would likely need architecture specific updates. > > > It definitely would be an added feature, yes. But one that allows you to > ensure persistence a lot more safely :). Sure. > > Thinking about it again: What if you run the allocation super early (see > arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to > allocating only from top, you're effectively kernel version independent > for your allocations because none of the kernel code ran yet and > definitely KASLR independent because you're running deterministically > before KASLR even gets allocated. > > > As this code relies on memblock_phys_alloc() being consistent, if > > something gets allocated before it differently depending on where the > > kernel is, it can also move the location. A plugin to UEFI would mean > > that it would need to reserve the memory, and the code here will need > > to know where it is. We could always make the function reserve_mem() > > global and weak so that architectures can override it. > > > Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new > type of memblock with the respective reservations and you later call > memblock_find_in_range_node() instead of memblock_phys_alloc() to pass > in flags that you want to allocate only from the new > MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots > through the handle_mem_options() path above. In fact, if the BIOS way > works fine, we don't even need UEFI variables: The same way allocations > will be identical during BIOS execution, they should stay identical > across UEFI launches. > > As cherry on top, kexec also works seamlessly with the special memblock > approach because kexec (at least on x86) hands memblocks as is to the > next kernel. So the new kernel will also automatically use the same > ranges for its allocations. I'm all for expanding this. But I would just want to get this in for now as is. It theoretically works on all architectures. If someone wants to make in more robust and accurate on a specific architecture, I'm all for it. Like I said, we could make the reserver_mem() function global and weak, and then if an architecture has a better way to handle this, it could use that. Hmm, x86 could do this with the e820 code like I did in my first versions. Like I said, it didn't fail at all with that. And we can have an UEFI version as well. -- Steve
On Mon, 17 Jun 2024 at 23:01, Alexander Graf <graf@amazon.com> wrote: > > On 17.06.24 22:40, Steven Rostedt wrote: > > On Mon, 17 Jun 2024 09:07:29 +0200 > > Alexander Graf<graf@amazon.com> wrote: > > > >> Hey Steve, > >> > >> > >> I believe we're talking about 2 different things :). Let me rephrase a > >> bit and make a concrete example. > >> > >> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command > >> line option. The kernel now comes up and allocates a random chunk of > >> memory that - by (admittedly good) chance - may be at the same physical > >> location as before. Let's assume it deemed 0x1000000 as a good offset. > > Note, it's not random. It picks from the top of available memory every > > time. But things can mess with it (see below). > > > >> Let's now assume you're running on a UEFI system. There, you always have > >> non-volatile storage available to you even in the pre-boot phase. That > >> means the kernel could create a UEFI variable that says "12M:4096:trace > >> -> 0x1000000". The pre-boot phase takes all these UEFI variables and > >> marks them as reserved. When you finally reach your command line parsing > >> logic for reserve_mem=, you can flip all reservations that were not on > >> the command line back to normal memory. > >> > >> That way you have pretty much guaranteed persistent memory regions, even > >> with KASLR changing your memory layout across boots. > >> > >> The nice thing is that the above is an extension of what you've already > >> built: Systems with UEFI simply get better guarantees that their regions > >> persist. > > This could be an added feature, but it is very architecture specific, > > and would likely need architecture specific updates. > > > It definitely would be an added feature, yes. But one that allows you to > ensure persistence a lot more safely :). > > > >>>>> Requirement: > >>>>> > >>>>> Need a way to reserve memory that will be at a consistent location for > >>>>> every boot, if the kernel and system are the same. Does not need to work > >>>>> if rebooting to a different kernel, or if the system can change the > >>>>> memory layout between boots. > >>>>> > >>>>> The reserved memory can not be an hard coded address, as the same kernel / > >>>>> command line needs to run on several different machines. The picked memory > >>>>> reservation just needs to be the same for a given machine, but may be > >>>> With KASLR is enabled, doesn't this approach break too often to be > >>>> reliable enough for the data you want to extract? > >>>> > >>>> Picking up the idea above, with a persistent variable we could even make > >>>> KASLR avoid that reserved pstore region in its search for a viable KASLR > >>>> offset. > >>> I think I was hit by it once in all my testing. For our use case, the > >>> few times it fails to map is not going to affect what we need this for > >>> at all. > >> Once is pretty good. Do you know why? Also once out of how many runs? Is > >> the randomness source not as random as it should be or are the number of > >> bits for KASLR maybe so few on your target architecture that the odds of > >> hitting anything become low? Do these same constraints hold true outside > >> of your testing environment? > > So I just ran it a hundred times in a loop. I added a patch to print > > the location of "_text". The loop was this: > > > > for i in `seq 100`; do > > ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot" > > sleep 25 > > done > > > > It searches dmesg for my added printk as well as the print of were the > > ring buffer was loaded in physical memory. > > > > It takes about 15 seconds to reboot, so I waited 25. The results are > > attached. I found that it was consistent 76 times, which means 1 out of > > 4 it's not. Funny enough, it broke whenever it loaded the kernel below > > 0x100000000. And then it would be off by a little. > > > > It was consistently at: > > > > 0x27d000000 > > > > And when it failed, it was at 0x27ce00000. > > > > Note, when I used the e820 tables to do this, I never saw a failure. My > > assumption is that when it is below 0x100000000, something else gets > > allocated causing this to get pushed down. > > > Thinking about it again: What if you run the allocation super early (see > arch/x86/boot/compressed/kaslr.c:handle_mem_options())? That code is not used by EFI boot anymore. In general, I would recommend (and have recommended) against these kinds of hacks in mainline, because -especially on x86- there is always someone that turns up with some kind of convoluted use case that gets broken if we try to change anything in the boot code. I spent considerable time over the past year making the EFI/x86 boot code compatible with the new MS imposed requirements on PC boot firmware (related to secure boot and NX restrictions on memory mappings). This involved some radical refactoring of the boot sequence, including the KASLR logic. Adding fragile code there that will result in regressions observable to end users when it gets broken is really not what I'd like to see. So I would personally prefer for this code not to go in at all. But if it does go in (and Steven has already agreed to this), it needs a giant disclaimer that it is best effort and may get broken inadvertently by changes that may seem unrelated. > If you stick to > allocating only from top, you're effectively kernel version independent > for your allocations because none of the kernel code ran yet and > definitely KASLR independent because you're running deterministically > before KASLR even gets allocated. > Allocating top down under EFI is almost guaranteed to result in problems, because that is how the EFI page allocator works as well. This means that allocations will move around depending on, e.g., whether some USB stick was inserted on the first boot and removed on the second, or whether your external display was on or off. > > As this code relies on memblock_phys_alloc() being consistent, if > > something gets allocated before it differently depending on where the > > kernel is, it can also move the location. A plugin to UEFI would mean > > that it would need to reserve the memory, and the code here will need > > to know where it is. We could always make the function reserve_mem() > > global and weak so that architectures can override it. > > > Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new > type of memblock with the respective reservations and you later call > memblock_find_in_range_node() instead of memblock_phys_alloc() to pass > in flags that you want to allocate only from the new > MEMBLOCK_RESERVE_MEM type. This would require the introduction of a new kind of memory map for the handover between EFI stub and core kernel. Currently, x86 relies on E820 and other architectures pass the EFI memory map unmodified.
On 18.06.24 12:21, Ard Biesheuvel wrote: > On Mon, 17 Jun 2024 at 23:01, Alexander Graf <graf@amazon.com> wrote: >> On 17.06.24 22:40, Steven Rostedt wrote: >>> On Mon, 17 Jun 2024 09:07:29 +0200 >>> Alexander Graf<graf@amazon.com> wrote: >>> >>>> Hey Steve, >>>> >>>> >>>> I believe we're talking about 2 different things :). Let me rephrase a >>>> bit and make a concrete example. >>>> >>>> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command >>>> line option. The kernel now comes up and allocates a random chunk of >>>> memory that - by (admittedly good) chance - may be at the same physical >>>> location as before. Let's assume it deemed 0x1000000 as a good offset. >>> Note, it's not random. It picks from the top of available memory every >>> time. But things can mess with it (see below). >>> >>>> Let's now assume you're running on a UEFI system. There, you always have >>>> non-volatile storage available to you even in the pre-boot phase. That >>>> means the kernel could create a UEFI variable that says "12M:4096:trace >>>> -> 0x1000000". The pre-boot phase takes all these UEFI variables and >>>> marks them as reserved. When you finally reach your command line parsing >>>> logic for reserve_mem=, you can flip all reservations that were not on >>>> the command line back to normal memory. >>>> >>>> That way you have pretty much guaranteed persistent memory regions, even >>>> with KASLR changing your memory layout across boots. >>>> >>>> The nice thing is that the above is an extension of what you've already >>>> built: Systems with UEFI simply get better guarantees that their regions >>>> persist. >>> This could be an added feature, but it is very architecture specific, >>> and would likely need architecture specific updates. >> >> It definitely would be an added feature, yes. But one that allows you to >> ensure persistence a lot more safely :). >> >> >>>>>>> Requirement: >>>>>>> >>>>>>> Need a way to reserve memory that will be at a consistent location for >>>>>>> every boot, if the kernel and system are the same. Does not need to work >>>>>>> if rebooting to a different kernel, or if the system can change the >>>>>>> memory layout between boots. >>>>>>> >>>>>>> The reserved memory can not be an hard coded address, as the same kernel / >>>>>>> command line needs to run on several different machines. The picked memory >>>>>>> reservation just needs to be the same for a given machine, but may be >>>>>> With KASLR is enabled, doesn't this approach break too often to be >>>>>> reliable enough for the data you want to extract? >>>>>> >>>>>> Picking up the idea above, with a persistent variable we could even make >>>>>> KASLR avoid that reserved pstore region in its search for a viable KASLR >>>>>> offset. >>>>> I think I was hit by it once in all my testing. For our use case, the >>>>> few times it fails to map is not going to affect what we need this for >>>>> at all. >>>> Once is pretty good. Do you know why? Also once out of how many runs? Is >>>> the randomness source not as random as it should be or are the number of >>>> bits for KASLR maybe so few on your target architecture that the odds of >>>> hitting anything become low? Do these same constraints hold true outside >>>> of your testing environment? >>> So I just ran it a hundred times in a loop. I added a patch to print >>> the location of "_text". The loop was this: >>> >>> for i in `seq 100`; do >>> ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot" >>> sleep 25 >>> done >>> >>> It searches dmesg for my added printk as well as the print of were the >>> ring buffer was loaded in physical memory. >>> >>> It takes about 15 seconds to reboot, so I waited 25. The results are >>> attached. I found that it was consistent 76 times, which means 1 out of >>> 4 it's not. Funny enough, it broke whenever it loaded the kernel below >>> 0x100000000. And then it would be off by a little. >>> >>> It was consistently at: >>> >>> 0x27d000000 >>> >>> And when it failed, it was at 0x27ce00000. >>> >>> Note, when I used the e820 tables to do this, I never saw a failure. My >>> assumption is that when it is below 0x100000000, something else gets >>> allocated causing this to get pushed down. >> >> Thinking about it again: What if you run the allocation super early (see >> arch/x86/boot/compressed/kaslr.c:handle_mem_options())? > That code is not used by EFI boot anymore. > > In general, I would recommend (and have recommended) against these > kinds of hacks in mainline, because -especially on x86- there is > always someone that turns up with some kind of convoluted use case > that gets broken if we try to change anything in the boot code. > > I spent considerable time over the past year making the EFI/x86 boot > code compatible with the new MS imposed requirements on PC boot > firmware (related to secure boot and NX restrictions on memory > mappings). This involved some radical refactoring of the boot > sequence, including the KASLR logic. Adding fragile code there that > will result in regressions observable to end users when it gets broken > is really not what I'd like to see. > > So I would personally prefer for this code not to go in at all. But if > it does go in (and Steven has already agreed to this), it needs a > giant disclaimer that it is best effort and may get broken > inadvertently by changes that may seem unrelated. Alright, happy to rest my case about making it more reliable for now then :). IMHO the big fat disclaimer should be in the argument name. "reserve_mem" to me sounds like it actually guarantees a reservation - which it doesn't. Can we name it more along the lines of "debug" (to indicate it's not for production data) or "phoenix" (usually gets reborn out of ashes, but you can never know for sure): "debug_mem", / "phoenix_mem"? > >> If you stick to >> allocating only from top, you're effectively kernel version independent >> for your allocations because none of the kernel code ran yet and >> definitely KASLR independent because you're running deterministically >> before KASLR even gets allocated. >> > Allocating top down under EFI is almost guaranteed to result in > problems, because that is how the EFI page allocator works as well. > This means that allocations will move around depending on, e.g., > whether some USB stick was inserted on the first boot and removed on > the second, or whether your external display was on or off. I believe most UEFI implementations only allocate top down in the lower 32bits. But yes, it's fragile, I hear you. Let's embrace the flaky nature of the beast then :). Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
On Tue, 18 Jun 2024 13:47:49 +0200 Alexander Graf <graf@amazon.com> wrote: > IMHO the big fat disclaimer should be in the argument name. > "reserve_mem" to me sounds like it actually guarantees a reservation - > which it doesn't. Can we name it more along the lines of "debug" (to > indicate it's not for production data) or "phoenix" (usually gets reborn > out of ashes, but you can never know for sure): "debug_mem", / > "phoenix_mem"? I don't see any reason it will not reserve memory. That doesn't seem to be the issue. What is not guaranteed is that it will be in the same location as last time. So the name is correct. It's not "reserve_consistent_memory" ;-) -- Steve
On Tue, Jun 18, 2024 at 12:21:19PM +0200, Ard Biesheuvel wrote: > On Mon, 17 Jun 2024 at 23:01, Alexander Graf <graf@amazon.com> wrote: > > So I would personally prefer for this code not to go in at all. But if > it does go in (and Steven has already agreed to this), it needs a > giant disclaimer that it is best effort and may get broken > inadvertently by changes that may seem unrelated. I think that reserve_mem= is way less intrusive than memmap= we anyway have. It will reserve memory and the documentation adequately warns that the location of that memory might be at different physical addresses.
From: Mike Rapoport (IBM) <rppt@kernel.org> On Thu, 13 Jun 2024 11:55:06 -0400, Steven Rostedt wrote: > Reserve unspecified location of physical memory from kernel command line > > Background: > > In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract > dmesg output and some other information when a crash happens in the field. > (This is only done when the user selects "Allow Google to collect data for > improving the system"). But there are cases when there's a bug that > requires more data to be retrieved to figure out what is happening. We would > like to increase the pstore size, either temporarily, or maybe even > permanently. The pstore on these devices are at a fixed location in RAM (as > the RAM is not cleared on soft reboots nor crashes). The location is chosen > by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. > There's a driver that queries for this to initialize the pstore for > ChromeOS: > > [...] Applied to for-next branch of memblock.git tree, thanks! [0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332 [1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332 [2/2] pstore/ramoops: Add ramoops.mem_name= command line option commit: d9d814eebb1ae9742e7fd7f39730653b16326bd4 tree: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock branch: for-next -- Sincerely yours, Mike.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ce7de8136f2f..56e18b1a520d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5717,9 +5717,11 @@ used for systems that do not wipe the RAM, and this command line will try to reserve the same physical memory on soft reboots. Note, it is not guaranteed to be the same - location. For example, if KASLR places the kernel at the - location of where the RAM reservation was from a previous - boot, the new reservation will be at a different location. + location. For example, if anything about the system changes + or if booting a different kernel. It can also fail if KASLR + places the kernel at the location of where the RAM reservation + was from a previous boot, the new reservation will be at a + different location. Any subsystem using this feature must add a way to verify that the contents of the physical memory is from a previous boot, as there may be cases where the memory will not be diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..6f534a707b2a 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify if the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command + line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line + parameter. The address and size will be defined by the ``reserve_mem`` + parameter. Note, that ``reserve_mem`` may not always allocate memory + in the same location, and cannot be relied upon. Testing will need + to be done, and it may not work on every machine, nor every kernel. + Consider this a "best effort" approach. The ``reserve_mem`` option + takes a size, alignment and name as arguments. The name is used + to map the memory to a label that can be retrieved by ramoops. + + reserver_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/mm/memblock.c b/mm/memblock.c index 739d106a9165..b7b0e8c3868d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2301,7 +2301,7 @@ EXPORT_SYMBOL_GPL(reserve_mem_find_by_name); */ static int __init reserve_mem(char *p) { - phys_addr_t start, size, align; + phys_addr_t start, size, align, tmp; char *name; char *oldp; int len; @@ -2347,8 +2347,8 @@ static int __init reserve_mem(char *p) if (!*p) return -EINVAL; - /* Make sure the name is not already used (size is only updated if found) */ - if (reserve_mem_find_by_name(name, &start, &size)) + /* Make sure the name is not already used */ + if (reserve_mem_find_by_name(name, &start, &tmp)) return -EBUSY; start = memblock_phys_alloc(size, align);