Message ID | 20180718024944.577-5-bhe@redhat.com (mailing list archive)
---|---
State | New, archived
Delegated to: | Bjorn Helgaas
On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:

> For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to load kernel/initrd/purgatory is supposed to be allocated from top to down. This is what we have been doing all along in the old kexec loading interface, and kexec loading is still the default in some distributions. However, the current kexec_file loading interface doesn't behave like this. The function arch_kexec_walk_mem() it calls ignores kexec_buf.top_down and instead calls walk_system_ram_res() directly to go through all System RAM resources from bottom to top, trying to find a memory region which can contain the specific kexec buffer, then calls locate_mem_hole_callback() to allocate memory in that found region from top to down. This brings confusion, especially now that KASLR is widely supported: users have to work out why the kexec/kdump kernel loading position differs between these two interfaces in order to rule out unnecessary noise. Hence these two interfaces need to be unified in behaviour.

As far as I can tell, the above is the whole reason for the patchset, yes? To avoid confusing users.

Is that sufficient? Can we instead simplify their lives by providing better documentation or informative printks or better Kconfig text, etc?

And who *are* the people who are performing this configuration? Random system administrators? Linux distro engineers? If the latter then they presumably aren't easily confused!

In other words, I'm trying to understand how much benefit this patchset will provide to our users as a whole.
Hi Andrew,

On 07/18/18 at 03:33pm, Andrew Morton wrote:
> On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:
> >
> > For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to load kernel/initrd/purgatory is supposed to be allocated from top to down. This is what we have been doing all along in the old kexec loading interface, and kexec loading is still the default in some distributions. However, the current kexec_file loading interface doesn't behave like this. The function arch_kexec_walk_mem() it calls ignores kexec_buf.top_down and instead calls walk_system_ram_res() directly to go through all System RAM resources from bottom to top, trying to find a memory region which can contain the specific kexec buffer, then calls locate_mem_hole_callback() to allocate memory in that found region from top to down. This brings confusion, especially now that KASLR is widely supported: users have to work out why the kexec/kdump kernel loading position differs between these two interfaces in order to rule out unnecessary noise. Hence these two interfaces need to be unified in behaviour.
>
> As far as I can tell, the above is the whole reason for the patchset, yes? To avoid confusing users.

In fact, it's not just about avoiding user confusion. Kexec loading and kexec_file loading do the same thing in essence; we only had to port the kexec loading code into the kernel because kernel image verification needs to be done on UEFI systems.

Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.

At the first post, I posted the version below with AKASHI's walk_system_ram_res_rev(). Later you suggested using list_head to link a resource's children and siblings, to see what the code change looks like.
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Then I posted v2:
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
Rob Herring mentioned that other components which have this tree structure have planned to do the same thing, replacing the singly linked list with list_head to link a resource's children and siblings. I quote Rob's words below; I think this could be another reason.

~~~~~ From Rob
The DT struct device_node also has the same tree structure with parent, child, sibling pointers and converting to list_head had been on the todo list for a while. ACPI also has some tree walking functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a common tree struct and helpers defined either on top of list_head or a new struct if that saves some size.
~~~~~

> Is that sufficient? Can we instead simplify their lives by providing better documentation or informative printks or better Kconfig text, etc?
>
> And who *are* the people who are performing this configuration? Random system administrators? Linux distro engineers? If the latter then they presumably aren't easily confused!

Kexec was invented for kernel developers to speed up kernel rebooting. Now high-end server admins, kernel developers, and QE are also keen to use it to reboot large boxes for faster feature testing and bug debugging. Kernel developers may know the kernel loading position well; admins or QE might not be so aware of it.

> In other words, I'm trying to understand how much benefit this patchset will provide to our users as a whole.

Understood. The list_head replacement patch truly involves too many code changes; it's risky. I am willing to try any idea from reviewers, and won't insist that they be accepted in the end. If we don't try, we don't know what it looks like and what impact it may have. I am fine with taking AKASHI's simple version of walk_system_ram_res_rev() to lower the risk, even though it could be a little less efficient.

Thanks
Baoquan
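A minimal sketch of the list_head conversion discussed above, for orientation. The first struct matches the kernel of this era; the converted variant is illustrative, not code from a merged patch:

/* Today: struct resource links its tree through raw pointers, so the
 * sibling chain can only be walked forward:
 *
 *	for (res = root->child; res; res = res->sibling)
 *		...
 */
struct resource {
	resource_size_t start;
	resource_size_t end;
	const char *name;
	unsigned long flags;
	unsigned long desc;
	struct resource *parent, *sibling, *child;
};

/* Proposed direction (field names illustrative): list_head linkage,
 * which makes a reverse walk natural:
 *
 *	list_for_each_entry_reverse(res, &root->child, sibling)
 *		...
 */
struct resource_converted {
	resource_size_t start;
	resource_size_t end;
	const char *name;
	unsigned long flags;
	unsigned long desc;
	struct resource_converted *parent;
	struct list_head sibling;	/* entry in parent's child list */
	struct list_head child;		/* head of this node's children */
};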
On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote:

> Hi Andrew,
>
> On 07/18/18 at 03:33pm, Andrew Morton wrote:
> > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:
> > >
> > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to load kernel/initrd/purgatory is supposed to be allocated from top to down. This is what we have been doing all along in the old kexec loading interface, and kexec loading is still the default in some distributions. However, the current kexec_file loading interface doesn't behave like this. The function arch_kexec_walk_mem() it calls ignores kexec_buf.top_down and instead calls walk_system_ram_res() directly to go through all System RAM resources from bottom to top, trying to find a memory region which can contain the specific kexec buffer, then calls locate_mem_hole_callback() to allocate memory in that found region from top to down. This brings confusion, especially now that KASLR is widely supported: users have to work out why the kexec/kdump kernel loading position differs between these two interfaces in order to rule out unnecessary noise. Hence these two interfaces need to be unified in behaviour.
> >
> > As far as I can tell, the above is the whole reason for the patchset, yes? To avoid confusing users.
>
> In fact, it's not just about avoiding user confusion. Kexec loading and kexec_file loading do the same thing in essence; we only had to port the kexec loading code into the kernel because kernel image verification needs to be done on UEFI systems.
>
> Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
>
> At the first post, I posted the version below with AKASHI's walk_system_ram_res_rev(). Later you suggested using list_head to link a resource's children and siblings, to see what the code change looks like.
> http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com
>
> Then I posted v2:
> http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
> Rob Herring mentioned that other components which have this tree structure have planned to do the same thing, replacing the singly linked list with list_head to link a resource's children and siblings. I quote Rob's words below; I think this could be another reason.
>
> ~~~~~ From Rob
> The DT struct device_node also has the same tree structure with parent, child, sibling pointers and converting to list_head had been on the todo list for a while. ACPI also has some tree walking functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a common tree struct and helpers defined either on top of list_head or a new struct if that saves some size.
> ~~~~~

Please let's get all this into the changelogs?

> > Is that sufficient? Can we instead simplify their lives by providing better documentation or informative printks or better Kconfig text, etc?
> >
> > And who *are* the people who are performing this configuration? Random system administrators? Linux distro engineers? If the latter then they presumably aren't easily confused!
>
> Kexec was invented for kernel developers to speed up kernel rebooting. Now high-end server admins, kernel developers, and QE are also keen to use it to reboot large boxes for faster feature testing and bug debugging. Kernel developers may know the kernel loading position well; admins or QE might not be so aware of it.
>
> > In other words, I'm trying to understand how much benefit this patchset will provide to our users as a whole.
>
> Understood. The list_head replacement patch truly involves too many code changes; it's risky. I am willing to try any idea from reviewers, and won't insist that they be accepted in the end. If we don't try, we don't know what it looks like and what impact it may have. I am fine with taking AKASHI's simple version of walk_system_ram_res_rev() to lower the risk, even though it could be a little less efficient.

The larger patch produces a better result. We can handle it ;)
On Thu 19-07-18 23:17:53, Baoquan He wrote:
> Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.

I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
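For reference, a paraphrased and condensed sketch (not verbatim kernel code) of the warning path Michal points at in mm/memblock.c:memblock_find_in_range_node(): when a bottom-up attempt fails, memblock falls back to a top-down search, which may land in hotpluggable memory, hence the warning:

	/* bottom-up is only tried above the kernel image */
	if (memblock_bottom_up() && end > kernel_end) {
		phys_addr_t bottom_up_start = max(start, kernel_end);

		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
						      size, align, nid, flags);
		if (ret)
			return ret;

		/* the top-down fallback may reserve hotpluggable memory */
		WARN_ONCE(1, "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
	}

	return __memblock_find_range_top_down(start, end, size, align,
					      nid, flags);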
Hi Andrew,

On 07/19/18 at 12:44pm, Andrew Morton wrote:
> On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote:
> > > As far as I can tell, the above is the whole reason for the patchset, yes? To avoid confusing users.
> >
> > In fact, it's not just about avoiding user confusion. Kexec loading and kexec_file loading do the same thing in essence; we only had to port the kexec loading code into the kernel because kernel image verification needs to be done on UEFI systems.
> >
> > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> >
> > At the first post, I posted the version below with AKASHI's walk_system_ram_res_rev(). Later you suggested using list_head to link a resource's children and siblings, to see what the code change looks like.
> > http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com
> >
> > Then I posted v2:
> > http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
> > Rob Herring mentioned that other components which have this tree structure have planned to do the same thing, replacing the singly linked list with list_head to link a resource's children and siblings. I quote Rob's words below; I think this could be another reason.
> >
> > ~~~~~ From Rob
> > The DT struct device_node also has the same tree structure with parent, child, sibling pointers and converting to list_head had been on the todo list for a while. ACPI also has some tree walking functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a common tree struct and helpers defined either on top of list_head or a new struct if that saves some size.
> > ~~~~~
>
> Please let's get all this into the changelogs?

Sorry for the late reply; I was dealing with some urgent customer hotplug issues. While rewriting all the change logs and the cover letter, I found I was wrong about the 2nd reason. The current kexec_file_load calls kexec_locate_mem_hole() to go through all System RAM regions; if one region is larger than the size of the kernel or initrd, it will search for a position in that region from top to down. Since kexec will jump to the 2nd kernel and doesn't need to care about the 1st kernel's data, we can always find usable space to load the kexec kernel/initrd under 4G. So the only reason for this patch is to keep consistent with kexec_load and avoid confusion.

And since x86 5-level paging mode has been added, we have another issue with top-down searching across the whole of system RAM: we support dynamic switching between 4-level and 5-level paging. Namely, with a kernel compiled with 5-level support, 'no5lvl' can be added to force 4-level paging. Consider jumping from a 5-level kernel to a 4-level kernel: we might load the kexec kernel at the top of system RAM in 5-level paging mode, above 64TB, then try to jump to a 4-level kernel whose upper limit is 64TB. For this case, we need to add a limit for kexec kernel loading when running a 5-level kernel.

All this mess makes me hesitant to choose an approach. Maybe I should drop this patchset.

> > > Is that sufficient? Can we instead simplify their lives by providing better documentation or informative printks or better Kconfig text, etc?
> > >
> > > And who *are* the people who are performing this configuration? Random system administrators? Linux distro engineers? If the latter then they presumably aren't easily confused!
> >
> > Kexec was invented for kernel developers to speed up kernel rebooting. Now high-end server admins, kernel developers, and QE are also keen to use it to reboot large boxes for faster feature testing and bug debugging. Kernel developers may know the kernel loading position well; admins or QE might not be so aware of it.
> >
> > > In other words, I'm trying to understand how much benefit this patchset will provide to our users as a whole.
> >
> > Understood. The list_head replacement patch truly involves too many code changes; it's risky. I am willing to try any idea from reviewers, and won't insist that they be accepted in the end. If we don't try, we don't know what it looks like and what impact it may have. I am fine with taking AKASHI's simple version of walk_system_ram_res_rev() to lower the risk, even though it could be a little less efficient.
>
> The larger patch produces a better result. We can handle it ;)

For this issue, if we stop changing the kexec top-down searching code, I am not sure whether we should post the list_head replacement patches separately.

Thanks
Baoquan
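To illustrate the 64TB concern with a hypothetical sketch: with 4-level paging, x86_64 can address at most 1 << 46 bytes (64TB) of physical memory, so a 5-level first kernel would have to cap the search range when the target kernel might run with 4-level paging. The helper below is made up for illustration and does not exist in the kernel:

	#define PHYSMEM_4LEVEL_MAX	(1ULL << 46)	/* 64TB with 4-level paging */

	/* hypothetical: cap the kexec buffer search if the target kernel
	 * may boot with 4-level paging (e.g. forced via 'no5lvl') */
	if (target_may_use_4level_paging())
		kbuf->buf_max = min_t(unsigned long, kbuf->buf_max,
				      PHYSMEM_4LEVEL_MAX - 1);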
On 07/23/18 at 04:34pm, Michal Hocko wrote:
> On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
>
> I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().

Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.

#define KEXEC_CONTROL_MEMORY_GFP (GFP_KERNEL | __GFP_NORETRY)

struct kexec_segment {
	/*
	 * This pointer can point to user memory if kexec_load() system
	 * call is used or will point to kernel memory if
	 * kexec_file_load() system call is used.
	 *
	 * Use ->buf when expecting to deal with user memory and use ->kbuf
	 * when expecting to deal with kernel memory.
	 */
	union {
		void __user *buf;
		void *kbuf;
	};
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};

Thanks
Baoquan
On Wed 25-07-18 14:48:13, Baoquan He wrote:
> On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> >
> > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
>
> Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.

I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.

The memblock code warns exactly for that reason.
On 07/26/18 at 02:59pm, Michal Hocko wrote:
> On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> > >
> > > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
> >
> > Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.
>
> I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.

OK, I forgot the 2nd kernel, the one kexec jumps into. It won't impact hotremove in the 1st kernel, but it does impact the kernel kexec jumps into, if that kernel sits at the top of system RAM and the top RAM is in a movable node.

> The memblock code warns exactly for that reason.
> --
> Michal Hocko
> SUSE Labs
On Thu 26-07-18 21:09:04, Baoquan He wrote:
> On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> > > >
> > > > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
> > >
> > > Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.
> >
> > I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.
>
> OK, I forgot the 2nd kernel, the one kexec jumps into. It won't impact hotremove in the 1st kernel, but it does impact the kernel kexec jumps into, if that kernel sits at the top of system RAM and the top RAM is in a movable node.

It will affect the 1st kernel (which does the memblock allocation top-down) as well. For reasons mentioned above.
On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> > > > >
> > > > > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
> > > >
> > > > Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.
> > >
> > > I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.
> >
> > OK, I forgot the 2nd kernel, the one kexec jumps into. It won't impact hotremove in the 1st kernel, but it does impact the kernel kexec jumps into, if that kernel sits at the top of system RAM and the top RAM is in a movable node.
>
> It will affect the 1st kernel (which does the memblock allocation top-down) as well. For reasons mentioned above.

And btw, in an ideal world we would restrict top-down memblock allocations to the non-movable nodes. But I do not think we have that information ready at the time the reservation is done.
On 07/26/18 at 03:14pm, Michal Hocko wrote:
> On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> > > > > >
> > > > > > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
> > > > >
> > > > > Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.
> > > >
> > > > I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.
> > >
> > > OK, I forgot the 2nd kernel, the one kexec jumps into. It won't impact hotremove in the 1st kernel, but it does impact the kernel kexec jumps into, if that kernel sits at the top of system RAM and the top RAM is in a movable node.
> >
> > It will affect the 1st kernel (which does the memblock allocation top-down) as well. For reasons mentioned above.
>
> And btw, in an ideal world we would restrict top-down memblock allocations to the non-movable nodes. But I do not think we have that information ready at the time the reservation is done.

Oh, you might be mixing up kexec loading with kdump kernel loading. For the kdump kernel, we need to reserve a memory region during boot with the memblock allocator. For kexec loading, we just operate after the system is up, and do not need to reserve any memory region. The memory used to load them is handled in quite different ways.

Thanks
Baoquan
On Thu 26-07-18 21:37:05, Baoquan He wrote:
> On 07/26/18 at 03:14pm, Michal Hocko wrote:
> > On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> > > > > > >
> > > > > > > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
> > > > > >
> > > > > > Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.
> > > > >
> > > > > I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.
> > > >
> > > > OK, I forgot the 2nd kernel, the one kexec jumps into. It won't impact hotremove in the 1st kernel, but it does impact the kernel kexec jumps into, if that kernel sits at the top of system RAM and the top RAM is in a movable node.
> > >
> > > It will affect the 1st kernel (which does the memblock allocation top-down) as well. For reasons mentioned above.
> >
> > And btw, in an ideal world we would restrict top-down memblock allocations to the non-movable nodes. But I do not think we have that information ready at the time the reservation is done.
>
> Oh, you might be mixing up kexec loading with kdump kernel loading. For the kdump kernel, we need to reserve a memory region during boot with the memblock allocator. For kexec loading, we just operate after the system is up, and do not need to reserve any memory region. The memory used to load them is handled in quite different ways.

I didn't know about that. I thought both use the same underlying reservation mechanism. My bad and sorry for the noise.
On 07/26/18 at 04:01pm, Michal Hocko wrote:
> On Thu 26-07-18 21:37:05, Baoquan He wrote:
> > On 07/26/18 at 03:14pm, Michal Hocko wrote:
> > > On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > > > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > > > Kexec has been a formal feature in our distro, and customers owning those kinds of very large machines can use this feature to speed up the reboot process. On UEFI machines, kexec_file loading searches for a place to put the kernel under 4G, from top to down. As we know, the first 4G of space is the DMA32 zone; DMA, PCI mmconfig, BIOS, etc. all try to consume it. There is a real possibility of not being able to find usable space there for kernel/initrd. Searching the whole memory space from the top down, we don't have this worry.
> > > > > > > >
> > > > > > > > I do not have the full context here, but let me note that you should be careful when doing top-down reservations, because you can easily get into hotpluggable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node().
> > > > > > >
> > > > > > > Kexec reads the kernel/initrd files into a buffer and just searches for usable positions for the later copying. You can see struct kexec_segment below: for the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In the kernel, kimage_load_normal_segment() is called to copy the user space buffer into intermediate pages, which are allocated with the GFP_KERNEL flag. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, the final copying is done from the intermediate pages to the real destination pages that @mem points to, because we can't touch the 1st kernel's existing data while loading the kexec kernel. To my understanding, GFP_KERNEL makes those intermediate pages be allocated inside the unmovable area, so they won't impact hotplugging. But the @mem we searched for across the whole of system RAM might be lost along with a hotplug event. Hence we need to load the kexec kernel again when a hotplug event is detected.
> > > > > >
> > > > > > I am not sure I am following. If @mem is placed in a movable node then memory hotremove simply won't work, because we will be seeing reserved pages and do not know what to do about them. They are not migratable. Allocating intermediate pages from other nodes doesn't really help.
> > > > >
> > > > > OK, I forgot the 2nd kernel, the one kexec jumps into. It won't impact hotremove in the 1st kernel, but it does impact the kernel kexec jumps into, if that kernel sits at the top of system RAM and the top RAM is in a movable node.
> > > >
> > > > It will affect the 1st kernel (which does the memblock allocation top-down) as well. For reasons mentioned above.
> > >
> > > And btw, in an ideal world we would restrict top-down memblock allocations to the non-movable nodes. But I do not think we have that information ready at the time the reservation is done.
> >
> > Oh, you might be mixing up kexec loading with kdump kernel loading. For the kdump kernel, we need to reserve a memory region during boot with the memblock allocator. For kexec loading, we just operate after the system is up, and do not need to reserve any memory region. The memory used to load them is handled in quite different ways.
>
> I didn't know about that. I thought both use the same underlying reservation mechanism. My bad and sorry for the noise.

Not at all. It's truly confusing. I often need to take time to recall those details.
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c6a3b6851372..75226c1d08ce 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 					   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
 					   crashk_res.start, crashk_res.end,
 					   kbuf, func);
+	else if (kbuf->top_down)
+		return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
 	else
 		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
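For reference, a condensed sketch of the walk_system_ram_res_rev() this hunk depends on, based on the approach of AKASHI's version as described in the thread (the fixed array cap and error handling are simplified here; the posted patch grows its array dynamically):

int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
			    int (*func)(struct resource *, void *))
{
	struct resource res, *rams;
	int i, n = 0, ret = -1;

	rams = vmalloc(sizeof(*rams) * 64);	/* illustrative fixed cap */
	if (!rams)
		return -ENOMEM;

	res.start = start;
	res.end = end;
	res.flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;

	/* forward pass: record every matching System RAM range */
	while (n < 64 && !find_next_iomem_res(&res, IORES_DESC_NONE, true)) {
		rams[n].start = res.start;
		rams[n].end = res.end;
		n++;
		res.start = res.end + 1;
		res.end = end;
	}

	/* reverse pass: hand ranges to the callback, highest first */
	for (i = n - 1; i >= 0; i--) {
		ret = func(&rams[i], arg);
		if (ret)
			break;
	}

	vfree(rams);
	return ret;
}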
For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to load kernel/initrd/purgatory is supposed to be allocated from top to down. This is what we have been doing all along in the old kexec loading interface, and kexec loading is still the default in some distributions. However, the current kexec_file loading interface doesn't behave like this. The function arch_kexec_walk_mem() it calls ignores kexec_buf.top_down and instead calls walk_system_ram_res() directly to go through all System RAM resources from bottom to top, trying to find a memory region which can contain the specific kexec buffer, then calls locate_mem_hole_callback() to allocate memory in that found region from top to down. This brings confusion, especially now that KASLR is widely supported: users have to work out why the kexec/kdump kernel loading position differs between these two interfaces in order to rule out unnecessary noise. Hence these two interfaces need to be unified in behaviour.

Add a check of kexec_buf.top_down in arch_kexec_walk_mem(): if it is 'true', call the newly added walk_system_ram_res_rev() to find a memory region from top to down in which to load the kernel.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kexec@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)
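For context, a minimal sketch of how a kexec_file loader asks for a top-down placement through struct kexec_buf (the buffer values here are illustrative, not taken from this patch):

	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
				  .buf_max = ULONG_MAX, .top_down = true };
	int ret;

	kbuf.buffer = kernel_buf;		/* bytes read from the kernel file */
	kbuf.bufsz = kernel_len;
	kbuf.memsz = ALIGN(kernel_len, PAGE_SIZE);
	kbuf.buf_align = PAGE_SIZE;		/* illustrative alignment */

	/* on success, kbuf.mem holds the chosen load address; with this
	 * series the walk itself runs from the top of System RAM down */
	ret = kexec_add_buffer(&kbuf);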