Message ID | 20170227092817.23571-1-mhocko@kernel.org (mailing list archive) |
---|---|
State | RFC, archived |
Michal Hocko <mhocko@kernel.org> writes: > From: Michal Hocko <mhocko@suse.com> > > This knob has been added by 31bc3858ea3e ("memory-hotplug: add automatic > onlining policy for the newly added memory") mainly to cover memory > hotplug based ballooning solutions currently implemented for HyperV > and Xen. Both of them want to online the memory as soon after > registering as possible, otherwise they can register too much memory > which cannot be used and trigger the oom killer (we need ~1.5% of the > registered memory so a large increase can consume all the available > memory). hv_mem_hot_add even waits for the userspace to online the > memory if the auto onlining is disabled to mitigate that problem. > > Adding yet another knob and a config option just doesn't make much sense > IMHO. How is a random user supposed to know when to enable this option? > Ballooning drivers know much better that they want to do an immediate > online rather than waiting for the userspace to do that. If the memory > is onlined for a different purpose then we already have a notification > for the userspace and udev can handle the onlining. So the knob as well > as the config option for the default behavior just doesn't make any > sense. Let's remove them and allow users of add_memory to request the > online status explicitly. Not only does it make more sense, it also removes a > lot of clutter. > > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > > Hi, > I am sending this as an RFC because this is a user visible change. Maybe > we won't be able to remove the sysfs knob, which would be sad, especially > when it has been added without a wider discussion and IMHO it is just > wrong. Is there any reason why a kernel command line parameter wouldn't > work just fine? > > Even in that case I believe that we should remove the > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE knob. It just adds to an already > messy config space. Does anybody depend on the policy during the early > boot before the userspace can set the sysfs knob? Or why can't those users > simply use the kernel command line parameter? > > I also believe that the wait-for-userspace in hyperV should just die. It > should do the unconditional onlining. Same as Xen. I do not see any > reason why those should depend on the userspace. This should be just > fixed regardless of the sysfs/config part. I can separate this out of course. > > Thoughts/Concerns? I don't have anything new to add to the discussion happened last week but I'd like to summarize my arguments against this change: 1) This patch doesn't solve any issue. Configuration option is not an issue by itself, it is an option for distros to decide what they want to ship: udev rule with known issues (legacy mode) or enable the new option. Distro makers and users building their kernels should be able to answer this simple question "do you want to automatically online all newly added memory or not". There are distros already which ship kernels with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as far as I remember, maybe someone else). 2) This patch creates an imbalance between Xen/Hyper-V on one side and KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this memory won't get onlined. I don't understand how this problem is supposed to be solved by distros. They'll *have to* continue shipping a udev rule which has and always will have issues. 3) Kernel command line is not a viable choice, it is rather a debug method.
Having all newly added memory online as soon as possible is a major use-case, not something a couple of users wants (and this is proved by major distros shipping the unconditional 'offline->online' rule with udev). A couple of other thoughts: 1) Having all newly added memory online ASAP is probably what people want for all virtual machines. Unfortunately, we have additional complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is required. Especially, when further unplug is expected. 2) Adding new memory can (in some extreme cases) still fail as we need some *other* memory before we're able to online the newly added block. This is an issue to be solved and it is doable (IMO) with some pre-allocation. I'd also like to note that this patch doesn't re-introduce the issue I was fixing with in-kernel memory onlining as all memory added through the Hyper-V driver will be auto-onlined unconditionally. What I disagree with here is taking away choice without fixing any real world issues. [snip]
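For reference, the unconditional 'offline->online' udev rule that both sides keep referring to is a one-line match-and-assign; the rules file name below is illustrative, but the rule body is the one commonly shipped by distributions:

  # Illustrative rules file name; the rule itself is the usual
  # "online anything that shows up offline" policy discussed in this thread.
  cat > /etc/udev/rules.d/40-hotplug-memory.rules <<'EOF'
  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
  EOF
  udevadm control --reload

Its limitations (ordering for movable onlining, udev being slow or killed under memory pressure) are exactly what the rest of the thread argues about.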
On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote: [...] > I don't have anything new to add to the discussion happened last week > but I'd like to summarize my arguments against this change: > > 1) This patch doesn't solve any issue. Configuration option is not an > issue by itself, it is an option for distros to decide what they want to > ship: udev rule with known issues (legacy mode) or enable the new > option. Distro makers and users building their kernels should be able to > answer this simple question "do you want to automatically online all > newly added memory or not". OK, so could you be more specific? Distributions have no clue about which HW their kernel runs on so how can they possibly make a sensible decision here? > There are distros already which ship kernels > with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as > far as I remember, maybe someone else). > > 2) This patch creates an imbalance between Xen/Hyper-V on one side and > KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this > memory won't get onlined. I don't understand how this problem is > supposed to be solved by distros. They'll *have to* continue shipping > a udev rule which has and always will have issues. They have notifications for udev to online that memory and AFAICU neither KVM nor VMware are using memory hotplug for ballooning - unlike HyperV and Xen. > 3) Kernel command line is not a viable choice, it is rather a debug > method. Why? > Having all newly added memory online as soon as possible is a > major use-case not something a couple of users wants (and this is > proved by major distros shipping the unconditional 'offline->online' > rule with udev). I would argue otherwise, because this really depends on the usecase. a) somebody might want to online memory as movable and that really depends on which node we are talking about because not all of them can be movable b) it is easier to handle potential errors from userspace than the kernel. > A couple of other thoughts: > 1) Having all newly added memory online ASAP is probably what people > want for all virtual machines. Unfortunately, we have additional > complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some > cases manual intervention is required. Especially, when further unplug > is expected. and that is why we do not want to hardwire this into the kernel and we have a notification to handle this in userspace. > 2) Adding new memory can (in some extreme cases) still fail as we need > some *other* memory before we're able to online the newly added > block. This is an issue to be solved and it is doable (IMO) with some > pre-allocation. you cannot preallocate for all the possible memory that can be added.
Michal Hocko <mhocko@kernel.org> writes: > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote: > [...] >> I don't have anything new to add to the discussion happened last week >> but I'd like to summarize my arguments against this change: >> >> 1) This patch doesn't solve any issue. Configuration option is not an >> issue by itself, it is an option for distros to decide what they want to >> ship: udev rule with known issues (legacy mode) or enable the new >> option. Distro makers and users building their kernels should be able to >> answer this simple question "do you want to automatically online all >> newly added memory or not". > > OK, so could you be more specific? Distributions have no clue about > which HW their kernel runs on so how can they possibly make a sensible > decision here? They at least have an idea if they ship udev rule or not. I can also imagine different choices for non-x86 architectures but I don't know enough about them to have an opinion. > >> There are distros already which ship kernels >> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as >> far as I remember, maybe someone else). >> >> 2) This patch creates an imbalance between Xen/Hyper-V on one side and >> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this >> memory won't get onlined. I don't understand how this problem is >> supposed to be solved by distros. They'll *have to* continue shipping >> a udev rule which has and always will have issues. > > They have notifications for udev to online that memory and AFAICU > neither KVM nor VMware are using memory hotplut for ballooning - unlike > HyperV and Xen. > No, Hyper-V doesn't use memory hotplug for ballooning purposes. It is just a memory hotplug. The fact that the code is located in hv_balloon is just a coincidence. The difference with real hardware is how the operation is performed: with real hardware you need to take a DIMM, go to your server room, open the box, insert DIMM, go back to your seat. Asking to do some manual action to actually enable memory is kinda OK. The beauty of hypervisors is that everything happens automatically (e.g. when the VM is running out of memory). >> 3) Kernel command line is not a viable choice, it is rather a debug >> method. > > Why? > Because we usually have just a few things there (root=, console=) and the rest is used when something goes wrong or for 'special' cases, not for the majority of users. >> Having all newly added memory online as soon as possible is a >> major use-case not something a couple of users wants (and this is >> proved by major distros shipping the unconditional 'offline->online' >> rule with udev). > > I would argue because this really depends on the usecase. a) somebody > might want to online memory as movable and that really depends on which > node we are talking about because not all of them can be movable This is possible and that's why I introduce kernel command line options back then. To simplify, I argue that the major use-case is 'online ASAP, never offline' and for other use-cases we have options, both for distros (config) and for users (command-line) > b) it > is easier to handle potential errors from userspace than the kernel. > Yes, probably, but memory hotplug was around for quite some time and I didn't see anything but the dump udev rule (offline->online) without any handling. And I think that we should rather focus on fixing potential issues and making failures less probable (e.g. 
it's really hard to come up with something different from 'failed->retry'). >> A couple of other thoughts: >> 1) Having all newly added memory online ASAP is probably what people >> want for all virtual machines. Unfortunately, we have additional >> complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some >> cases manual intervention is required. Especially, when further unplug >> is expected. > > and that is why we do not want to hardwire this into the kernel and we > have a notification to handle this in userspace. Yes and I don't know about any plans to remove this notification. In case some really complex handling is required just don't turn on the automatic onlining. Memory hotplug in real x86 hardware is rare, memory hotplug for VMs is ubiquitous. > >> 2) Adding new memory can (in some extreme cases) still fail as we need >> some *other* memory before we're able to online the newly added >> block. This is an issue to be solved and it is doable (IMO) with some >> pre-allocation. > > you cannot preallocate for all the possible memory that can be added. For all, no, but for 1 next block - yes, and then I'll preallocate for the next one.
On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote: > A couple of other thoughts: > 1) Having all newly added memory online ASAP is probably what people > want for all virtual machines. This is not true for s390. On s390 we have "standby" memory that a guest sees and potentially may use if it sets it online. Every guest that sets memory offline contributes to the hypervisor's standby memory pool, while onlining standby memory takes memory away from the standby pool. The use-case is that a system administrator knows in advance the maximum size a guest will ever have and also defines how much memory should be used at boot time. The difference is standby memory. Auto-onlining of standby memory is the last thing we want. > Unfortunately, we have additional complexity with memory zones > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is > required. Especially, when further unplug is expected. This also is a reason why auto-onlining doesn't seem to be the best way.
Heiko Carstens <heiko.carstens@de.ibm.com> writes: > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote: >> A couple of other thoughts: >> 1) Having all newly added memory online ASAP is probably what people >> want for all virtual machines. Sorry, obviously missed 'x86' in the above statement. > > This is not true for s390. On s390 we have "standby" memory that a guest > sees and potentially may use if it sets it online. Every guest that sets > memory offline contributes to the hypervisor's standby memory pool, while > onlining standby memory takes memory away from the standby pool. > > The use-case is that a system administrator in advance knows the maximum > size a guest will ever have and also defines how much memory should be used > at boot time. The difference is standby memory. > > Auto-onlining of standby memory is the last thing we want. > This is actually a very good example of why we need the config option. s390 kernels will have it disabled.
On Mon 27-02-17 11:49:43, Vitaly Kuznetsov wrote: > Michal Hocko <mhocko@kernel.org> writes: > > > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote: > > [...] > >> I don't have anything new to add to the discussion happened last week > >> but I'd like to summarize my arguments against this change: > >> > >> 1) This patch doesn't solve any issue. Configuration option is not an > >> issue by itself, it is an option for distros to decide what they want to > >> ship: udev rule with known issues (legacy mode) or enable the new > >> option. Distro makers and users building their kernels should be able to > >> answer this simple question "do you want to automatically online all > >> newly added memory or not". > > > > OK, so could you be more specific? Distributions have no clue about > > which HW their kernel runs on so how can they possibly make a sensible > > decision here? > > They at least have an idea if they ship udev rule or not. I can also > imagine different choices for non-x86 architectures but I don't know > enough about them to have an opinion. I really do not follow. If they know whether they ship the udev rule then why do they need a kernel help at all? Anyway this global policy actually breaks some usecases. Say you would have a default set to online. What should user do if _some_ nodes should be online_movable? Or, say that HyperV or other hotplug based ballooning implementation really want to online the movable memory in order to have a realiable hotremove. Now you have a global policy which goes against it. > >> There are distros already which ship kernels > >> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as > >> far as I remember, maybe someone else). > >> > >> 2) This patch creates an imbalance between Xen/Hyper-V on one side and > >> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this > >> memory won't get onlined. I don't understand how this problem is > >> supposed to be solved by distros. They'll *have to* continue shipping > >> a udev rule which has and always will have issues. > > > > They have notifications for udev to online that memory and AFAICU > > neither KVM nor VMware are using memory hotplut for ballooning - unlike > > HyperV and Xen. > > > > No, Hyper-V doesn't use memory hotplug for ballooning purposes. It is > just a memory hotplug. The fact that the code is located in hv_balloon > is just a coincidence. OK, I might be wrong here but 1cac8cd4d146 ("Drivers: hv: balloon: Implement hot-add functionality") suggests otherwise. > The difference with real hardware is how the operation is performed: > with real hardware you need to take a DIMM, go to your server room, open > the box, insert DIMM, go back to your seat. Asking to do some manual > action to actually enable memory is kinda OK. The beauty of hypervisors > is that everything happens automatically (e.g. when the VM is running > out of memory). I do not see your point. Either you have some (semi)automatic way to balance memory in guest based on the memory pressure (let's call it ballooning) or this is an administration operation (say you buy more DIMs or pay more to your virtualization provider) and then it is up to the guest owner to tell what to do about that memory. In other words you really do not want to wait in the first case as you are under memory pressure which is _actively_ managed or this is much more relaxed environment. > >> 3) Kernel command line is not a viable choice, it is rather a debug > >> method. > > > > Why? 
> > > > Because we usually have just a few things there (root=, console=) and > the rest is used when something goes wrong or for 'special' cases, not > for the majority of users. auto online or even memory hotplug seems something that requires a special HW/configuration already so I fail to see your point. It is normal to put kernel parameters to override the default. And AFAIU default offline is a sensible default for the standard memory hotplug. [...] > >> 2) Adding new memory can (in some extreme cases) still fail as we need > >> some *other* memory before we're able to online the newly added > >> block. This is an issue to be solved and it is doable (IMO) with some > >> pre-allocation. > > > > you cannot preallocate for all the possible memory that can be added. > > For all, no, but for 1 next block - yes, and then I'll preallocate for > the next one. You are still thinking in the scope of your particular use case and I believe the whole thing is shaped around that very same thing and that is why it should have been rejected in the first place. Especially when that use case can be handled without user visible configuration knob.
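For reference, the two defaults being argued about in this exchange are the runtime sysfs knob added by 31bc3858ea3e and its boot-time counterpart; the paths and parameter name below are given from memory and should be verified against the kernel version in question:

  # Runtime policy knob (the one this patch proposes to remove):
  cat /sys/devices/system/memory/auto_online_blocks    # prints "online" or "offline"
  echo online > /sys/devices/system/memory/auto_online_blocks

  # Boot-time equivalent on the kernel command line (name as I recall it,
  # see Documentation/kernel-parameters.txt):
  #   memhp_default_state=online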
Michal Hocko <mhocko@kernel.org> writes: > On Mon 27-02-17 11:49:43, Vitaly Kuznetsov wrote: >> Michal Hocko <mhocko@kernel.org> writes: >> >> > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote: >> > [...] >> >> I don't have anything new to add to the discussion happened last week >> >> but I'd like to summarize my arguments against this change: >> >> >> >> 1) This patch doesn't solve any issue. Configuration option is not an >> >> issue by itself, it is an option for distros to decide what they want to >> >> ship: udev rule with known issues (legacy mode) or enable the new >> >> option. Distro makers and users building their kernels should be able to >> >> answer this simple question "do you want to automatically online all >> >> newly added memory or not". >> > >> > OK, so could you be more specific? Distributions have no clue about >> > which HW their kernel runs on so how can they possibly make a sensible >> > decision here? >> >> They at least have an idea if they ship udev rule or not. I can also >> imagine different choices for non-x86 architectures but I don't know >> enough about them to have an opinion. > > I really do not follow. If they know whether they ship the udev rule > then why do they need a kernel help at all? Anyway this global policy > actually breaks some usecases. Say you would have a default set to > online. What should user do if _some_ nodes should be online_movable? > Or, say that HyperV or other hotplug based ballooning implementation > really want to online the movable memory in order to have a realiable > hotremove. Now you have a global policy which goes against it. > While I think that hotremove is a special case which really requires manual intervention (at least to decide which memory goes NORMAL and which MOVABLE), MEMORY_HOTPLUG_DEFAULT_ONLINE is probably not for it. [snip] > >> The difference with real hardware is how the operation is performed: >> with real hardware you need to take a DIMM, go to your server room, open >> the box, insert DIMM, go back to your seat. Asking to do some manual >> action to actually enable memory is kinda OK. The beauty of hypervisors >> is that everything happens automatically (e.g. when the VM is running >> out of memory). > > I do not see your point. Either you have some (semi)automatic way to > balance memory in guest based on the memory pressure (let's call it > ballooning) or this is an administration operation (say you buy more > DIMs or pay more to your virtualization provider) and then it is up to > the guest owner to tell what to do about that memory. In other words you > really do not want to wait in the first case as you are under memory > pressure which is _actively_ managed or this is much more relaxed > environment. I don't see a contradiction between what I say and what you say here :-) Yes, there are case when we're not in a hurry and there are cases when we can't wait. > >> >> 3) Kernel command line is not a viable choice, it is rather a debug >> >> method. >> > >> > Why? >> > >> >> Because we usually have just a few things there (root=, console=) and >> the rest is used when something goes wrong or for 'special' cases, not >> for the majority of users. > > auto online or even memory hotplug seems something that requires > a special HW/configuration already so I fail to see your point. It is > normal to put kernel parameters to override the default. And AFAIU > default offline is a sensible default for the standard memory hotplug. > It depends how we define 'standard'. 
The point I'm trying to make is that it's really common for VMs to use this technique while in hardware (x86) world it is a rare occasion. The 'sensible default' may differ. > [...] > >> >> 2) Adding new memory can (in some extreme cases) still fail as we need >> >> some *other* memory before we're able to online the newly added >> >> block. This is an issue to be solved and it is doable (IMO) with some >> >> pre-allocation. >> > >> > you cannot preallocate for all the possible memory that can be added. >> >> For all, no, but for 1 next block - yes, and then I'll preallocate for >> the next one. > > You are still thinking in the scope of your particular use case and I > believe the whole thing is shaped around that very same thing and that > is why it should have been rejected in the first place. Especially when > that use case can be handled without user visible configuration knob. I think my use case is broad enough. At least it applies to all virtualization technoligies and not only to Hyper-V. But yes, I agree that adding a parameter to add_memory() solves my particular use case as well.
On Mon 27-02-17 12:25:10, Heiko Carstens wrote: > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote: > > A couple of other thoughts: > > 1) Having all newly added memory online ASAP is probably what people > > want for all virtual machines. > > This is not true for s390. On s390 we have "standby" memory that a guest > sees and potentially may use if it sets it online. Every guest that sets > memory offline contributes to the hypervisor's standby memory pool, while > onlining standby memory takes memory away from the standby pool. > > The use-case is that a system administrator in advance knows the maximum > size a guest will ever have and also defines how much memory should be used > at boot time. The difference is standby memory. > > Auto-onlining of standby memory is the last thing we want. > > > Unfortunately, we have additional complexity with memory zones > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is > > required. Especially, when further unplug is expected. > > This also is a reason why auto-onlining doesn't seem be the best way. Can you imagine any situation when somebody actually might want to have this knob enabled? From what I understand it doesn't seem to be the case.
On Mon, Feb 27, 2017 at 10:28:17AM +0100, Michal Hocko wrote: >diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h >index 134a2f69c21a..a72f7f64ee26 100644 >--- a/include/linux/memory_hotplug.h >+++ b/include/linux/memory_hotplug.h >@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page); > > extern int try_online_node(int nid); > >-extern bool memhp_auto_online; >- > #ifdef CONFIG_MEMORY_HOTREMOVE > extern bool is_pageblock_removable_nolock(struct page *page); > extern int arch_remove_memory(u64 start, u64 size); >@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {} > > extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, > void *arg, int (*func)(struct memory_block *, void *)); >-extern int add_memory(int nid, u64 start, u64 size); >+extern int add_memory(int nid, u64 start, u64 size, bool online); > extern int add_memory_resource(int nid, struct resource *resource, bool online); > extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default, > bool for_device); It would be nice if instead of a 'bool online' argument, add_memory() and add_memory_resource() took an 'int online_type', ala online_pages(). That way we could specify offline, online, online+movable, etc.
On Mon 27-02-17 11:28:52, Reza Arbab wrote: > On Mon, Feb 27, 2017 at 10:28:17AM +0100, Michal Hocko wrote: > >diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > >index 134a2f69c21a..a72f7f64ee26 100644 > >--- a/include/linux/memory_hotplug.h > >+++ b/include/linux/memory_hotplug.h > >@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page); > > > >extern int try_online_node(int nid); > > > >-extern bool memhp_auto_online; > >- > >#ifdef CONFIG_MEMORY_HOTREMOVE > >extern bool is_pageblock_removable_nolock(struct page *page); > >extern int arch_remove_memory(u64 start, u64 size); > >@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {} > > > >extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, > > void *arg, int (*func)(struct memory_block *, void *)); > >-extern int add_memory(int nid, u64 start, u64 size); > >+extern int add_memory(int nid, u64 start, u64 size, bool online); > >extern int add_memory_resource(int nid, struct resource *resource, bool online); > >extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default, > > bool for_device); > > It would be nice if instead of a 'bool online' argument, add_memory() and > add_memory_resource() took an 'int online_type', ala online_pages(). > > That way we could specify offline, online, online+movable, etc. Sure that would require more changes though and as such it is out of scope of this patch. But you are right, this is a logical follow up step.
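A sketch of what Reza's suggested follow-up could look like, reusing the MMOP_* request types that online_pages() already takes in kernels of this era; this is illustrative only and not part of the posted patch:

  /* Illustrative follow-up only -- not part of the posted patch. */
  /* online_pages() already distinguishes request types such as
   * MMOP_OFFLINE, MMOP_ONLINE_KEEP, MMOP_ONLINE_KERNEL, MMOP_ONLINE_MOVABLE. */
  extern int add_memory(int nid, u64 start, u64 size, int online_type);
  extern int add_memory_resource(int nid, struct resource *resource,
  			       int online_type);

  /* A hotplug-based ballooning driver that wants reliable hot-remove could
   * then request movable onlining explicitly: */
  static int example_hot_add(int nid, u64 start, u64 size)
  {
  	return add_memory(nid, start, size, MMOP_ONLINE_MOVABLE);
  }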
On Mon, Feb 27, 2017 at 04:43:04PM +0100, Michal Hocko wrote: > On Mon 27-02-17 12:25:10, Heiko Carstens wrote: > > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote: > > > A couple of other thoughts: > > > 1) Having all newly added memory online ASAP is probably what people > > > want for all virtual machines. > > > > This is not true for s390. On s390 we have "standby" memory that a guest > > sees and potentially may use if it sets it online. Every guest that sets > > memory offline contributes to the hypervisor's standby memory pool, while > > onlining standby memory takes memory away from the standby pool. > > > > The use-case is that a system administrator in advance knows the maximum > > size a guest will ever have and also defines how much memory should be used > > at boot time. The difference is standby memory. > > > > Auto-onlining of standby memory is the last thing we want. > > > > > Unfortunately, we have additional complexity with memory zones > > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is > > > required. Especially, when further unplug is expected. > > > > This also is a reason why auto-onlining doesn't seem be the best way. > > Can you imagine any situation when somebody actually might want to have > this knob enabled? From what I understand it doesn't seem to be the > case. I can only speak for s390, and at least here I think auto-online is always wrong, especially if you consider the added complexity that you may want to online memory sometimes to ZONE_NORMAL and sometimes to ZONE_MOVABLE.
On Mon 27-02-17 16:43:04, Michal Hocko wrote: > On Mon 27-02-17 12:25:10, Heiko Carstens wrote: > > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote: > > > A couple of other thoughts: > > > 1) Having all newly added memory online ASAP is probably what people > > > want for all virtual machines. > > > > This is not true for s390. On s390 we have "standby" memory that a guest > > sees and potentially may use if it sets it online. Every guest that sets > > memory offline contributes to the hypervisor's standby memory pool, while > > onlining standby memory takes memory away from the standby pool. > > > > The use-case is that a system administrator in advance knows the maximum > > size a guest will ever have and also defines how much memory should be used > > at boot time. The difference is standby memory. > > > > Auto-onlining of standby memory is the last thing we want. I don't know much about anything other than x86 so all comments below are from that point of view; architectures that don't need auto online can keep the current default. > > > Unfortunately, we have additional complexity with memory zones > > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is > > > required. Especially, when further unplug is expected. > > > > This also is a reason why auto-onlining doesn't seem be the best way. When trying to support memory unplug on guest side in RHEL7, experience shows otherwise. Simplistic udev rule which onlines added block doesn't work in case one wants to online it as movable. Hotplugged blocks in current kernel should be onlined in reverse order to online blocks as movable depending on adjacent blocks zone. Which means simple udev rule isn't usable since it gets event from the first to the last hotplugged block order. So now we would have to write a daemon that would - watch for all blocks in hotplugged memory appear (how would it know) - online them in right order (order might also be different depending on kernel version) -- it becomes even more complicated in NUMA case when there are multiple zones and kernel would have to provide user-space with information about zone maps In short current experience shows that userspace approach - doesn't solve issues that Vitaly has been fixing (i.e. onlining fast and/or under memory pressure) when udev (or something else might be killed) - doesn't reduce overall system complexity, it only gets worse as a user-space handler needs to know a lot about kernel internals and implementation details/kernel versions to work properly It might not be easy, but doing onlining in the kernel on the other hand is: - faster - more reliable (can't be killed under memory pressure) - kernel has access to all info needed for onlining and how its internals work to implement auto-online correctly - since there is no need to maintain an ABI for user-space (zones layout/ordering/maybe something else), the kernel is free to change the internal implementation without breaking userspace (currently hotplug+unplug doesn't work reliably and we might need something more flexible than zones) That's the direction of research in progress, i.e. making the kernel implementation better instead of putting responsibility on user-space to deal with kernel shortcomings. > Can you imagine any situation when somebody actually might want to have > this knob enabled? From what I understand it doesn't seem to be the > case.
For x86: * this config option is enabled by default in recent Fedora, * RHEL6 has shipped similar downstream patches doing the same thing for years, * RHEL7 has a udev rule (because there wasn't a kernel-side solution at fork time) that auto-onlines it unconditionally; Vitaly might backport it later when he has time. Not the Linux kernel, but still: an auto online policy is used by Windows both in baremetal and guest configurations. That somewhat shows that the current upstream defaults on x86 might not be what end-users wish for. When auto_online_blocks was introduced, Vitaly was conservative and left the current upstream defaults where they were lest it break someone else's setup, while allowing downstreams to set their own auto-online policy; eventually we might switch it upstream too.
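To make the ordering problem Igor describes concrete: with the zone-shifting constraint in the current kernel, a movable online only succeeds at the edge of ZONE_NORMAL, so a userspace helper effectively has to walk the freshly hot-added blocks backwards. The block numbers below are only an example (they match the qemu reproducer quoted later in the thread):

  # Assume memory32..memory39 have just been hot-added; online them as
  # movable starting from the highest-numbered block.
  for i in $(seq 39 -1 32); do
      echo online_movable > /sys/devices/system/memory/memory$i/state
  done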
On Thu 02-03-17 14:53:48, Igor Mammedov wrote: [...] > When trying to support memory unplug on guest side in RHEL7, > experience shows otherwise. Simplistic udev rule which onlines > added block doesn't work in case one wants to online it as movable. > > Hotplugged blocks in current kernel should be onlined in reverse > order to online blocks as movable depending on adjacent blocks zone. Could you be more specific please? Setting online_movable from the udev rule should just work regardless of the ordering or the state of other memblocks. If that doesn't work I would call it a bug. > Which means simple udev rule isn't usable since it gets event from > the first to the last hotplugged block order. So now we would have > to write a daemon that would > - watch for all blocks in hotplugged memory appear (how would it know) > - online them in right order (order might also be different depending > on kernel version) > -- it becomes even more complicated in NUMA case when there are > multiple zones and kernel would have to provide user-space > with information about zone maps > > In short current experience shows that userspace approach > - doesn't solve issues that Vitaly has been fixing (i.e. onlining > fast and/or under memory pressure) when udev (or something else > might be killed) yeah and that is why the patch does the onlining from the kernel. > > Can you imagine any situation when somebody actually might want to have > this knob enabled? From what I understand it doesn't seem to be the > case. > For x86: > * this config option is enabled by default in recent Fedora, How do you want to support usecases which really want to online memory as movable? Do you expect those users to disable the option because unless I am missing something the in kernel auto onlining only supports regular onlining.
On Thu, 2 Mar 2017 15:28:16 +0100 Michal Hocko <mhocko@kernel.org> wrote: > On Thu 02-03-17 14:53:48, Igor Mammedov wrote: > [...] > > When trying to support memory unplug on guest side in RHEL7, > > experience shows otherwise. Simplistic udev rule which onlines > > added block doesn't work in case one wants to online it as movable. > > > > Hotplugged blocks in current kernel should be onlined in reverse > > order to online blocks as movable depending on adjacent blocks zone. > > Could you be more specific please? Setting online_movable from the udev > rule should just work regardless of the ordering or the state of other > memblocks. If that doesn't work I would call it a bug. It's rather an implementation constrain than a bug for details and workaround patch see [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 patch attached there is limited by another memory hotplug issue, which is NORMAL/MOVABLE zone balance, if kernel runs on configuration where the most of memory is hot-removable kernel might experience lack of memory in zone NORMAL. > > > Which means simple udev rule isn't usable since it gets event from > > the first to the last hotplugged block order. So now we would have > > to write a daemon that would > > - watch for all blocks in hotplugged memory appear (how would it know) > > - online them in right order (order might also be different depending > > on kernel version) > > -- it becomes even more complicated in NUMA case when there are > > multiple zones and kernel would have to provide user-space > > with information about zone maps > > > > In short current experience shows that userspace approach > > - doesn't solve issues that Vitaly has been fixing (i.e. onlining > > fast and/or under memory pressure) when udev (or something else > > might be killed) > > yeah and that is why the patch does the onlining from the kernel. onlining in this patch is limited to hyperv and patch breaks auto-online on x86 kvm/vmware/baremetal as they reuse the same hotplug path. > > > Can you imagine any situation when somebody actually might want to have > > > this knob enabled? From what I understand it doesn't seem to be the > > > case. > > For x86: > > * this config option is enabled by default in recent Fedora, > > How do you want to support usecases which really want to online memory > as movable? Do you expect those users to disable the option because > unless I am missing something the in kernel auto onlining only supporst > regular onlining. current auto onlining config option does what it's been designed for, i.e. it onlines hotplugged memory. It's possible for non average Fedora user to override default (commit 86dd995d6) if she/he needs non default behavior (i.e. user knows how to online manually and/or can write a daemon that would handle all of nuances of kernel in use). For the rest when Fedora is used in cloud and user increases memory via management interface of whatever cloud she/he uses, it just works. So it's choice of distribution to pick its own default that makes majority of user-base happy and this patch removes it without taking that in consideration. How to online memory is different issue not related to this patch, current default onlining as ZONE_NORMAL works well for scaling up VMs. Memory unplug is rather new and it doesn't work reliably so far, moving onlining to user-space won't really help. Further work is need to be done so that it would work reliably. 
Now about the question of onlining removable memory as movable: the x86 kernel is able to learn from the ACPI subsystem whether hot-added memory is removable and to online it as movable without any intervention from user-space, where that is hard to do, as patch [1] shows. The problem is still being researched, and when we figure out how to fix the hot-remove issues we might enable auto-onlining by default for x86.
On Thu 02-03-17 18:03:15, Igor Mammedov wrote: > On Thu, 2 Mar 2017 15:28:16 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote: > > [...] > > > When trying to support memory unplug on guest side in RHEL7, > > > experience shows otherwise. Simplistic udev rule which onlines > > > added block doesn't work in case one wants to online it as movable. > > > > > > Hotplugged blocks in current kernel should be onlined in reverse > > > order to online blocks as movable depending on adjacent blocks zone. > > > > Could you be more specific please? Setting online_movable from the udev > > rule should just work regardless of the ordering or the state of other > > memblocks. If that doesn't work I would call it a bug. > It's rather an implementation constrain than a bug > for details and workaround patch see > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 "You are not authorized to access bug #1314306" could you paste the reasoning here please? > patch attached there is limited by another memory hotplug > issue, which is NORMAL/MOVABLE zone balance, if kernel runs > on configuration where the most of memory is hot-removable > kernel might experience lack of memory in zone NORMAL. yes and that is an inherent problem of movable memory. > > > Which means simple udev rule isn't usable since it gets event from > > > the first to the last hotplugged block order. So now we would have > > > to write a daemon that would > > > - watch for all blocks in hotplugged memory appear (how would it know) > > > - online them in right order (order might also be different depending > > > on kernel version) > > > -- it becomes even more complicated in NUMA case when there are > > > multiple zones and kernel would have to provide user-space > > > with information about zone maps > > > > > > In short current experience shows that userspace approach > > > - doesn't solve issues that Vitaly has been fixing (i.e. onlining > > > fast and/or under memory pressure) when udev (or something else > > > might be killed) > > > > yeah and that is why the patch does the onlining from the kernel. > onlining in this patch is limited to hyperv and patch breaks > auto-online on x86 kvm/vmware/baremetal as they reuse the same > hotplug path. Those can use the udev or do you see any reason why they couldn't? > > > > Can you imagine any situation when somebody actually might want to have > > > > this knob enabled? From what I understand it doesn't seem to be the > > > > case. > > > For x86: > > > * this config option is enabled by default in recent Fedora, > > > > How do you want to support usecases which really want to online memory > > as movable? Do you expect those users to disable the option because > > unless I am missing something the in kernel auto onlining only supporst > > regular onlining. > > current auto onlining config option does what it's been designed for, > i.e. it onlines hotplugged memory. > It's possible for non average Fedora user to override default > (commit 86dd995d6) if she/he needs non default behavior > (i.e. user knows how to online manually and/or can write > a daemon that would handle all of nuances of kernel in use). > > For the rest when Fedora is used in cloud and user increases memory > via management interface of whatever cloud she/he uses, it just works. > > So it's choice of distribution to pick its own default that makes > majority of user-base happy and this patch removes it without taking > that in consideration. 
You still can have a udev rule to achieve the same thing for non-ballooning based hotplug. > How to online memory is different issue not related to this patch, > current default onlining as ZONE_NORMAL works well for scaling > up VMs. > > Memory unplug is rather new and it doesn't work reliably so far, > moving onlining to user-space won't really help. Further work > is need to be done so that it would work reliably. The main problem I have with this is that this is a limited usecase driven configuration knob which doesn't work properly for other usecases (namely movable online once your distribution chooses to set the config option to auto online). There is a userspace solution for this so this shouldn't have been merged in the first place! It sneaked past a proper review process (linux-api wasn't CCed to get broader attention), which is really sad. So unless this causes a major regression which would be hard to fix I will submit the patch for inclusion.
On Fri, 3 Mar 2017 09:27:23 +0100 Michal Hocko <mhocko@kernel.org> wrote: > On Thu 02-03-17 18:03:15, Igor Mammedov wrote: > > On Thu, 2 Mar 2017 15:28:16 +0100 > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote: > > > [...] > > > > When trying to support memory unplug on guest side in RHEL7, > > > > experience shows otherwise. Simplistic udev rule which onlines > > > > added block doesn't work in case one wants to online it as movable. > > > > > > > > Hotplugged blocks in current kernel should be onlined in reverse > > > > order to online blocks as movable depending on adjacent blocks zone. > > > > > > Could you be more specific please? Setting online_movable from the udev > > > rule should just work regardless of the ordering or the state of other > > > memblocks. If that doesn't work I would call it a bug. > > It's rather an implementation constrain than a bug > > for details and workaround patch see > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 > > "You are not authorized to access bug #1314306" Sorry, I've made it public, related comments and patch should be accessible now (code snippets in BZ are based on older kernel but logic is still the same upstream) > could you paste the reasoning here please? sure here is reproducer: start VM with CLI: qemu-system-x86_64 -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \ -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \ /path/to/guest_image then in guest dimm1 blocks are from 32-39 echo online_movable > /sys/devices/system/memory/memory32/state -bash: echo: write error: Invalid argument in current mainline kernel it triggers following code path: online_pages() ... if (online_type == MMOP_ONLINE_KERNEL) { if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) return -EINVAL; zone_can_shift() ... if (idx < target) { /* pages must be at end of current zone */ if (pfn + nr_pages != zone_end_pfn(zone)) return false; since we are trying to online as movable not the last section in ZONE_NORMAL. Here is what makes hotplugged memory end up in ZONE_NORMAL: acpi_memory_enable_device() -> add_memory -> add_memory_resource -> -> arch/x86/mm/init_64.c /* * Memory is added always to NORMAL zone. This means you will never get * additional DMA/DMA32 memory. */ int arch_add_memory(int nid, u64 start, u64 size, bool for_device) { ... struct zone *zone = pgdat->node_zones + zone_for_memory(nid, start, size, ZONE_NORMAL, for_device); i.e. all hot-plugged memory modules always go to ZONE_NORMAL and only the first/last block in zone is allowed to be moved to another zone. Patch [1] tries to fix issue by assigning removable memory resource to movable zone so hotplugged+removable blocks look like: movable normal, movable, movable instead of current: normal, normal, normal movable but then with this fixed as suggested, auto online by default should work just fine in kernel with normal and movable zones without any need for user-space. > > patch attached there is limited by another memory hotplug > > issue, which is NORMAL/MOVABLE zone balance, if kernel runs > > on configuration where the most of memory is hot-removable > > kernel might experience lack of memory in zone NORMAL. > > yes and that is an inherent problem of movable memory. > > > > > Which means simple udev rule isn't usable since it gets event from > > > > the first to the last hotplugged block order. 
So now we would have > > > > to write a daemon that would > > > > - watch for all blocks in hotplugged memory appear (how would it know) > > > > - online them in right order (order might also be different depending > > > > on kernel version) > > > > -- it becomes even more complicated in NUMA case when there are > > > > multiple zones and kernel would have to provide user-space > > > > with information about zone maps > > > > > > > > In short current experience shows that userspace approach > > > > - doesn't solve issues that Vitaly has been fixing (i.e. onlining > > > > fast and/or under memory pressure) when udev (or something else > > > > might be killed) > > > > > > yeah and that is why the patch does the onlining from the kernel. > > onlining in this patch is limited to hyperv and patch breaks > > auto-online on x86 kvm/vmware/baremetal as they reuse the same > > hotplug path. > > Those can use the udev or do you see any reason why they couldn't? Reasons are above, under >>>> and >> quotations, patch breaks what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some user-space process could be killed if hotplugged memory isn't onlined fast enough leading to service termination and/or memory not being onlined at all (if udev is killed) Currently udev rule is not usable and one needs a daemon which would correctly do onlining and keep zone balance even for simple case usecase of 1 normal and 1 movable zone. And it gets more complicated in case of multiple numa nodes with multiple zones. > > > > > Can you imagine any situation when somebody actually might want to have > > > > > this knob enabled? From what I understand it doesn't seem to be the > > > > > case. > > > > For x86: > > > > * this config option is enabled by default in recent Fedora, > > > > > > How do you want to support usecases which really want to online memory > > > as movable? Do you expect those users to disable the option because > > > unless I am missing something the in kernel auto onlining only supporst > > > regular onlining. > > > > current auto onlining config option does what it's been designed for, > > i.e. it onlines hotplugged memory. > > It's possible for non average Fedora user to override default > > (commit 86dd995d6) if she/he needs non default behavior > > (i.e. user knows how to online manually and/or can write > > a daemon that would handle all of nuances of kernel in use). > > > > For the rest when Fedora is used in cloud and user increases memory > > via management interface of whatever cloud she/he uses, it just works. > > > > So it's choice of distribution to pick its own default that makes > > majority of user-base happy and this patch removes it without taking > > that in consideration. > > You still can have a udev rule to achive the same thing for > non-ballooning based hotplug. not in case when system is under load, udev path might be slow and udev might be killed by OOM leading to permanent disablement of memory onlining. > > How to online memory is different issue not related to this patch, > > current default onlining as ZONE_NORMAL works well for scaling > > up VMs. > > > > Memory unplug is rather new and it doesn't work reliably so far, > > moving onlining to user-space won't really help. Further work > > is need to be done so that it would work reliably. 
> The main problem I have with this is that this is a limited usecase > driven configuration knob which doesn't work properly for other usecases > (namely movable online once your distribution chooses to set the config > option to auto online). it works for default usecase in Fedora and non-default movable can be used with 1) removable memory auto-online as movable in kernel, like patch [1] would make movable hotplugged memory (when I have time I'll try to work on it) 2) (or in worst case due to lack of alternative) explicitly disabled auto-online on kernel CLI + onlining daemon (since udev isn't working in current kernel due to ordering issue) > There is a userspace solution for this so this > shouldn't have been merged in the first place! Sorry, currently user-space udev solution doesn't work nor will it work reliably in extreme conditions. > It sneaked past a proper review > process (linux-api wasn't CCed to get broader attention), which is > really sad. get_maintainer.pl doesn't list linux-api for 31bc3858ea3e, MAINTAINERS should be fixed if linux-api were to be CCed. > So unless this causes a major regression which would be hard to fix I > will submit the patch for inclusion. it will be a major regression due to the lack of a daemon that could online fast and can't be killed on OOM. So this clean-up patch does break a feature that is in use without providing a viable alternative. I wouldn't object to removing the config option as in this patch if memory were onlined by default for x86, but that's not the case yet. [1] https://bugzilla.redhat.com/attachment.cgi?id=1146332
On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > On Fri, 3 Mar 2017 09:27:23 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > > On Thu 02-03-17 18:03:15, Igor Mammedov wrote: > > > On Thu, 2 Mar 2017 15:28:16 +0100 > > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote: > > > > [...] > > > > > When trying to support memory unplug on guest side in RHEL7, > > > > > experience shows otherwise. Simplistic udev rule which onlines > > > > > added block doesn't work in case one wants to online it as movable. > > > > > > > > > > Hotplugged blocks in current kernel should be onlined in reverse > > > > > order to online blocks as movable depending on adjacent blocks zone. > > > > > > > > Could you be more specific please? Setting online_movable from the udev > > > > rule should just work regardless of the ordering or the state of other > > > > memblocks. If that doesn't work I would call it a bug. > > > It's rather an implementation constrain than a bug > > > for details and workaround patch see > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 > > > > "You are not authorized to access bug #1314306" > Sorry, > I've made it public, related comments and patch should be accessible now > (code snippets in BZ are based on older kernel but logic is still the same upstream) > > > could you paste the reasoning here please? > sure here is reproducer: > start VM with CLI: > qemu-system-x86_64 -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \ > -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \ > /path/to/guest_image > > then in guest dimm1 blocks are from 32-39 > > echo online_movable > /sys/devices/system/memory/memory32/state > -bash: echo: write error: Invalid argument > > in current mainline kernel it triggers following code path: > > online_pages() > ... > if (online_type == MMOP_ONLINE_KERNEL) { > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) > return -EINVAL; Are you sure? I would expect MMOP_ONLINE_MOVABLE here > zone_can_shift() > ... > if (idx < target) { > /* pages must be at end of current zone */ > if (pfn + nr_pages != zone_end_pfn(zone)) > return false; > > since we are trying to online as movable not the last section in > ZONE_NORMAL. > > Here is what makes hotplugged memory end up in ZONE_NORMAL: > acpi_memory_enable_device() -> add_memory -> add_memory_resource -> > -> arch/x86/mm/init_64.c > > /* > * Memory is added always to NORMAL zone. This means you will never get > * additional DMA/DMA32 memory. > */ > int arch_add_memory(int nid, u64 start, u64 size, bool for_device) > { > ... > struct zone *zone = pgdat->node_zones + > zone_for_memory(nid, start, size, ZONE_NORMAL, for_device); > > i.e. all hot-plugged memory modules always go to ZONE_NORMAL > and only the first/last block in zone is allowed to be moved > to another zone. Patch [1] tries to fix issue by assigning > removable memory resource to movable zone so hotplugged+removable > blocks look like: > movable normal, movable, movable > instead of current: > normal, normal, normal movable Hmm, this code is confusing and clean as mud. I have to stare there some more but AFAIK zones shouldn't have problems with holes so the only thing we have to guarantee is that different zones do not overlap. So this smells like a bug rather than the ineherent implementation limitation. [...] > > > > > Which means simple udev rule isn't usable since it gets event from > > > > > the first to the last hotplugged block order. 
So now we would have > > > > > to write a daemon that would > > > > > - watch for all blocks in hotplugged memory appear (how would it know) > > > > > - online them in right order (order might also be different depending > > > > > on kernel version) > > > > > -- it becomes even more complicated in NUMA case when there are > > > > > multiple zones and kernel would have to provide user-space > > > > > with information about zone maps > > > > > > > > > > In short current experience shows that userspace approach > > > > > - doesn't solve issues that Vitaly has been fixing (i.e. onlining > > > > > fast and/or under memory pressure) when udev (or something else > > > > > might be killed) > > > > > > > > yeah and that is why the patch does the onlining from the kernel. > > > onlining in this patch is limited to hyperv and patch breaks > > > auto-online on x86 kvm/vmware/baremetal as they reuse the same > > > hotplug path. > > > > Those can use the udev or do you see any reason why they couldn't? > > Reasons are above, under >>>> and >> quotations, patch breaks > what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some > user-space process could be killed if hotplugged memory isn't onlined > fast enough leading to service termination and/or memory not > being onlined at all (if udev is killed) OK, so from the discussion so far I have learned that this would be problem _only_ if we are trying to hotplug a _lot_ of memory at once (~1.5% of the online memory is needed). I am kind of skeptical this is a reasonable usecase. Who is going to hotadd 8G to 256M machine (which would eat half of the available memory which is still quite far from OOM)? Even if the memory balloning uses hotplug then such a grow sounds a bit excessive. > Currently udev rule is not usable and one needs a daemon > which would correctly do onlining and keep zone balance > even for simple case usecase of 1 normal and 1 movable zone. > And it gets more complicated in case of multiple numa nodes > with multiple zones. That sounds to be more related to the current implementation than anything else and as such is not a reason to invent specific user visible api. Btw. you are talking about movable zones byt the auto onlining doesn't allow to auto online movable memory. So either I miss your point or I am utterly confused. [...] > > > Memory unplug is rather new and it doesn't work reliably so far, > > > moving onlining to user-space won't really help. Further work > > > is need to be done so that it would work reliably. > > > > The main problem I have with this is that this is a limited usecase > > driven configuration knob which doesn't work properly for other usecases > > (namely movable online once your distribution choses to set the config > > option to auto online). > > it works for default usecase in Fedora and non-default > movable can be used with > 1) removable memory auto-online as movable in kernel, like > patch [1] would make movable hotplugged memory > (when I have time I'll try to work on it) > 2) (or in worst case due to lack of alternative) explicitly > disabled auto-online on kernel CLI + onlining daemon > (since udev isn't working in current kernel due to ordering issue) So I fail to see how this can work. Say the default will auto online all the added memory. This will be all in Zone Normal because the auto onlining doesn't do online_movable. Now I would like to online the memory as movable but that might fail because some kernel objects might be already sitting there so the offline would fail. 
> > There is a userspace solution for this so this > > shouldn't have been merged in the first place! > Sorry, currently user-space udev solution doesn't work nor > will it work reliably in extreme conditions. > > > It sneaked a proper review > > process (linux-api wasn't CC to get a broader attenttion) which is > > really sad. > > get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e, > MAINTAINERS should be fixed if linux-api were to be CCed. user visible APIs _should_ be discussed at this mailing list regardless what get_maintainer.pl says. This is not about who is the maintainer but about getting as wide audience for things that would have to be maintained basically for ever. > > So unless this causes a major regression which would be hard to fix I > > will submit the patch for inclusion. > it will be a major regression due to lack of daemon that > could online fast and can't be killed on OOM. So this > clean up patch does break used feature without providing > a viable alternative. So let's discuss the current memory hotplug shortcomings and get rid of the crud which developed on top. I will start by splitting up the patch into 3 parts. Do the auto online thing from the HyperV and xen balloning drivers and dropping the config option and finally drop the sysfs knob. The last patch might be NAKed and I can live with that as long as the reasoning is proper and there is a general consensus on that.
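For reference, the ordering constraint described above means that on current kernels one way to get the whole dimm into the Movable zone is to online its blocks from the highest address downwards. A rough sketch of that workaround, assuming the dimm spans memory32-memory39 as in the reproducer, would be:

  # online the hotplugged blocks as movable, last block first
  for i in $(seq 39 -1 32); do
          echo online_movable > /sys/devices/system/memory/memory$i/state
  done

This is exactly the ordering knowledge a simple udev rule cannot express, because udev sees the blocks one by one from the lowest address up.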
On Mon, 6 Mar 2017 15:54:17 +0100 Michal Hocko <mhocko@kernel.org> wrote: > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > > On Fri, 3 Mar 2017 09:27:23 +0100 > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > On Thu 02-03-17 18:03:15, Igor Mammedov wrote: > > > > On Thu, 2 Mar 2017 15:28:16 +0100 > > > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote: [...] > > > > > memblocks. If that doesn't work I would call it a bug. > > > > It's rather an implementation constrain than a bug > > > > for details and workaround patch see > > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 > > > > > > "You are not authorized to access bug #1314306" > > Sorry, > > I've made it public, related comments and patch should be accessible now > > (code snippets in BZ are based on older kernel but logic is still the same upstream) > > > > > could you paste the reasoning here please? > > sure here is reproducer: > > start VM with CLI: > > qemu-system-x86_64 -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \ > > -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \ > > /path/to/guest_image > > > > then in guest dimm1 blocks are from 32-39 > > > > echo online_movable > /sys/devices/system/memory/memory32/state > > -bash: echo: write error: Invalid argument > > > > in current mainline kernel it triggers following code path: > > > > online_pages() > > ... > > if (online_type == MMOP_ONLINE_KERNEL) { > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) > > return -EINVAL; > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here pretty much, reproducer is above so try and see for yourself [...] > [...] > > > > > > Which means simple udev rule isn't usable since it gets event from > > > > > > the first to the last hotplugged block order. So now we would have > > > > > > to write a daemon that would > > > > > > - watch for all blocks in hotplugged memory appear (how would it know) > > > > > > - online them in right order (order might also be different depending > > > > > > on kernel version) > > > > > > -- it becomes even more complicated in NUMA case when there are > > > > > > multiple zones and kernel would have to provide user-space > > > > > > with information about zone maps > > > > > > > > > > > > In short current experience shows that userspace approach > > > > > > - doesn't solve issues that Vitaly has been fixing (i.e. onlining > > > > > > fast and/or under memory pressure) when udev (or something else > > > > > > might be killed) > > > > > > > > > > yeah and that is why the patch does the onlining from the kernel. > > > > onlining in this patch is limited to hyperv and patch breaks > > > > auto-online on x86 kvm/vmware/baremetal as they reuse the same > > > > hotplug path. > > > > > > Those can use the udev or do you see any reason why they couldn't? > > > > Reasons are above, under >>>> and >> quotations, patch breaks > > what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some > > user-space process could be killed if hotplugged memory isn't onlined > > fast enough leading to service termination and/or memory not > > being onlined at all (if udev is killed) > > OK, so from the discussion so far I have learned that this would be > problem _only_ if we are trying to hotplug a _lot_ of memory at once > (~1.5% of the online memory is needed). I am kind of skeptical this is > a reasonable usecase. 
Who is going to hotadd 8G to 256M machine (which > would eat half of the available memory which is still quite far from > OOM)? Even if the memory balloning uses hotplug then such a grow sounds > a bit excessive. Slow and killable udev issue doesn't really depends on amount of hotplugged memory since it's onlined in blocks (128M for x64). Considering that it's currently onlined as zone normal, kernel doesn't have any issues adding more follow up blocks of memory. > > Currently udev rule is not usable and one needs a daemon > > which would correctly do onlining and keep zone balance > > even for simple case usecase of 1 normal and 1 movable zone. > > And it gets more complicated in case of multiple numa nodes > > with multiple zones. > > That sounds to be more related to the current implementation than > anything else and as such is not a reason to invent specific user > visible api. Btw. you are talking about movable zones byt the auto > onlining doesn't allow to auto online movable memory. So either I miss > your point or I am utterly confused. in current state neither does udev rule as memory is onlined as NORMAL (x64 variant at least), which is the same as auto online does now. We are discussing 2 different issues here and thread got pretty hard to follow. I'll try to sum up at results the end. > [...] > > > > Memory unplug is rather new and it doesn't work reliably so far, > > > > moving onlining to user-space won't really help. Further work > > > > is need to be done so that it would work reliably. > > > > > > The main problem I have with this is that this is a limited usecase > > > driven configuration knob which doesn't work properly for other usecases > > > (namely movable online once your distribution choses to set the config > > > option to auto online). > > > > it works for default usecase in Fedora and non-default > > movable can be used with > > 1) removable memory auto-online as movable in kernel, like > > patch [1] would make movable hotplugged memory > > (when I have time I'll try to work on it) > > 2) (or in worst case due to lack of alternative) explicitly > > disabled auto-online on kernel CLI + onlining daemon > > (since udev isn't working in current kernel due to ordering issue) > > So I fail to see how this can work. Say the default will auto online > all the added memory. This will be all in Zone Normal because the auto > onlining doesn't do online_movable. Now I would like to online the > memory as movable but that might fail because some kernel objects might > be already sitting there so the offline would fail. same happens with udev rule as currently it can online from the first to last hotplugged block only as Zone Normal. see arch/x86/mm/init_64.c: acpi_memory_enable_device() -> add_memory -> add_memory_resource both auto online and udev could be fixed by patch https://bugzilla.redhat.com/attachment.cgi?id=1146332 to online reomovable memory as movable. > > > There is a userspace solution for this so this > > > shouldn't have been merged in the first place! > > Sorry, currently user-space udev solution doesn't work nor > > will it work reliably in extreme conditions. > > > > > It sneaked a proper review > > > process (linux-api wasn't CC to get a broader attenttion) which is > > > really sad. > > > > get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e, > > MAINTAINERS should be fixed if linux-api were to be CCed. > > user visible APIs _should_ be discussed at this mailing list regardless > what get_maintainer.pl says. 
This is not about who is the maintainer but > about getting as wide audience for things that would have to be > maintained basically for ever. How would random contributor know which list to CC? > > > So unless this causes a major regression which would be hard to fix I > > > will submit the patch for inclusion. > > it will be a major regression due to lack of daemon that > > could online fast and can't be killed on OOM. So this > > clean up patch does break used feature without providing > > a viable alternative. > > So let's discuss the current memory hotplug shortcomings and get rid of > the crud which developed on top. I will start by splitting up the patch > into 3 parts. Do the auto online thing from the HyperV and xen balloning > drivers and dropping the config option and finally drop the sysfs knob. > The last patch might be NAKed and I can live with that as long as the > reasoning is proper and there is a general consensus on that. PS: CC me on that patches too It's major regression if you remove auto online in kernels that run on top of x86 kvm/vmware hypervisors, making API cleanups while breaking useful functionality doesn't make sense. I would ACK config option removal if auto online keeps working for all x86 hypervisors (hyperv/xen isn't the only who needs it) and keep kernel CLI option to override default. That doesn't mean that others will agree with flipping default, that's why config option has been added. Now to sum up what's been discussed on this thread, there were 2 different issues discussed: 1) memory hotplug: remove in kernel auto online for all except of hyperv/xen - suggested RFC is not acceptable from virt point of view as it regresses guests on top of x86 kvm/vmware which both use ACPI based memory hotplug. - udev/userspace solution doesn't work in practice as it's too slow and unreliable when system is under load which is quite common in virt usecase. That's why auto online has been introduced in the first place. 2) memory unplug: online memory as movable - doesn't work currently with udev rule due to kernel issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 - could be fixed both for in kernel auto online and udev with following patch: https://bugzilla.redhat.com/attachment.cgi?id=1146332 but fixing it this way exposes zone disbalance issues, which are not present in current kernel as blocks are onlined in Zone Normal. So this is area to work and improve on. - currently if one wants to use online_movable, one has to either * disable auto online in kernel OR * remove udev rule that distro ships AND write custom daemon that will be able to online block in right zone/order. So currently whole online_movable thing isn't working by default regardless of who onlines memory. I'm in favor of implementing that in kernel as it keeps kernel internals inside kernel and doesn't need kernel API to be involved (memory blocks in sysfs, online_kernel, online_movable) There would be no need in userspace which would have to deal with kernel zoo and maintain that as well. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
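For context, the "udev rule that distro ships" mentioned above is typically a one-liner roughly of this form (the exact rule differs between distributions):

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

The online_movable variant would simply assign "online_movable" instead of "online", which is the case that currently fails because of the block ordering issue described earlier in the thread.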
On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > On Mon, 6 Mar 2017 15:54:17 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: [...] > > > in current mainline kernel it triggers following code path: > > > > > > online_pages() > > > ... > > > if (online_type == MMOP_ONLINE_KERNEL) { > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) > > > return -EINVAL; > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > pretty much, reproducer is above so try and see for yourself I will play with this... [...] > > > get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e, > > > MAINTAINERS should be fixed if linux-api were to be CCed. > > > > user visible APIs _should_ be discussed at this mailing list regardless > > what get_maintainer.pl says. This is not about who is the maintainer but > > about getting as wide audience for things that would have to be > > maintained basically for ever. > > How would random contributor know which list to CC? This should have been brought up during the review process which was less than sufficient in this case. > > > > So unless this causes a major regression which would be hard to fix I > > > > will submit the patch for inclusion. > > > it will be a major regression due to lack of daemon that > > > could online fast and can't be killed on OOM. So this > > > clean up patch does break used feature without providing > > > a viable alternative. > > > > So let's discuss the current memory hotplug shortcomings and get rid of > > the crud which developed on top. I will start by splitting up the patch > > into 3 parts. Do the auto online thing from the HyperV and xen balloning > > drivers and dropping the config option and finally drop the sysfs knob. > > The last patch might be NAKed and I can live with that as long as the > > reasoning is proper and there is a general consensus on that. > PS: CC me on that patches too > > It's major regression if you remove auto online in kernels that > run on top of x86 kvm/vmware hypervisors, making API cleanups > while breaking useful functionality doesn't make sense. > > I would ACK config option removal if auto online keeps working > for all x86 hypervisors (hyperv/xen isn't the only who needs it) > and keep kernel CLI option to override default. > > That doesn't mean that others will agree with flipping default, > that's why config option has been added. > > Now to sum up what's been discussed on this thread, there were 2 > different issues discussed: > 1) memory hotplug: remove in kernel auto online for all > except of hyperv/xen > > - suggested RFC is not acceptable from virt point of view > as it regresses guests on top of x86 kvm/vmware which > both use ACPI based memory hotplug. > > - udev/userspace solution doesn't work in practice as it's > too slow and unreliable when system is under load which > is quite common in virt usecase. That's why auto online > has been introduced in the first place. Please try to be more specific why "too slow" is a problem. Also how much slower are we talking about? 
> 2) memory unplug: online memory as movable > > - doesn't work currently with udev rule due to kernel > issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 These should be fixed > - could be fixed both for in kernel auto online and udev > with following patch: > https://bugzilla.redhat.com/attachment.cgi?id=1146332 > but fixing it this way exposes zone disbalance issues, > which are not present in current kernel as blocks are > onlined in Zone Normal. So this is area to work and > improve on. > > - currently if one wants to use online_movable, > one has to either > * disable auto online in kernel OR which might not just work because an unmovable allocation could have made the memblock pinned. > * remove udev rule that distro ships > AND write custom daemon that will be able to online > block in right zone/order. So currently whole > online_movable thing isn't working by default > regardless of who onlines memory. my experience with onlining full nodes as movable shows this works just fine (with all the limitations of the movable zones but that is a separate thing). I haven't played with configurations where movable zones are sharing the node with other zones. > I'm in favor of implementing that in kernel as it keeps > kernel internals inside kernel and doesn't need > kernel API to be involved (memory blocks in sysfs, > online_kernel, online_movable) > There would be no need in userspace which would have to > deal with kernel zoo and maintain that as well. The kernel is supposed to provide a proper API and that is sysfs currently. I am not entirely happy about it either but pulling a lot of code into the kernel is not the right thing to do.
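As a side note, whether a block that ended up in the Normal zone has already been pinned by unmovable allocations can be estimated from userspace before an offline is attempted; a rough, non-authoritative check (the block number is just an example) is:

  # 1 suggests the block can probably be offlined, 0 that it is pinned
  cat /sys/devices/system/memory/memory32/removable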
Let's CC people touching this logic. A short summary is that onlining memory via udev is currently unusable for online_movable because blocks are added from lower addresses while movable blocks are allowed from last blocks. More below. On Thu 09-03-17 13:54:00, Michal Hocko wrote: > On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > > On Mon, 6 Mar 2017 15:54:17 +0100 > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > [...] > > > > in current mainline kernel it triggers following code path: > > > > > > > > online_pages() > > > > ... > > > > if (online_type == MMOP_ONLINE_KERNEL) { > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) > > > > return -EINVAL; > > > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > > pretty much, reproducer is above so try and see for yourself > > I will play with this... OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated [...] [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] so there is neither any normal zone nor movable one at the boot time. Then I hotplugged 1G slot (qemu) object_add memory-backend-ram,id=mem1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 unfortunatelly the memory didn't show up automatically and I got [ 116.375781] acpi PNP0C80:00: Enumeration failure so I had to probe it manually (prbably the BIOS my qemu uses doesn't support auto probing - I haven't really dug further). Anyway the SRAT table printed during the boot told that we should start at 0x100000000 # echo 0x100000000 > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable which looks reasonably right? Both Normal and Movable zones are allowed # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable Huh, so our valid_zones have changed under our feet... # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable and again. So only the last memblock is considered movable. Let's try to online them now. # echo online_movable > /sys/devices/system/memory/memory34/state # grep . 
/sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This would explain why onlining from the last block actually works but to me this sounds like a completely crappy behavior. All we need to guarantee AFAICS is that Normal and Movable zones do not overlap. I believe there is even no real requirement about ordering of the physical memory in Normal vs. Movable zones as long as they do not overlap. But let's keep it simple for the start and always enforce the current status quo that Normal zone is physically preceeding Movable zone. Can somebody explain why we cannot have a simple rule for Normal vs. Movable which would be: - block [pfn, pfn+block_size] can be Normal if !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn - block [pfn, pfn+block_size] can be Movable if !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn I haven't fully grokked all the restrictions on the movable zone size based on the kernel parameters (find_zone_movable_pfns_for_nodes) but this shouldn't really make the situation really much more complicated I believe because those parameters should be mostly about static initialization rather than hotplug but I might be easily missing something.
On Fri 10-03-17 14:58:07, Michal Hocko wrote: [...] > This would explain why onlining from the last block actually works but > to me this sounds like a completely crappy behavior. All we need to > guarantee AFAICS is that Normal and Movable zones do not overlap. I > believe there is even no real requirement about ordering of the physical > memory in Normal vs. Movable zones as long as they do not overlap. But > let's keep it simple for the start and always enforce the current status > quo that Normal zone is physically preceeding Movable zone. > Can somebody explain why we cannot have a simple rule for Normal vs. > Movable which would be: > - block [pfn, pfn+block_size] can be Normal if > !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn > - block [pfn, pfn+block_size] can be Movable if > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn OK, so while I was playing with this setup some more I probably got why this is done this way. All new memblocks are added to the zone Normal where they are accounted as spanned but not present. When we do online_movable we just cut from the end of the Normal zone and move it to Movable zone. This sounds really awkward. What was the reason to go this way? Why cannot we simply add those pages to the zone at the online time?
On 03/10/2017 08:58 AM, Michal Hocko wrote: > Let's CC people touching this logic. A short summary is that onlining > memory via udev is currently unusable for online_movable because blocks > are added from lower addresses while movable blocks are allowed from > last blocks. More below. > > On Thu 09-03-17 13:54:00, Michal Hocko wrote: >> On Tue 07-03-17 13:40:04, Igor Mammedov wrote: >>> On Mon, 6 Mar 2017 15:54:17 +0100 >>> Michal Hocko <mhocko@kernel.org> wrote: >>> >>>> On Fri 03-03-17 18:34:22, Igor Mammedov wrote: >> [...] >>>>> in current mainline kernel it triggers following code path: >>>>> >>>>> online_pages() >>>>> ... >>>>> if (online_type == MMOP_ONLINE_KERNEL) { >>>>> if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) >>>>> return -EINVAL; >>>> >>>> Are you sure? I would expect MMOP_ONLINE_MOVABLE here >>> pretty much, reproducer is above so try and see for yourself >> >> I will play with this... > > OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated > [...] > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] > [ 0.000000] Zone ranges: > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] > [ 0.000000] Normal empty > [ 0.000000] Movable zone start for each node > [ 0.000000] Early memory node ranges > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] > > so there is neither any normal zone nor movable one at the boot time. > Then I hotplugged 1G slot > (qemu) object_add memory-backend-ram,id=mem1,size=1G > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > > unfortunatelly the memory didn't show up automatically and I got > [ 116.375781] acpi PNP0C80:00: Enumeration failure > > so I had to probe it manually (prbably the BIOS my qemu uses doesn't > support auto probing - I haven't really dug further). Anyway the SRAT > table printed during the boot told that we should start at 0x100000000 > > # echo 0x100000000 > /sys/devices/system/memory/probe > # grep . /sys/devices/system/memory/memory32/valid_zones > Normal Movable > > which looks reasonably right? Both Normal and Movable zones are allowed > > # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe > # grep . /sys/devices/system/memory/memory3?/valid_zones > /sys/devices/system/memory/memory32/valid_zones:Normal > /sys/devices/system/memory/memory33/valid_zones:Normal Movable > > Huh, so our valid_zones have changed under our feet... > > # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe > # grep . /sys/devices/system/memory/memory3?/valid_zones > /sys/devices/system/memory/memory32/valid_zones:Normal > /sys/devices/system/memory/memory33/valid_zones:Normal > /sys/devices/system/memory/memory34/valid_zones:Normal Movable > > and again. So only the last memblock is considered movable. Let's try to > online them now. 
> > # echo online_movable > /sys/devices/system/memory/memory34/state > # grep . /sys/devices/system/memory/memory3?/valid_zones > /sys/devices/system/memory/memory32/valid_zones:Normal > /sys/devices/system/memory/memory33/valid_zones:Normal Movable > /sys/devices/system/memory/memory34/valid_zones:Movable Normal > I think there is no strong reason which kernel has the restriction. By setting the restrictions, it seems to have made management of these zone structs simple. Thanks, Yasuaki Ishimatsu > This would explain why onlining from the last block actually works but > to me this sounds like a completely crappy behavior. All we need to > guarantee AFAICS is that Normal and Movable zones do not overlap. I > believe there is even no real requirement about ordering of the physical > memory in Normal vs. Movable zones as long as they do not overlap. But > let's keep it simple for the start and always enforce the current status > quo that Normal zone is physically preceeding Movable zone. > Can somebody explain why we cannot have a simple rule for Normal vs. > Movable which would be: > - block [pfn, pfn+block_size] can be Normal if > !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn > - block [pfn, pfn+block_size] can be Movable if > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn > > I haven't fully grokked all the restrictions on the movable zone size > based on the kernel parameters (find_zone_movable_pfns_for_nodes) but > this shouldn't really make the situation really much more complicated I > believe because those parameters should be mostly about static > initialization rather than hotplug but I might be easily missing > something. > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: >OK, so while I was playing with this setup some more I probably got why >this is done this way. All new memblocks are added to the zone Normal >where they are accounted as spanned but not present. It's not always zone Normal. See zone_for_memory(). This leads to a workaround for having to do online_movable in descending block order. Instead of this: 1. probe block 34, probe block 33, probe block 32, ... 2. online_movable 34, online_movable 33, online_movable 32, ... you can online_movable the first block before adding the rest: 1. probe block 32, online_movable 32 2. probe block 33, probe block 34, ... - zone_for_memory() will cause these to start Movable 3. online 33, online 34, ... - they're already in Movable, so online_movable is equivalent I agree with your general sentiment that this stuff is very nonintuitive.
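Spelled out with the probe interface used earlier in the thread, that workaround would look roughly like this (the 0x100000000 start address and block numbers are taken from Michal's example and are assumptions of the sketch, not a general recipe):

  # add the first block and online it as movable right away
  echo 0x100000000 > /sys/devices/system/memory/probe
  echo online_movable > /sys/devices/system/memory/memory32/state
  # blocks added afterwards start off in the Movable zone,
  # so a plain "online" is enough for them
  echo $((0x100000000 + (128 << 20))) > /sys/devices/system/memory/probe
  echo online > /sys/devices/system/memory/memory33/state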
Hey, On Mon, Mar 06, 2017 at 03:54:17PM +0100, Michal Hocko wrote: [...] > So let's discuss the current memory hotplug shortcomings and get rid of > the crud which developed on top. I will start by splitting up the patch > into 3 parts. Do the auto online thing from the HyperV and xen balloning > drivers and dropping the config option and finally drop the sysfs knob. > The last patch might be NAKed and I can live with that as long as the > reasoning is proper and there is a general consensus on that. If it is not too late please CC me on the patches and relevant threads. Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote: > On 03/10/2017 08:58 AM, Michal Hocko wrote: [...] > >OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated > >[...] > >[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] > >[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] > >[ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] > >[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug > >[ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] > >[ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] > >[ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] > >[ 0.000000] Zone ranges: > >[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] > >[ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] > >[ 0.000000] Normal empty > >[ 0.000000] Movable zone start for each node > >[ 0.000000] Early memory node ranges > >[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] > >[ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] > >[ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] > > > >so there is neither any normal zone nor movable one at the boot time. > >Then I hotplugged 1G slot > >(qemu) object_add memory-backend-ram,id=mem1,size=1G > >(qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > > > >unfortunatelly the memory didn't show up automatically and I got > >[ 116.375781] acpi PNP0C80:00: Enumeration failure > > > >so I had to probe it manually (prbably the BIOS my qemu uses doesn't > >support auto probing - I haven't really dug further). Anyway the SRAT > >table printed during the boot told that we should start at 0x100000000 > > > ># echo 0x100000000 > /sys/devices/system/memory/probe > ># grep . /sys/devices/system/memory/memory32/valid_zones > >Normal Movable > > > >which looks reasonably right? Both Normal and Movable zones are allowed > > > ># echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe > ># grep . /sys/devices/system/memory/memory3?/valid_zones > >/sys/devices/system/memory/memory32/valid_zones:Normal > >/sys/devices/system/memory/memory33/valid_zones:Normal Movable > > > >Huh, so our valid_zones have changed under our feet... > > > ># echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe > ># grep . /sys/devices/system/memory/memory3?/valid_zones > >/sys/devices/system/memory/memory32/valid_zones:Normal > >/sys/devices/system/memory/memory33/valid_zones:Normal > >/sys/devices/system/memory/memory34/valid_zones:Normal Movable > > > >and again. So only the last memblock is considered movable. Let's try to > >online them now. > > > ># echo online_movable > /sys/devices/system/memory/memory34/state > ># grep . /sys/devices/system/memory/memory3?/valid_zones > >/sys/devices/system/memory/memory32/valid_zones:Normal > >/sys/devices/system/memory/memory33/valid_zones:Normal Movable > >/sys/devices/system/memory/memory34/valid_zones:Movable Normal > > > > I think there is no strong reason which kernel has the restriction. > By setting the restrictions, it seems to have made management of > these zone structs simple. Could you be more specific please? How could this make management any easier when udev is basically racing with the physical hotplug and the result is basically undefined?
On Fri 10-03-17 13:00:37, Reza Arbab wrote: > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > >OK, so while I was playing with this setup some more I probably got why > >this is done this way. All new memblocks are added to the zone Normal > >where they are accounted as spanned but not present. > > It's not always zone Normal. See zone_for_memory(). This leads to a > workaround for having to do online_movable in descending block order. > Instead of this: > > 1. probe block 34, probe block 33, probe block 32, ... > 2. online_movable 34, online_movable 33, online_movable 32, ... > > you can online_movable the first block before adding the rest: How do I enforce that behavior when the probe happens automagically? > 1. probe block 32, online_movable 32 > 2. probe block 33, probe block 34, ... > - zone_for_memory() will cause these to start Movable > 3. online 33, online 34, ... > - they're already in Movable, so online_movable is equivalent > > I agree with your general sentiment that this stuff is very nonintuitive. My criterion for nonintuitive is probably different because I would call this _completely_unusable_. Sorry for being so loud about this but the more I look into this area the more WTF code I see. This has seen close to zero review and seems to be building up more single usecase code on top of previous. We need to change this, seriously!
On Fri, 10 Mar 2017 14:58:07 +0100 Michal Hocko <mhocko@kernel.org> wrote: > Let's CC people touching this logic. A short summary is that onlining > memory via udev is currently unusable for online_movable because blocks > are added from lower addresses while movable blocks are allowed from > last blocks. More below. > > On Thu 09-03-17 13:54:00, Michal Hocko wrote: > > On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > > > On Mon, 6 Mar 2017 15:54:17 +0100 > > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > > [...] > > > > > in current mainline kernel it triggers following code path: > > > > > > > > > > online_pages() > > > > > ... > > > > > if (online_type == MMOP_ONLINE_KERNEL) { > > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) > > > > > return -EINVAL; > > > > > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > > > pretty much, reproducer is above so try and see for yourself > > > > I will play with this... > > OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated 'mem' here distributes boot memory specified by "-m 2G" and does not include memory specified by -device pc-dimm. > [...] > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] > [ 0.000000] Zone ranges: > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] > [ 0.000000] Normal empty > [ 0.000000] Movable zone start for each node > [ 0.000000] Early memory node ranges > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] > > so there is neither any normal zone nor movable one at the boot time. it could be if hotpluggable memory were present at boot time in E802 table (if I remember right when running on hyperv there is movable zone at boot time), but in qemu hotpluggable memory isn't put into E820, so zone is allocated later when memory is enumerated by ACPI subsystem and onlined. It causes less issues wrt movable zone and works for different versions of linux/windows as well. That's where in kernel auto-onlining could be also useful, since user would be able to start-up with with small non removable memory plus several removable DIMMs and have all the memory onlined/available by the time initrd is loaded. (missing piece here is onling removable memory as movable by default). > Then I hotplugged 1G slot > (qemu) object_add memory-backend-ram,id=mem1,size=1G > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 You can also specify node a pc-dimm goes to with 'node' property if it should go to other then node 0. device_add pc-dimm,id=dimm1,memdev=mem1,node=1 > unfortunatelly the memory didn't show up automatically and I got > [ 116.375781] acpi PNP0C80:00: Enumeration failure it should work, do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled? 
> so I had to probe it manually (prbably the BIOS my qemu uses doesn't > support auto probing - I haven't really dug further). Anyway the SRAT > table printed during the boot told that we should start at 0x100000000 -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
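Putting these hints together, a hotplug sequence for the second guest node would look something like the following (with the guest kernel built with CONFIG_ACPI_HOTPLUG_MEMORY=y so the DIMM is enumerated without a manual probe):

  (qemu) object_add memory-backend-ram,id=mem1,size=1G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1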
On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > On Fri, 10 Mar 2017 14:58:07 +0100 [...] > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] > > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug > > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] > > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] > > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] > > [ 0.000000] Zone ranges: > > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] > > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] > > [ 0.000000] Normal empty > > [ 0.000000] Movable zone start for each node > > [ 0.000000] Early memory node ranges > > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] > > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] > > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] > > > > so there is neither any normal zone nor movable one at the boot time. > it could be if hotpluggable memory were present at boot time in E802 table > (if I remember right when running on hyperv there is movable zone at boot time), > > but in qemu hotpluggable memory isn't put into E820, > so zone is allocated later when memory is enumerated > by ACPI subsystem and onlined. > It causes less issues wrt movable zone and works for > different versions of linux/windows as well. > > That's where in kernel auto-onlining could be also useful, > since user would be able to start-up with with small > non removable memory plus several removable DIMMs > and have all the memory onlined/available by the time > initrd is loaded. (missing piece here is onling > removable memory as movable by default). Why we should even care to online that memory that early rather than making it available via e820? > > Then I hotplugged 1G slot > > (qemu) object_add memory-backend-ram,id=mem1,size=1G > > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > You can also specify node a pc-dimm goes to with 'node' property > if it should go to other then node 0. > > device_add pc-dimm,id=dimm1,memdev=mem1,node=1 thanks for the tip > > unfortunatelly the memory didn't show up automatically and I got > > [ 116.375781] acpi PNP0C80:00: Enumeration failure > it should work, > do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled? No I didn't. Thanks, good to know!
On Thu, 9 Mar 2017 13:54:00 +0100 Michal Hocko <mhocko@kernel.org> wrote: [...] > > It's major regression if you remove auto online in kernels that > > run on top of x86 kvm/vmware hypervisors, making API cleanups > > while breaking useful functionality doesn't make sense. > > > > I would ACK config option removal if auto online keeps working > > for all x86 hypervisors (hyperv/xen isn't the only who needs it) > > and keep kernel CLI option to override default. > > > > That doesn't mean that others will agree with flipping default, > > that's why config option has been added. > > > > Now to sum up what's been discussed on this thread, there were 2 > > different issues discussed: > > 1) memory hotplug: remove in kernel auto online for all > > except of hyperv/xen > > > > - suggested RFC is not acceptable from virt point of view > > as it regresses guests on top of x86 kvm/vmware which > > both use ACPI based memory hotplug. > > > > - udev/userspace solution doesn't work in practice as it's > > too slow and unreliable when system is under load which > > is quite common in virt usecase. That's why auto online > > has been introduced in the first place. > > Please try to be more specific why "too slow" is a problem. Also how > much slower are we talking about? In virt case on host with lots VMs, userspace handler processing could be scheduled late enough to trigger a race between (guest memory going away/OOM handler) and memory coming online. > > > 2) memory unplug: online memory as movable > > > > - doesn't work currently with udev rule due to kernel > > issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7 > > These should be fixed > > > - could be fixed both for in kernel auto online and udev > > with following patch: > > https://bugzilla.redhat.com/attachment.cgi?id=1146332 > > but fixing it this way exposes zone disbalance issues, > > which are not present in current kernel as blocks are > > onlined in Zone Normal. So this is area to work and > > improve on. > > > > - currently if one wants to use online_movable, > > one has to either > > * disable auto online in kernel OR > > which might not just work because an unmovable allocation could have > made the memblock pinned. With memhp_default_state=offline on kernel CLI there won't be any unmovable allocation as hotplugged memory won't be onlined and user can online it manually. So it works for non default usecase of playing with memory hot-unplug. > > * remove udev rule that distro ships > > AND write custom daemon that will be able to online > > block in right zone/order. So currently whole > > online_movable thing isn't working by default > > regardless of who onlines memory. > > my epxperience with onlining full nodes as movable shows this works just > fine (with all the limitations of the movable zones but that is a > separate thing). I haven't played with configurations where movable > zones are sharing the node with other zones. I don't have access to a such baremetal configuration to play with anymore. > > I'm in favor of implementing that in kernel as it keeps > > kernel internals inside kernel and doesn't need > > kernel API to be involved (memory blocks in sysfs, > > online_kernel, online_movable) > > There would be no need in userspace which would have to > > deal with kernel zoo and maintain that as well. > > The kernel is supposed to provide a proper API and that is sysfs > currently. I am not entirely happy about it either but pulling a lot of > code into the kernel is not the rigth thing to do. 
Especially when > different usecases require different treatment. If it could be done from kernel side alone, it looks like a better way to me not to involve userspace at all. And for ACPI based x86/ARM it's possible to implement without adding a lot of kernel code. That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE so we could continue on improving kernel only auto-onlining and fixing current memory hot(un)plug issues without affecting other platforms/users that are no interested in it. (PS: I don't care much about sysfs knob for setting auto-onlining, as kernel CLI override with memhp_default_state seems sufficient to me) -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 13-03-17 11:55:54, Igor Mammedov wrote: > On Thu, 9 Mar 2017 13:54:00 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > [...] > > > It's major regression if you remove auto online in kernels that > > > run on top of x86 kvm/vmware hypervisors, making API cleanups > > > while breaking useful functionality doesn't make sense. > > > > > > I would ACK config option removal if auto online keeps working > > > for all x86 hypervisors (hyperv/xen isn't the only who needs it) > > > and keep kernel CLI option to override default. > > > > > > That doesn't mean that others will agree with flipping default, > > > that's why config option has been added. > > > > > > Now to sum up what's been discussed on this thread, there were 2 > > > different issues discussed: > > > 1) memory hotplug: remove in kernel auto online for all > > > except of hyperv/xen > > > > > > - suggested RFC is not acceptable from virt point of view > > > as it regresses guests on top of x86 kvm/vmware which > > > both use ACPI based memory hotplug. > > > > > > - udev/userspace solution doesn't work in practice as it's > > > too slow and unreliable when system is under load which > > > is quite common in virt usecase. That's why auto online > > > has been introduced in the first place. > > > > Please try to be more specific why "too slow" is a problem. Also how > > much slower are we talking about? > > In virt case on host with lots VMs, userspace handler > processing could be scheduled late enough to trigger a race > between (guest memory going away/OOM handler) and memory > coming online. Either you are mixing two things together or this doesn't really make much sense. So is this a balloning based on memory hotplug (aka active memory hotadd initiated between guest and host automatically) or a guest asking for additional memory by other means (pay more for memory etc.)? Because if this is an administrative operation then I seriously question this reasoning. [...] > > > - currently if one wants to use online_movable, > > > one has to either > > > * disable auto online in kernel OR > > > > which might not just work because an unmovable allocation could have > > made the memblock pinned. > > With memhp_default_state=offline on kernel CLI there won't be any > unmovable allocation as hotplugged memory won't be onlined and > user can online it manually. So it works for non default usecase > of playing with memory hot-unplug. I was talking about the case when the auto_online was true, of course, e.g. depending on the config option which you've said is enabled in Fedora kernels. [...] > > > I'm in favor of implementing that in kernel as it keeps > > > kernel internals inside kernel and doesn't need > > > kernel API to be involved (memory blocks in sysfs, > > > online_kernel, online_movable) > > > There would be no need in userspace which would have to > > > deal with kernel zoo and maintain that as well. > > > > The kernel is supposed to provide a proper API and that is sysfs > > currently. I am not entirely happy about it either but pulling a lot of > > code into the kernel is not the rigth thing to do. Especially when > > different usecases require different treatment. > > If it could be done from kernel side alone, it looks like a better way > to me not to involve userspace at all. And for ACPI based x86/ARM it's > possible to implement without adding a lot of kernel code. But this is not how we do the kernel development. We provide the API so that userspace can implement the appropriate policy on top. 
We do not add random knobs to implement the same thing in the kernel. Different users might want to implement different onlining strategies and that is hardly describable by a single global knob. Just look at the s390 example provided earlier. Please try to think out of your usecase scope. > That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE > so we could continue on improving kernel only auto-onlining > and fixing current memory hot(un)plug issues without affecting > other platforms/users that are no interested in it. I really do not see any reason to keep the config option. Setting this to enabled is the _wrong_ thing to do in a general purpose (distribution) kernel and a kernel for the specific usecase can achieve the same thing via boot command line. > (PS: I don't care much about sysfs knob for setting auto-onlining, > as kernel CLI override with memhp_default_state seems > sufficient to me) That is good to hear! I would be OK with keeping the kernel command line option until we resolve all the current issues with the hotplug.
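For completeness, the command line override both sides refer to is the memhp_default_state= parameter, so the two policies discussed in the thread map to boot options roughly as follows:

  # auto-online hotplugged memory in the kernel
  memhp_default_state=online
  # leave onlining to userspace, e.g. for movable/unplug experiments
  memhp_default_state=offline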
Michal Hocko <mhocko@kernel.org> writes: > On Mon 13-03-17 11:55:54, Igor Mammedov wrote: >> > > >> > > - suggested RFC is not acceptable from virt point of view >> > > as it regresses guests on top of x86 kvm/vmware which >> > > both use ACPI based memory hotplug. >> > > >> > > - udev/userspace solution doesn't work in practice as it's >> > > too slow and unreliable when system is under load which >> > > is quite common in virt usecase. That's why auto online >> > > has been introduced in the first place. >> > >> > Please try to be more specific why "too slow" is a problem. Also how >> > much slower are we talking about? >> >> In virt case on host with lots VMs, userspace handler >> processing could be scheduled late enough to trigger a race >> between (guest memory going away/OOM handler) and memory >> coming online. > > Either you are mixing two things together or this doesn't really make > much sense. So is this a balloning based on memory hotplug (aka active > memory hotadd initiated between guest and host automatically) or a guest > asking for additional memory by other means (pay more for memory etc.)? > Because if this is an administrative operation then I seriously question > this reasoning. I'm probably repeating myself but it seems this point was lost: This is not really a 'ballooning', it is just a pure memory hotplug. People may have any tools monitoring their VM memory usage and when a VM is running low on memory they may want to hotplug more memory to it. With udev-style memory onlining they should be aware of page tables and other in-kernel structures which require allocation so they need to add memory slowly and gradually or they risk running into OOM (at least getting some processes killed and these processes may be important). With in-kernel memory hotplug everything happens synchronously and no 'slowly and gradually' algorithm is required in all tools which may trigger memory hotplug. It's not about slowness, it's about being synchronous or asynchronous. This is not related to the virtualization technology used, the use-case is the same for all of them which support memory hotplug.
On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote: > Michal Hocko <mhocko@kernel.org> writes: > > > On Mon 13-03-17 11:55:54, Igor Mammedov wrote: > >> > > > >> > > - suggested RFC is not acceptable from virt point of view > >> > > as it regresses guests on top of x86 kvm/vmware which > >> > > both use ACPI based memory hotplug. > >> > > > >> > > - udev/userspace solution doesn't work in practice as it's > >> > > too slow and unreliable when system is under load which > >> > > is quite common in virt usecase. That's why auto online > >> > > has been introduced in the first place. > >> > > >> > Please try to be more specific why "too slow" is a problem. Also how > >> > much slower are we talking about? > >> > >> In virt case on host with lots VMs, userspace handler > >> processing could be scheduled late enough to trigger a race > >> between (guest memory going away/OOM handler) and memory > >> coming online. > > > > Either you are mixing two things together or this doesn't really make > > much sense. So is this a balloning based on memory hotplug (aka active > > memory hotadd initiated between guest and host automatically) or a guest > > asking for additional memory by other means (pay more for memory etc.)? > > Because if this is an administrative operation then I seriously question > > this reasoning. > > I'm probably repeating myself but it seems this point was lost: > > This is not really a 'ballooning', it is just a pure memory > hotplug. People may have any tools monitoring their VM memory usage and > when a VM is running low on memory they may want to hotplug more memory > to it. What is the API those guests ask for the memory? And who is actually responsible to ask for that memory? Is it a kernel or userspace solution? > With udev-style memory onlining they should be aware of page > tables and other in-kernel structures which require allocation so they > need to add memory slowly and gradually or they risk running into OOM > (at least getting some processes killed and these processes may be > important). With in-kernel memory hotplug everything happens > synchronously and no 'slowly and gradually' algorithm is required in > all tools which may trigger memory hotplug. What prevents those APIs being used reasonably and only asks so much memory as they can afford? I mean 1.5% available memory necessary for the hotplug is not all that much. Or more precisely what prevents to ask for this additional memory in a synchronous way? And just to prevent from a further solution, I can see why _in-kernel_ hotplug based 'ballooning (or whatever you call an on demand memory hotplug based on the memory pressure)' want's to be synchronous and that is why my patch changed those onlined the memory explicitly. I am questioning memory hotplug requested by admin/user space component to need any special from kernel assistance becuase it is only a shortcut which can be implemented from the userspace. I hope I've made myself clear finally.
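To put a number on the 1.5% figure: assuming a 64-byte struct page per 4K page (the x86_64 case), hot-adding 8G needs roughly

  $ echo $(( (8 << 30) / 4096 * 64 / (1 << 20) ))MB
  128MB

of already-available memory for the memmap, i.e. about 1.5% of the amount being added, and that allocation has to succeed at the time the memory is added.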
Michal Hocko <mhocko@kernel.org> writes: > On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote: >> Michal Hocko <mhocko@kernel.org> writes: >> >> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote: >> >> > > >> >> > > - suggested RFC is not acceptable from virt point of view >> >> > > as it regresses guests on top of x86 kvm/vmware which >> >> > > both use ACPI based memory hotplug. >> >> > > >> >> > > - udev/userspace solution doesn't work in practice as it's >> >> > > too slow and unreliable when system is under load which >> >> > > is quite common in virt usecase. That's why auto online >> >> > > has been introduced in the first place. >> >> > >> >> > Please try to be more specific why "too slow" is a problem. Also how >> >> > much slower are we talking about? >> >> >> >> In virt case on host with lots VMs, userspace handler >> >> processing could be scheduled late enough to trigger a race >> >> between (guest memory going away/OOM handler) and memory >> >> coming online. >> > >> > Either you are mixing two things together or this doesn't really make >> > much sense. So is this a balloning based on memory hotplug (aka active >> > memory hotadd initiated between guest and host automatically) or a guest >> > asking for additional memory by other means (pay more for memory etc.)? >> > Because if this is an administrative operation then I seriously question >> > this reasoning. >> >> I'm probably repeating myself but it seems this point was lost: >> >> This is not really a 'ballooning', it is just a pure memory >> hotplug. People may have any tools monitoring their VM memory usage and >> when a VM is running low on memory they may want to hotplug more memory >> to it. > > What is the API those guests ask for the memory? And who is actually > responsible to ask for that memory? Is it a kernel or userspace > solution? Whatever, this can even be a system administrator running 'free'. Hyper-V driver sends si_mem_available() and vm_memory_committed() metrics to the host every second and this can be later queried by any tool (e.g. powershell script). > >> With udev-style memory onlining they should be aware of page >> tables and other in-kernel structures which require allocation so they >> need to add memory slowly and gradually or they risk running into OOM >> (at least getting some processes killed and these processes may be >> important). With in-kernel memory hotplug everything happens >> synchronously and no 'slowly and gradually' algorithm is required in >> all tools which may trigger memory hotplug. > > What prevents those APIs being used reasonably and only asks so much > memory as they can afford? I mean 1.5% available memory necessary for > the hotplug is not all that much. Or more precisely what prevents to ask > for this additional memory in a synchronous way? The knowledge about the fact that we need to add memory slowly and wait till it gets onlined is not obvious. AFAIR when you hotplug memory to Windows VMs there is no such thing as 'onlining', and no brain is required, a simple script 'low memory -> add mory memory' always works. Asking all these script writers to think twice before issuing a memory add command memory sounds like too much (to me).
On Mon, 13 Mar 2017 11:43:02 +0100 Michal Hocko <mhocko@kernel.org> wrote: > On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > > On Fri, 10 Mar 2017 14:58:07 +0100 > [...] > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] > > > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug > > > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] > > > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] > > > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] > > > [ 0.000000] Zone ranges: > > > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] > > > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] > > > [ 0.000000] Normal empty > > > [ 0.000000] Movable zone start for each node > > > [ 0.000000] Early memory node ranges > > > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] > > > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] > > > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] > > > > > > so there is neither any normal zone nor movable one at the boot time. > > it could be if hotpluggable memory were present at boot time in E820 table > > (if I remember right when running on hyperv there is movable zone at boot time), > > > > but in qemu hotpluggable memory isn't put into E820, > > so zone is allocated later when memory is enumerated > > by ACPI subsystem and onlined. > > It causes less issues wrt movable zone and works for > > different versions of linux/windows as well. > > > > That's where in kernel auto-onlining could be also useful, > > since user would be able to start-up with small > > non removable memory plus several removable DIMMs > > and have all the memory onlined/available by the time > > initrd is loaded. (missing piece here is onlining > > removable memory as movable by default). > > Why we should even care to online that memory that early rather than > making it available via e820? It's not forbidden by the spec and has fewer complications when it comes to removable memory. Declaring it in E820 would add the following limitations/drawbacks: - firmware should be able to exclude removable memory from its usage (currently neither SeaBIOS nor EFI has to know/care about it) => less qemu-guest ABI to maintain. - the OS should be taught to avoid/move (early) nonmovable allocations away from removable address ranges. There were patches targeting that in recent kernels, but it won't work with older kernels that don't have them, so it limits the range of OSes that could run on QEMU and do memory removal. The E820-less approach works reasonably well with a wide range of guest OSes and is less complex than if removable memory were present in E820. Hence I don't have a compelling reason to introduce removable memory in E820 as it only adds to hot(un)plug issues. I have an off-tree QEMU hack that puts hot-removable memory added with "-device pc-dimm" on the CLI into E820 to experiment with. It could be useful to play with zone layouts at boot time, so if you are interested I can rebase it on top of current master and post it here to play with.
On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote: > Michal Hocko <mhocko@kernel.org> writes: > > > On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote: > >> Michal Hocko <mhocko@kernel.org> writes: > >> > >> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote: > >> >> > > > >> >> > > - suggested RFC is not acceptable from virt point of view > >> >> > > as it regresses guests on top of x86 kvm/vmware which > >> >> > > both use ACPI based memory hotplug. > >> >> > > > >> >> > > - udev/userspace solution doesn't work in practice as it's > >> >> > > too slow and unreliable when system is under load which > >> >> > > is quite common in virt usecase. That's why auto online > >> >> > > has been introduced in the first place. > >> >> > > >> >> > Please try to be more specific why "too slow" is a problem. Also how > >> >> > much slower are we talking about? > >> >> > >> >> In virt case on host with lots VMs, userspace handler > >> >> processing could be scheduled late enough to trigger a race > >> >> between (guest memory going away/OOM handler) and memory > >> >> coming online. > >> > > >> > Either you are mixing two things together or this doesn't really make > >> > much sense. So is this a balloning based on memory hotplug (aka active > >> > memory hotadd initiated between guest and host automatically) or a guest > >> > asking for additional memory by other means (pay more for memory etc.)? > >> > Because if this is an administrative operation then I seriously question > >> > this reasoning. > >> > >> I'm probably repeating myself but it seems this point was lost: > >> > >> This is not really a 'ballooning', it is just a pure memory > >> hotplug. People may have any tools monitoring their VM memory usage and > >> when a VM is running low on memory they may want to hotplug more memory > >> to it. > > > > What is the API those guests ask for the memory? And who is actually > > responsible to ask for that memory? Is it a kernel or userspace > > solution? > > Whatever, this can even be a system administrator running > 'free'. I am pretty sure that 'free' will not give you additional memory but let's be serious here... If this is solely about monitoring from userspace and requesting more memory from the userspace then I would consider arguing about timely hotplug operation as void because there is absolutely no guarantee to do the request itself in a timely fashion. > Hyper-V driver sends si_mem_available() and > vm_memory_committed() metrics to the host every second and this can be > later queried by any tool (e.g. powershell script). And how exactly is this related to the acpi hotplug which you were arguing needs the timely handling as well? > >> With udev-style memory onlining they should be aware of page > >> tables and other in-kernel structures which require allocation so they > >> need to add memory slowly and gradually or they risk running into OOM > >> (at least getting some processes killed and these processes may be > >> important). With in-kernel memory hotplug everything happens > >> synchronously and no 'slowly and gradually' algorithm is required in > >> all tools which may trigger memory hotplug. > > > > What prevents those APIs being used reasonably and only asks so much > > memory as they can afford? I mean 1.5% available memory necessary for > > the hotplug is not all that much. Or more precisely what prevents to ask > > for this additional memory in a synchronous way? > > The knowledge about the fact that we need to add memory slowly and > wait till it gets onlined is not obvious. 
yes it is, and we cannot afford to give a better experience with an implementation that requires having memory available in order to online memory. > AFAIR when you hotplug memory > to Windows VMs there is no such thing as 'onlining', and no brain is > required, a simple script 'low memory -> add more memory' always > works. Asking all these script writers to think twice before issuing a > memory add command sounds like too much (to me). Pardon me, but not requiring a brain while doing something on Windows VMs is not really an argument...
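As a back-of-the-envelope illustration of the ~1.5% figure used in this exchange (assuming the common case of a 64-byte struct page and 4 KiB base pages on x86_64):

    64 bytes of struct page per 4096-byte page  =>  64 / 4096 ~= 1.56% overhead
    hot-adding 1 TiB  =>  ~16 GiB of already-online memory needed for struct pages alone
    hot-adding 4 TiB  =>  ~64 GiB

This metadata is allocated during the add_memory() phase, before the new blocks are onlined, which is why adding a very large amount of memory to a small guest in one step can hit the OOM killer before any of the new memory becomes usable.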
On Mon 13-03-17 14:57:12, Igor Mammedov wrote: > On Mon, 13 Mar 2017 11:43:02 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > > On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > > > On Fri, 10 Mar 2017 14:58:07 +0100 > > [...] > > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] > > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] > > > > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] > > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug > > > > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] > > > > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] > > > > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] > > > > [ 0.000000] Zone ranges: > > > > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] > > > > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] > > > > [ 0.000000] Normal empty > > > > [ 0.000000] Movable zone start for each node > > > > [ 0.000000] Early memory node ranges > > > > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] > > > > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] > > > > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] > > > > > > > > so there is neither any normal zone nor movable one at the boot time. > > > it could be if hotpluggable memory were present at boot time in E802 table > > > (if I remember right when running on hyperv there is movable zone at boot time), > > > > > > but in qemu hotpluggable memory isn't put into E820, > > > so zone is allocated later when memory is enumerated > > > by ACPI subsystem and onlined. > > > It causes less issues wrt movable zone and works for > > > different versions of linux/windows as well. > > > > > > That's where in kernel auto-onlining could be also useful, > > > since user would be able to start-up with with small > > > non removable memory plus several removable DIMMs > > > and have all the memory onlined/available by the time > > > initrd is loaded. (missing piece here is onling > > > removable memory as movable by default). > > > > Why we should even care to online that memory that early rather than > > making it available via e820? > > It's not forbidden by spec and has less complications > when it comes to removable memory. Declaring it in E820 > would add following limitations/drawbacks: > - firmware should be able to exclude removable memory > from its usage (currently SeaBIOS nor EFI have to > know/care about it) => less qemu-guest ABI to maintain. > - OS should be taught to avoid/move (early) nonmovable > allocations from removable address ranges. > There were patches targeting that in recent kernels, > but it won't work with older kernels that don't have it. > So limiting a range of OSes that could run on QEMU > and do memory removal. > > E820 less approach works reasonably well with wide range > of guest OSes and less complex that if removable memory > were present it E820. Hence I don't have a compelling > reason to introduce removable memory in E820 as it > only adds to hot(un)plug issues. OK I see and that sounds like an argument to not put those ranges to E820. I still fail to see why we haeve to online the memory early during the boot and cannot wait for userspace to run?
On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: >> I agree with your general sentiment that this stuff is very >> nonintuitive. > >My criterion for nonintuitive is probably different because I would >call this _completely_unusable_. Sorry for being so loud about this but >the more I look into this area the more WTF code I see. This has seen >close to zero review and seems to be building up more single usecase >code on top of previous. We need to change this, seriously! No argument here. I'm happy to help however I can.
Michal Hocko <mhocko@kernel.org> writes: > On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote: >> > >> > What is the API those guests ask for the memory? And who is actually >> > responsible to ask for that memory? Is it a kernel or userspace >> > solution? >> >> Whatever, this can even be a system administrator running >> 'free'. > > I am pretty sure that 'free' will not give you additional memory but > let's be serious here... > If this is solely about monitoring from > userspace and requesting more memory from the userspace then I would > consider arguing about timely hotplug operation as void because there is > absolutely no guarantee to do the request itself in a timely fashion. > >> Hyper-V driver sends si_mem_available() and >> vm_memory_committed() metrics to the host every second and this can be >> later queried by any tool (e.g. powershell script). > > And how exactly is this related to the acpi hotplug which you were > arguing needs the timely handling as well? > What I meant to say is that there is no single 'right' way to get memory usage from a VM, make a decision somewhere (in the hypervisor, on some other host or even in someone's head) and issue a command to add more memory. I don't know what particular tools people use with ESX/KVM VMs but I think that multiple options are available. >> >> With udev-style memory onlining they should be aware of page >> >> tables and other in-kernel structures which require allocation so they >> >> need to add memory slowly and gradually or they risk running into OOM >> >> (at least getting some processes killed and these processes may be >> >> important). With in-kernel memory hotplug everything happens >> >> synchronously and no 'slowly and gradually' algorithm is required in >> >> all tools which may trigger memory hotplug. >> > >> > What prevents those APIs being used reasonably and only asks so much >> > memory as they can afford? I mean 1.5% available memory necessary for >> > the hotplug is not all that much. Or more precisely what prevents to ask >> > for this additional memory in a synchronous way? >> >> The knowledge about the fact that we need to add memory slowly and >> wait till it gets onlined is not obvious. > > yes it is and we cannot afford to give a better experience with the > implementation that requires to have memory to online a memory. Actually, we need memory to add memory, not to online it. And as I said before, this is a real issue which requires addressing, it should always be possible to add more memory (and to online already added memory if these events are separated). > >> AFAIR when you hotplug memory >> to Windows VMs there is no such thing as 'onlining', and no brain is >> required, a simple script 'low memory -> add mory memory' always >> works. Asking all these script writers to think twice before issuing a >> memory add command memory sounds like too much (to me). > > Pardon me, but not requiring a brain while doing something on Windows > VMs is not really an argument... Why? Windows (or any other OS) is just an example that things can be done in a different way, otherwise we'll end up with arguments like "it was always like that in Linux so it's good".
Let's add Andi On Fri 10-03-17 16:53:33, Michal Hocko wrote: > On Fri 10-03-17 14:58:07, Michal Hocko wrote: > [...] > > This would explain why onlining from the last block actually works but > > to me this sounds like a completely crappy behavior. All we need to > > guarantee AFAICS is that Normal and Movable zones do not overlap. I > > believe there is even no real requirement about ordering of the physical > > memory in Normal vs. Movable zones as long as they do not overlap. But > > let's keep it simple for the start and always enforce the current status > > quo that Normal zone is physically preceding Movable zone. > > Can somebody explain why we cannot have a simple rule for Normal vs. > > Movable which would be: > > - block [pfn, pfn+block_size] can be Normal if > > !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn > > - block [pfn, pfn+block_size] can be Movable if > > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn > > OK, so while I was playing with this setup some more I probably got why > this is done this way. All new memblocks are added to the zone Normal > where they are accounted as spanned but not present. When we do > online_movable we just cut from the end of the Normal zone and move it > to Movable zone. This sounds really awkward. What was the reason to go > this way? Why cannot we simply add those pages to the zone at the online > time? Answering myself. The reason seems to be 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem") which is no longer relevant because config MEMORY_HOTPLUG bool "Allow for memory hot-add" depends on SPARSEMEM || X86_64_ACPI_NUMA depends on ARCH_ENABLE_MEMORY_HOTPLUG depends on COMPILE_TEST || !KASAN so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled. config X86_64_ACPI_NUMA def_bool y prompt "ACPI NUMA detection" depends on X86_64 && NUMA && ACPI && PCI select ACPI_NUMA But I do not see any way to enable anything but SPARSEMEM for x86_64: choice prompt "Memory model" depends on SELECT_MEMORY_MODEL default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT default FLATMEM_MANUAL ARCH_DISCONTIGMEM_DEFAULT is 32b only config ARCH_DISCONTIGMEM_DEFAULT def_bool y depends on NUMA && X86_32 and ARCH_SPARSEMEM_DEFAULT is enabled on 64b. So I guess whatever was the reason to add this code back in 2006 is not true anymore. So I am really wondering. Do we absolutely need to assign pages which are not onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them out of any zone and wait for the memory online operation to put them where requested?
> and ARCH_SPARSEMEM_DEFAULT is enabled on 64b. So I guess whatever was > the reason to add this code back in 2006 is not true anymore. So I am > really wondering. Do we absolutely need to assign pages which are not > onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them > out of any zone and wait for memory online operation to put them where > requested? I don't remember any specific reason. May have just been simplification. Should be fine to use any zone. -Andi
On Mon, 13 Mar 2017 13:28:25 +0100 Michal Hocko <mhocko@kernel.org> wrote: > On Mon 13-03-17 11:55:54, Igor Mammedov wrote: > > On Thu, 9 Mar 2017 13:54:00 +0100 > > Michal Hocko <mhocko@kernel.org> wrote: > > > > [...] > > > > It's major regression if you remove auto online in kernels that > > > > run on top of x86 kvm/vmware hypervisors, making API cleanups > > > > while breaking useful functionality doesn't make sense. > > > > > > > > I would ACK config option removal if auto online keeps working > > > > for all x86 hypervisors (hyperv/xen isn't the only who needs it) > > > > and keep kernel CLI option to override default. > > > > > > > > That doesn't mean that others will agree with flipping default, > > > > that's why config option has been added. > > > > > > > > Now to sum up what's been discussed on this thread, there were 2 > > > > different issues discussed: > > > > 1) memory hotplug: remove in kernel auto online for all > > > > except of hyperv/xen > > > > > > > > - suggested RFC is not acceptable from virt point of view > > > > as it regresses guests on top of x86 kvm/vmware which > > > > both use ACPI based memory hotplug. > > > > > > > > - udev/userspace solution doesn't work in practice as it's > > > > too slow and unreliable when system is under load which > > > > is quite common in virt usecase. That's why auto online > > > > has been introduced in the first place. > > > > > > Please try to be more specific why "too slow" is a problem. Also how > > > much slower are we talking about? > > > > In virt case on host with lots VMs, userspace handler > > processing could be scheduled late enough to trigger a race > > between (guest memory going away/OOM handler) and memory > > coming online. > > Either you are mixing two things together or this doesn't really make > much sense. So is this a balloning based on memory hotplug (aka active > memory hotadd initiated between guest and host automatically) or a guest > asking for additional memory by other means (pay more for memory etc.)? > Because if this is an administrative operation then I seriously question > this reasoning. It doesn't have to be user initiated action, have you heard about pay as you go phone plans, same thing use case applies to cloud environments where typically hotplug user initiated action on baremetal could be easily automated to hotplug on demand. > [...] > > > > - currently if one wants to use online_movable, > > > > one has to either > > > > * disable auto online in kernel OR > > > > > > which might not just work because an unmovable allocation could have > > > made the memblock pinned. > > > > With memhp_default_state=offline on kernel CLI there won't be any > > unmovable allocation as hotplugged memory won't be onlined and > > user can online it manually. So it works for non default usecase > > of playing with memory hot-unplug. > > I was talking about the case when the auto_online was true, of course, > e.g. depending on the config option which you've said is enabled in > Fedora kernels. > > [...] > > > > I'm in favor of implementing that in kernel as it keeps > > > > kernel internals inside kernel and doesn't need > > > > kernel API to be involved (memory blocks in sysfs, > > > > online_kernel, online_movable) > > > > There would be no need in userspace which would have to > > > > deal with kernel zoo and maintain that as well. > > > > > > The kernel is supposed to provide a proper API and that is sysfs > > > currently. 
I am not entirely happy about it either but pulling a lot of > > > code into the kernel is not the rigth thing to do. Especially when > > > different usecases require different treatment. > > > > If it could be done from kernel side alone, it looks like a better way > > to me not to involve userspace at all. And for ACPI based x86/ARM it's > > possible to implement without adding a lot of kernel code. > > But this is not how we do the kernel development. We provide the API so > that userspace can implement the appropriate policy on top. We do not > add random knobs to implement the same thing in the kernel. Different > users might want to implement different onlining strategies and that is > hardly describable by a single global knob. Just look at the s390 > example provided earlier. Please try to think out of your usecase scope. And could you think outside of legacy sysfs based onlining usecase scope? I don't think that S390 comparing with x86 is correct as platforms and hardware implementations of memory hotplug are different with correspondingly different requirements, hence CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE were introduced to allows platform specify behavior. For x86/ARM(+ACPI) it's possible to implement hotplug in race free way inside kernel without userspace intervention, onlining memory using hardware vendor defined policy (ACPI SRAT/Memory device describe memory sufficiently to do it) so user won't have to do it manually, config option is a convenient way to enable new feature for platforms that could support it. It's good to maintain uniform API to userspace as far as API does the job, but being stuck to legacy way isn't good when there is a way (even though it's limited to limited set of platforms) to improve it by removing need for API, making overall less complex and race-less (more reliable) system. > > That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE > > so we could continue on improving kernel only auto-onlining > > and fixing current memory hot(un)plug issues without affecting > > other platforms/users that are no interested in it. > > I really do not see any reason to keep the config option. Setting up > this to enabled is _wrong_ thing to do in general purpose > (distribution) kernel and a kernel for the specific usecase can achieve > the same thing via boot command line. I have to disagree with you that setting policy 'not online by default' in kernel is more valid than opposite policy 'online by default'. It maybe works for your usecases but it doesn't mean that it suits needs of others. As example RHEL distribution (x86) are shipped with memory autoonline enabled by default policy as it's what customers ask for. And onlining memory as removable considered as a specific usecase, since arguably a number of users where physical memory removal is supported is less than a number of users where just hot add is supported, plus single virt usecase adds huge userbase to the later as it's easily available/accessible versus baremetal hotplug. So default depends on target audience and distributions need a config option to pick default that suits its customers needs. If we don't provide reliably working memory hot-add solution customers will just move to OS that does (Windows or with your patch hyperv/xen based cloud instead of KVM/VMware. > > (PS: I don't care much about sysfs knob for setting auto-onlining, > > as kernel CLI override with memhp_default_state seems > > sufficient to me) > > That is good to hear! 
I would be OK with keeping the kernel command line > option until we resolve all the current issues with the hotplug. Your RFC doesn't fix anything except for cleaning up the config option, and even at that it does it inconsistently, breaking both userspaces: - one that does expect auto-online: a kernel update on Fedora will break memory hot-add (on KVM/VMware hosts) since userspace doesn't ship any scripts that would do it, but it will continue to work on hyperv/xen hosts. - another that doesn't expect auto-online: no change for KVM/VMware but suddenly hyperv/xen would start auto-onlining memory. So users would have broken VMs until the regression is noticed and would have to manually fix the userspace they use to accommodate the 'improved' kernel. This RFC, under the guise of cleanup, removes the default choice from distributions and actually cripples Linux guests on top of KVM/VMware hosts while favoring xen/hyperv hosts. IMHO it doesn't make any sense.
On 03/13/2017 05:19 AM, Michal Hocko wrote: > On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote: >> On 03/10/2017 08:58 AM, Michal Hocko wrote: > [...] >>> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated >>> [...] >>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff] >>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff] >>> [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff] >>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug >>> [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff] >>> [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff] >>> [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff] >>> [ 0.000000] Zone ranges: >>> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff] >>> [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff] >>> [ 0.000000] Normal empty >>> [ 0.000000] Movable zone start for each node >>> [ 0.000000] Early memory node ranges >>> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff] >>> [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff] >>> [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff] >>> >>> so there is neither any normal zone nor movable one at the boot time. >>> Then I hotplugged 1G slot >>> (qemu) object_add memory-backend-ram,id=mem1,size=1G >>> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 >>> >>> unfortunatelly the memory didn't show up automatically and I got >>> [ 116.375781] acpi PNP0C80:00: Enumeration failure >>> >>> so I had to probe it manually (prbably the BIOS my qemu uses doesn't >>> support auto probing - I haven't really dug further). Anyway the SRAT >>> table printed during the boot told that we should start at 0x100000000 >>> >>> # echo 0x100000000 > /sys/devices/system/memory/probe >>> # grep . /sys/devices/system/memory/memory32/valid_zones >>> Normal Movable >>> >>> which looks reasonably right? Both Normal and Movable zones are allowed >>> >>> # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe >>> # grep . /sys/devices/system/memory/memory3?/valid_zones >>> /sys/devices/system/memory/memory32/valid_zones:Normal >>> /sys/devices/system/memory/memory33/valid_zones:Normal Movable >>> >>> Huh, so our valid_zones have changed under our feet... >>> >>> # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe >>> # grep . /sys/devices/system/memory/memory3?/valid_zones >>> /sys/devices/system/memory/memory32/valid_zones:Normal >>> /sys/devices/system/memory/memory33/valid_zones:Normal >>> /sys/devices/system/memory/memory34/valid_zones:Normal Movable >>> >>> and again. So only the last memblock is considered movable. Let's try to >>> online them now. >>> >>> # echo online_movable > /sys/devices/system/memory/memory34/state >>> # grep . /sys/devices/system/memory/memory3?/valid_zones >>> /sys/devices/system/memory/memory32/valid_zones:Normal >>> /sys/devices/system/memory/memory33/valid_zones:Normal Movable >>> /sys/devices/system/memory/memory34/valid_zones:Movable Normal >>> >> >> I think there is no strong reason which kernel has the restriction. >> By setting the restrictions, it seems to have made management of >> these zone structs simple. > > Could you be more specific please? How could this make management any > easier when udev is basically racing with the physical hotplug and the > result is basically undefined? 
> When changing zone from NORMAL(N) to MOVABLE(M), we must resize both zones, zone->zone_start_pfn and zone->spanned_pages. Currently there is the restriction. So we just simply change: zone(N)->spanned_pages -= nr_pages zone(M)->zone_start_pfn -= nr_pages But if any memory block could change zone with no restriction, we would have to recalculate these zones' spanned_pages and zone_start_pfn as follows: memory section # 1 2 3 4 5 6 7 |N|M|N|N|N|M|M| | |N|N|N|N|N|M|M| * change memory section #2 from MOVABLE to NORMAL. Then we must find the next movable memory section (#6) and resize these zones. I think that when movable memory was implemented there was no requirement for this, so the kernel has the current restriction. Thanks, Yasuaki Ishimatsu
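For illustration only, the simple case described above (shrinking the tail of the Normal zone and growing the head of the Movable zone by the same nr_pages) boils down to something like the following hypothetical helper; this is a sketch written for this summary, not the kernel's actual implementation, which sits behind the zone_can_shift() check quoted later in the thread:

#include <linux/mmzone.h>

/*
 * Sketch: move the last nr_pages of ZONE_NORMAL to the beginning of an
 * adjacent ZONE_MOVABLE.  Only two zone boundaries have to move, which
 * is exactly why the current code can only online the last block of
 * the Normal zone as movable.
 */
static void shift_normal_tail_to_movable(struct zone *normal,
					 struct zone *movable,
					 unsigned long nr_pages)
{
	normal->spanned_pages -= nr_pages;
	movable->zone_start_pfn -= nr_pages;
	movable->spanned_pages += nr_pages;
}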
On Tue 14-03-17 12:05:59, YASUAKI ISHIMATSU wrote: > > > On 03/13/2017 05:19 AM, Michal Hocko wrote: > >On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote: > >>On 03/10/2017 08:58 AM, Michal Hocko wrote: [...] > >>># echo online_movable > /sys/devices/system/memory/memory34/state > >>># grep . /sys/devices/system/memory/memory3?/valid_zones > >>>/sys/devices/system/memory/memory32/valid_zones:Normal > >>>/sys/devices/system/memory/memory33/valid_zones:Normal Movable > >>>/sys/devices/system/memory/memory34/valid_zones:Movable Normal > >>> > >> > >>I think there is no strong reason which kernel has the restriction. > >>By setting the restrictions, it seems to have made management of > >>these zone structs simple. > > > >Could you be more specific please? How could this make management any > >easier when udev is basically racing with the physical hotplug and the > >result is basically undefined? > > > > When changing zone from NORMAL(N) to MOVALBE(M), we must resize both zones, > zone->zone_start_pfn and zone->spanned_pages. Currently there is the > restriction. > > So we just simply change: > zone(N)->spanned_pages -= nr_pages > zone(M)->zone_start_pfn -= nr_pages Yes I understand how this made the implementation simpler. I was questioning how this made user management any easier. Changing valid zones which races with the hotplug consumer (e.g. udev) sounds like a terrible idea to me. Anyway, it seems that the initial assumption/restriction that all pages have to start on the zone Normal is not really needed. I have a preliminary patch which removes that and associates newly added pages with a zone at the online time and it seems to be working reasonably well. I have to iron out some corners before I post it.
Hello, On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: > On Fri 10-03-17 13:00:37, Reza Arbab wrote: > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > > >OK, so while I was playing with this setup some more I probably got why > > >this is done this way. All new memblocks are added to the zone Normal > > >where they are accounted as spanned but not present. > > > > It's not always zone Normal. See zone_for_memory(). This leads to a > > workaround for having to do online_movable in descending block order. > > Instead of this: > > > > 1. probe block 34, probe block 33, probe block 32, ... > > 2. online_movable 34, online_movable 33, online_movable 32, ... > > > > you can online_movable the first block before adding the rest: > > I do I enforce that behavior when the probe happens automagically? What I provided as guide to online as movable in current and older kernels is: 1) Remove udev rule 2) After adding memory with qemu/libvirt API run in the guest ------- workaround start ---- #!/bin/bash for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; done ------- workaround end ---- That's how bad is onlining as movable for memory hotunplug. > > 1. probe block 32, online_movable 32 > > 2. probe block 33, probe block 34, ... > > - zone_for_memory() will cause these to start Movable > > 3. online 33, online 34, ... > > - they're already in Movable, so online_movable is equivalentr > > > > I agree with your general sentiment that this stuff is very nonintuitive. > > My criterion for nonintuitive is probably different because I would call > this _completely_unusable_. Sorry for being so loud about this but the > more I look into this area the more WTF code I see. This has seen close > to zero review and seems to be building up more single usecase code on > top of previous. We need to change this, seriously! It's not a bug, but when I found it I called it a "constraint" and when I debugged it (IIRC) it originated here: } else if (online_type == MMOP_ONLINE_MOVABLE) { if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift)) return -EINVAL; } Fixing it so you could online as movable even if it wasn't the last block was in my todo list but then we had other plans. Unfortunately unless pfn+nr_pages of the newly created Movable zone matches the end of the normal zone (or whatever was there that has to be converted to Movable), it will fail onlining as movable. To fix it is enough to check that everything from pfn to the end of whatever zone existed there (or the end of the node perhaps safer) can be converted as movable and just convert the whole thing as movable instead of stopping at pfn+nr_pages. Also note, onlining highmem physical ranges as movable requires yet another config option to be set for the below check to pass (CONFIG_MOVABLE_NODE=y), which I'm not exactly sure why anybody would want to set =n, and perhaps would be candidate for dropping well before considering to drop _DEFAULT_ONLINE and at best it should be replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y behavior. if ((zone_idx(zone) > ZONE_NORMAL || online_type == MMOP_ONLINE_MOVABLE) && !can_online_high_movable(zone)) return -EINVAL; I would suggest to drop the constraints in online_pages and perhaps also the CONFIG_MOVABLE_NODE option before going to drop the automatic onlining in kernel. 
Because of the above constraint the udev rule is unusable anyway for onlining memory as movable so that it can be hotunplugged reliably (well, not so reliably: page migration does a loop and tries many times, but it occurred to me that it once failed to offline and at the next try it worked; temporary page pins with O_DIRECT screw with page migration, rightfully so, and there is no easy fix). After the above constraint is fixed I suggest looking into fixing the async onlining by making the udev rule run synchronously. Adding 4T to an 8G guest is a perfectly reasonable and normal operation, no excuse for it to fail as long as you don't pretend such 4T will also be unpluggable later (which we won't). As I understand it, the whole point of _DEFAULT_ONLINE=y is precisely that it's easier to fix it in the kernel than to fix the udev rule. Furthermore the above constraint for the zone shifting which breaks online_movable in udev is not an issue for _DEFAULT_ONLINE=y because kernel onlining is synchronous by design (no special synchronous udev rule fix is required), so it can cope fine with the existing constraint by onlining as movable from the end of the last zone in each node so that the constraint never gets in the way. Extending _DEFAULT_ONLINE=y so that it can also online as movable has been worked on. On a side note, kernelcore=xxx passed to a kernel with _DEFAULT_ONLINE=y should already allow creating lots of hot-unpluggable memory onlined automatically as movable (never tested but I would expect it to work). After the udev rule can handle adding 4T to an 8G guest and it can handle onlining memory as movable reliably by just doing s/online/online_movable/, I think then we can restart the discussion about the removal of _DEFAULT_ONLINE=y. As far as I can tell, there are higher priority and more interesting things to fix in this area before _DEFAULT_ONLINE=y can be removed. Either udev gets fixed, if it is reasonably simple to fix (it will remain slower), or we should eventually obsolete the udev rule instead; so far the focus hasn't been on fixing the udev rule but on replacing it. To be clear, I'm not necessarily against eventually removing _DEFAULT_ONLINE, but an equally reliable and comparably fast alternative should be provided first and there's no such alternative right now. If s390 has special issues requiring an admin or a software hotplug manager app to enable blocks of memory by hand, _DEFAULT_ONLINE could then be an option selected or not selected by arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1 with the in-kernel feature. If the in-kernel automatic onlining is not workable on s390 I would assume the udev rule is also not workable. Thanks, Andrea
On Tue 14-03-17 14:20:14, Igor Mammedov wrote: > On Mon, 13 Mar 2017 13:28:25 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > > On Mon 13-03-17 11:55:54, Igor Mammedov wrote: > > > On Thu, 9 Mar 2017 13:54:00 +0100 > > > Michal Hocko <mhocko@kernel.org> wrote: [...] > > > > The kernel is supposed to provide a proper API and that is sysfs > > > > currently. I am not entirely happy about it either but pulling a lot of > > > > code into the kernel is not the rigth thing to do. Especially when > > > > different usecases require different treatment. > > > > > > If it could be done from kernel side alone, it looks like a better way > > > to me not to involve userspace at all. And for ACPI based x86/ARM it's > > > possible to implement without adding a lot of kernel code. > > > > But this is not how we do the kernel development. We provide the API so > > that userspace can implement the appropriate policy on top. We do not > > add random knobs to implement the same thing in the kernel. Different > > users might want to implement different onlining strategies and that is > > hardly describable by a single global knob. Just look at the s390 > > example provided earlier. Please try to think out of your usecase scope. > > And could you think outside of legacy sysfs based onlining usecase scope? Well, I always prefer a more generic solution which supports more usecases and I am trying really hard to understand usecases you are coming up with. So far I have heard that the current sysfs behavior is broken (which is true!) and some very vague arguments about why we need to online as quickly as possible to the point that userspace handling is an absolute no go. To be honest I still consider the later a non-issue. If the only thing you care about is the memory footprint of the first phase then I believe this is fixable. Memblock and section descriptors should be the only necessary thing to allocate and that is not much. As an aside, the more I think about the way the original authors separated the physical hotadd from onlining the more I appreciate that decision because the way how the memory can be online is definitely not carved in stone and evolves with usecases. I believe nobody expected that memory could be onlined as movable back then and I am pretty sure new ways will emerge over time. > I don't think that S390 comparing with x86 is correct as platforms > and hardware implementations of memory hotplug are different with > correspondingly different requirements, hence CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE > were introduced to allows platform specify behavior. There are different usecases which are arch agnostic. E.g. decide the movability based on some criterion (e.g. specific node, physical address range and what not). Global auto onlining cannot handle those for obvious reasons and a config option will not do achieve that for the same reason. > For x86/ARM(+ACPI) it's possible to implement hotplug in race free > way inside kernel without userspace intervention, onlining memory > using hardware vendor defined policy (ACPI SRAT/Memory device describe > memory sufficiently to do it) so user won't have to do it manually, > config option is a convenient way to enable new feature for platforms > that could support it. Sigh. Can you see the actual difference between the global kernel policy and the policy coming from the specific hardware (ACPI etc...)? I am not opposing auto onlining based on the ACPI attributes. But what we have now is a policy _in_the_kernel_. 
This is almost always a bad idea and I do not see any strong argument why it would be any different in this particular case. Actually your current default in Fedora makes it harder for anybody to use movable zones/nodes. > It's good to maintain uniform API to userspace as far as API does > the job, but being stuck to legacy way isn't good when > there is a way (even though it's limited to limited set of platforms) > to improve it by removing need for API, making overall less complex > and race-less (more reliable) system. then convince your virtualization platform to provide necessary data for the memory auto onlining via ACPI etc... > > > That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE > > > so we could continue on improving kernel only auto-onlining > > > and fixing current memory hot(un)plug issues without affecting > > > other platforms/users that are no interested in it. > > > > I really do not see any reason to keep the config option. Setting up > > this to enabled is _wrong_ thing to do in general purpose > > (distribution) kernel and a kernel for the specific usecase can achieve > > the same thing via boot command line. > > I have to disagree with you that setting policy 'not online by default' > in kernel is more valid than opposite policy 'online by default'. > It maybe works for your usecases but it doesn't mean that it suits > needs of others. Well, as described above there are good reasons to not hardwire any policy into the kernel because things tend to evolve and come with many surprising usecases original authors haven't anticipated at all. On the other hand we have your auto_online policy which handles _one_ particular class of usecases which I believe could have been addressed by enhancing the implementation of the current interface. E.g. allocate less memory in the initial phase, preemptive failing the first phase when there is too much memory waiting for onlining or even help udev to react faster by having preallocated workers to handle events. Instead, I suspect, you have chosen the path of the least resistance/effort and now we've ended up with a global policy with known limitations. I cannot say I would be happy about that. > As example RHEL distribution (x86) are shipped with memory > autoonline enabled by default policy as it's what customers ask for. > > And onlining memory as removable considered as a specific usecase, > since arguably a number of users where physical memory removal is > supported is less than a number of users where just hot add is > supported, plus single virt usecase adds huge userbase to > the later as it's easily available/accessible versus baremetal > hotplug. this might be the case now but might turn out to be a completely wrong thing to do in few years when overhyped^Wcloud workloads won't be all that interesting anymore. > So default depends on target audience and distributions need > a config option to pick default that suits its customers needs. Well, I would hope that such a thing could be achieved by more flexible means than the kernel config... E.g. pre-defined defaults that I can install as a package rather than enforcing a particular policy to everybody. > If we don't provide reliably working memory hot-add solution > customers will just move to OS that does (Windows or with your > patch hyperv/xen based cloud instead of KVM/VMware. 
> > > > (PS: I don't care much about sysfs knob for setting auto-onlining, > > > as kernel CLI override with memhp_default_state seems > > > sufficient to me) > > > > That is good to hear! I would be OK with keeping the kernel command line > > option until we resolve all the current issues with the hotplug. > > You RFC doesn't fix anything except of cleaning up config option, > and even at that is does it inconsistently breaking both userspaces > - one that does expect auto-online > kernel update on Fedora will break memory hot-add > (on KVM/VMware hosts) since userspace doesn't ship any > scripts that would do it but will continue to work on > hyperv/xen hosts. that is actually trivial to handle: provide a userspace fix while the kernel still offers the functionality, and remove the kernel functionality later. Nobody talks about removing the whole thing at once. API changes are not that simple at all. > - another that doesn't expect auto-online: > no change for KVM/VMware but suddenly hyperv/xen would > start auto-onlinig memory. I would argue that removing a policy which covers only some usecases counts as a fix, but whatever. We obviously disagree here... Anyway, I consider "never break the userspace" to be a hard rule and I do not want to break any usecase of course. I thought this RFC would help to trigger a constructive discussion with some reasonable outcome where we would get rid of the cruft eventually. It seems this will not be the case because getting immediate half-solutions is preferred much more than exhausting all the potential options these days. I am sorry, but I have to say I really hate the way this all sneaked in without a wider review, though. If this went through a proper review process it would get a straight NAK, from me at least, I believe.
On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote: > Hello, > > On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: > > On Fri 10-03-17 13:00:37, Reza Arbab wrote: > > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > > > >OK, so while I was playing with this setup some more I probably got why > > > >this is done this way. All new memblocks are added to the zone Normal > > > >where they are accounted as spanned but not present. > > > > > > It's not always zone Normal. See zone_for_memory(). This leads to a > > > workaround for having to do online_movable in descending block order. > > > Instead of this: > > > > > > 1. probe block 34, probe block 33, probe block 32, ... > > > 2. online_movable 34, online_movable 33, online_movable 32, ... > > > > > > you can online_movable the first block before adding the rest: > > > > I do I enforce that behavior when the probe happens automagically? > > What I provided as guide to online as movable in current and older > kernels is: > > 1) Remove udev rule > 2) After adding memory with qemu/libvirt API run in the guest > > ------- workaround start ---- > #!/bin/bash > for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; done > ------- workaround end ---- > > That's how bad is onlining as movable for memory hotunplug. Yeah, this is really yucky. Fortunately, I already have a working prototype which removes this restriction altogether. > > > 1. probe block 32, online_movable 32 > > > 2. probe block 33, probe block 34, ... > > > - zone_for_memory() will cause these to start Movable > > > 3. online 33, online 34, ... > > > - they're already in Movable, so online_movable is equivalentr > > > > > > I agree with your general sentiment that this stuff is very nonintuitive. > > > > My criterion for nonintuitive is probably different because I would call > > this _completely_unusable_. Sorry for being so loud about this but the > > more I look into this area the more WTF code I see. This has seen close > > to zero review and seems to be building up more single usecase code on > > top of previous. We need to change this, seriously! > > It's not a bug, but when I found it I called it a "constraint" and > when I debugged it (IIRC) it originated here: > > } else if (online_type == MMOP_ONLINE_MOVABLE) { > if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift)) > return -EINVAL; > } > > Fixing it so you could online as movable even if it wasn't the last > block was in my todo list but then we had other plans. > > Unfortunately unless pfn+nr_pages of the newly created Movable zone > matches the end of the normal zone (or whatever was there that has to > be converted to Movable), it will fail onlining as movable. Well, I think the main problem is that we associate pages added in the first phase (probe) to the Normal zone. The everything else is just a fallout and fiddling to make it work somehow. [...] > To be clear, I'm not necessarily against eventually removing > _DEFFAULT_ONLINE, but an equally reliable and comparably fast > alternative should be provided first and there's no such alternative > right now. As pointed in other reply that is not an intention here. I primarily wanted to understand the scope of the problem. I believe this config option was rushed into the kernel without considering other alternatives which would make the hotplug more usable by others as well. 
> If s390 has special issues requiring admin or a software hotplug > manager app to enable blocks of memory by hand, the _DEFFAULT_ONLINE > could be then an option selected or not selected by > arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1 > with the in-kernel feature. If the in-kernel automatic onlining is not > workable on s390 I would assume the udev rule is also not workable. But this is not about s390. It is about different usecases which require different onlining policy and that is the main problem of the hardcoded one in the kernel. See the other reply.
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 6b0d3ef7309c..2b1c35fb36d1 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (node < 0) node = memory_add_physaddr_to_nid(info->start_addr); - result = add_memory(node, info->start_addr, info->length); + result = add_memory(node, info->start_addr, info->length, false); /* * If the memory block has been used by the kernel, add_memory() diff --git a/drivers/base/memory.c b/drivers/base/memory.c index fa26ffd25fa6..476c2c02f938 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -446,37 +446,6 @@ print_block_size(struct device *dev, struct device_attribute *attr, static DEVICE_ATTR(block_size_bytes, 0444, print_block_size, NULL); /* - * Memory auto online policy. - */ - -static ssize_t -show_auto_online_blocks(struct device *dev, struct device_attribute *attr, - char *buf) -{ - if (memhp_auto_online) - return sprintf(buf, "online\n"); - else - return sprintf(buf, "offline\n"); -} - -static ssize_t -store_auto_online_blocks(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count) -{ - if (sysfs_streq(buf, "online")) - memhp_auto_online = true; - else if (sysfs_streq(buf, "offline")) - memhp_auto_online = false; - else - return -EINVAL; - - return count; -} - -static DEVICE_ATTR(auto_online_blocks, 0644, show_auto_online_blocks, - store_auto_online_blocks); - -/* * Some architectures will have custom drivers to do this, and * will not need to do it from userspace. The fake hot-add code * as well as ppc64 will do all of their discovery in userspace @@ -500,7 +469,7 @@ memory_probe_store(struct device *dev, struct device_attribute *attr, nid = memory_add_physaddr_to_nid(phys_addr); ret = add_memory(nid, phys_addr, - MIN_MEMORY_BLOCK_SIZE * sections_per_block); + MIN_MEMORY_BLOCK_SIZE * sections_per_block, false); if (ret) goto out; diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index 14c3dc4bd23c..3e052bedade5 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -535,11 +535,6 @@ struct hv_dynmem_device { bool host_specified_ha_region; /* - * State to synchronize hot-add. 
- */ - struct completion ol_waitevent; - bool ha_waiting; - /* * This thread handles hot-add * requests from the host as well as notifying * the host with regards to memory pressure in @@ -587,11 +582,6 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val, spin_lock_irqsave(&dm_device.ha_lock, flags); dm_device.num_pages_onlined += mem->nr_pages; spin_unlock_irqrestore(&dm_device.ha_lock, flags); - case MEM_CANCEL_ONLINE: - if (dm_device.ha_waiting) { - dm_device.ha_waiting = false; - complete(&dm_device.ol_waitevent); - } break; case MEM_OFFLINE: @@ -683,12 +673,9 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, has->covered_end_pfn += processed_pfn; spin_unlock_irqrestore(&dm_device.ha_lock, flags); - init_completion(&dm_device.ol_waitevent); - dm_device.ha_waiting = !memhp_auto_online; - nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); ret = add_memory(nid, PFN_PHYS((start_pfn)), - (HA_CHUNK << PAGE_SHIFT)); + (HA_CHUNK << PAGE_SHIFT), true); if (ret) { pr_warn("hot_add memory failed error is %d\n", ret); @@ -708,17 +695,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, spin_unlock_irqrestore(&dm_device.ha_lock, flags); break; } - - /* - * Wait for the memory block to be onlined when memory onlining - * is done outside of kernel (memhp_auto_online). Since the hot - * add has succeeded, it is ok to proceed even if the pages in - * the hot added region have not been "onlined" within the - * allowed time. - */ - if (dm_device.ha_waiting) - wait_for_completion_timeout(&dm_device.ol_waitevent, - 5*HZ); post_status(&dm_device); } diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c index b9c5522b8a68..f54c621195b6 100644 --- a/drivers/s390/char/sclp_cmd.c +++ b/drivers/s390/char/sclp_cmd.c @@ -404,7 +404,7 @@ static void __init add_memory_merged(u16 rn) if (!size) goto skip_add; for (addr = start; addr < start + size; addr += block_size) - add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size); + add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, false); skip_add: first_rn = rn; num = 1; diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index db107fa50ca1..fce961de8771 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -355,7 +355,7 @@ static enum bp_state reserve_additional_memory(void) * callers drop the mutex before trying again. 
*/ mutex_unlock(&balloon_mutex); - rc = add_memory_resource(nid, resource, memhp_auto_online); + rc = add_memory_resource(nid, resource, true); mutex_lock(&balloon_mutex); if (rc) { diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 134a2f69c21a..a72f7f64ee26 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page); extern int try_online_node(int nid); -extern bool memhp_auto_online; - #ifdef CONFIG_MEMORY_HOTREMOVE extern bool is_pageblock_removable_nolock(struct page *page); extern int arch_remove_memory(u64 start, u64 size); @@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {} extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, void *arg, int (*func)(struct memory_block *, void *)); -extern int add_memory(int nid, u64 start, u64 size); +extern int add_memory(int nid, u64 start, u64 size, bool online); extern int add_memory_resource(int nid, struct resource *resource, bool online); extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default, bool for_device); diff --git a/mm/Kconfig b/mm/Kconfig index 9b8fccb969dc..a64a3bca43d5 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,22 +193,6 @@ config MEMORY_HOTPLUG_SPARSE def_bool y depends on SPARSEMEM && MEMORY_HOTPLUG -config MEMORY_HOTPLUG_DEFAULT_ONLINE - bool "Online the newly added memory blocks by default" - default n - depends on MEMORY_HOTPLUG - help - This option sets the default policy setting for memory hotplug - onlining policy (/sys/devices/system/memory/auto_online_blocks) which - determines what happens to newly added memory regions. Policy setting - can always be changed at runtime. - See Documentation/memory-hotplug.txt for more information. - - Say Y here if you want all hot-plugged memory blocks to appear in - 'online' state by default. - Say N here if you want the default policy to keep all hot-plugged - memory blocks in 'offline' state. - config MEMORY_HOTREMOVE bool "Allow for memory hot remove" select MEMORY_ISOLATION diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c35dd1976574..8520c9166f47 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -78,24 +78,6 @@ static struct { #define memhp_lock_acquire() lock_map_acquire(&mem_hotplug.dep_map) #define memhp_lock_release() lock_map_release(&mem_hotplug.dep_map) -#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE -bool memhp_auto_online; -#else -bool memhp_auto_online = true; -#endif -EXPORT_SYMBOL_GPL(memhp_auto_online); - -static int __init setup_memhp_default_state(char *str) -{ - if (!strcmp(str, "online")) - memhp_auto_online = true; - else if (!strcmp(str, "offline")) - memhp_auto_online = false; - - return 1; -} -__setup("memhp_default_state=", setup_memhp_default_state); - void get_online_mems(void) { might_sleep(); @@ -1420,7 +1402,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online) } EXPORT_SYMBOL_GPL(add_memory_resource); -int __ref add_memory(int nid, u64 start, u64 size) +int __ref add_memory(int nid, u64 start, u64 size, bool online) { struct resource *res; int ret; @@ -1429,7 +1411,7 @@ int __ref add_memory(int nid, u64 start, u64 size) if (IS_ERR(res)) return PTR_ERR(res); - ret = add_memory_resource(nid, res, memhp_auto_online); + ret = add_memory_resource(nid, res, online); if (ret < 0) release_memory_resource(res); return ret;
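To restate the API change in caller terms: after this patch the onlining decision travels with the add_memory() call itself, so ballooning drivers that need the memory usable immediately pass true while the ACPI and sysfs probe paths pass false. A condensed, illustrative sketch of a caller modeled on the hv_balloon/xen hunks above (example_hot_add is a made-up name, not a function in the patch):

/* Sketch of a hot-add path after the proposed API change (illustration only). */
static int example_hot_add(int nid, u64 start_addr, u64 size)
{
	if (nid < 0)
		nid = memory_add_physaddr_to_nid(start_addr);

	/* online = true: make the range usable without waiting for userspace */
	return add_memory(nid, start_addr, size, true);
}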