
[RFC] mm, hotplug: get rid of auto_online_blocks

Message ID 20170227092817.23571-1-mhocko@kernel.org (mailing list archive)
State New, archived

Commit Message

Michal Hocko Feb. 27, 2017, 9:28 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

This knob was added by 31bc3858ea3e ("memory-hotplug: add automatic
onlining policy for the newly added memory") mainly to cover memory
hotplug based ballooning solutions currently implemented for Hyper-V
and Xen. Both of them want to online the memory as soon after
registering as possible, otherwise they can register too much memory
which cannot be used and trigger the OOM killer (we need ~1.5% of the
registered memory, so a large increase can consume all the available
memory). hv_mem_hot_add even waits for userspace to online the
memory if auto onlining is disabled to mitigate that problem.

Adding yet another knob and a config option just doesn't make much sense
IMHO. How is a random user supposed to know when to enable this option?
Ballooning drivers know much better that they want an immediate
online rather than waiting for userspace to do it. If the memory
is added for a different purpose then we already have a notification
for userspace and udev can handle the onlining. So neither the knob
nor the config option for the default behavior makes any
sense. Let's remove them and allow users of add_memory to request the
online status explicitly. Not only does this make more sense, it also
removes a lot of clutter.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
I am sending this as an RFC because this is a user-visible change. Maybe
we won't be able to remove the sysfs knob, which would be sad, especially
since it was added without a wider discussion and IMHO it is just
wrong. Is there any reason why a kernel command line parameter wouldn't
work just fine?

Even in that case I believe that we should remove the
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE knob. It just adds to an already
messy config space. Does anybody depend on the policy during early
boot, before userspace can set the sysfs knob? Or why can those users not
simply use the kernel command line parameter?
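
(For illustration, assuming the existing memhp_default_state= command
line parameter, the boot-time policy could be expressed as

  memhp_default_state=online

instead of relying on a Kconfig default.)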

I also believe that the wait-for-userspace in Hyper-V should just die. It
should do the unconditional onlining, same as Xen. I do not see any
reason why those should depend on userspace. This should just be
fixed regardless of the sysfs/config part. I can separate this out, of course.

Thoughts/Concerns?
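
For illustration only (not part of the patch text itself): the intent is
that a caller of add_memory states the onlining policy explicitly. Which
value a particular driver passes below is my assumption based on the
description above:

  /* proposed prototype */
  extern int add_memory(int nid, u64 start, u64 size, bool online);

  /* ballooning drivers (Hyper-V, Xen) would online immediately */
  add_memory(nid, start, size, true);

  /* ACPI / sclp hotplug would leave the memory offline for udev */
  add_memory(nid, start, size, false);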

 drivers/acpi/acpi_memhotplug.c |  2 +-
 drivers/base/memory.c          | 33 +--------------------------------
 drivers/hv/hv_balloon.c        | 26 +-------------------------
 drivers/s390/char/sclp_cmd.c   |  2 +-
 drivers/xen/balloon.c          |  2 +-
 include/linux/memory_hotplug.h |  4 +---
 mm/Kconfig                     | 16 ----------------
 mm/memory_hotplug.c            | 22 ++--------------------
 8 files changed, 8 insertions(+), 99 deletions(-)

Comments

Vitaly Kuznetsov Feb. 27, 2017, 10:02 a.m. UTC | #1
Michal Hocko <mhocko@kernel.org> writes:

> From: Michal Hocko <mhocko@suse.com>
>
> This knob has been added by 31bc3858ea3e ("memory-hotplug: add automatic
> onlining policy for the newly added memory") mainly to cover memory
> hotplug based balooning solutions currently implemented for HyperV
> and Xen. Both of them want to online the memory as soon after
> registering as possible otherwise they can register too much memory
> which cannot be used and trigger the oom killer (we need ~1.5% of the
> registered memory so a large increase can consume all the available
> memory). hv_mem_hot_add even waits for the userspace to online the
> memory if the auto onlining is disabled to mitigate that problem.
>
> Adding yet another knob and a config option just doesn't make much sense
> IMHO. How is a random user supposed to know when to enable this option?
> Ballooning drivers know much better that they want to do an immediate
> online rather than waiting for the userspace to do that. If the memory
> is onlined for a different purpose then we already have a notification
> for the userspace and udev can handle the onlining. So the knob as well
> as the config option for the default behavior just doesn't make any
> sense. Let's remove them and allow user of add_memory to request the
> online status explicitly. Not only it makes more sense it also removes a
> lot of clutter.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> I am sending this as an RFC because this is a user visible change. Maybe
> we won't be able to remove the sysfs knob which would be sad, especially
> when it has been added without a wider discussion and IMHO it is just
> wrong. Is there any reason why a kernel command line parameter wouldn't
> work just fine?
>
> Even in that case I believe that we should remove
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE knob. It just adds to an already
> messy config space. Does anybody depend on the policy during the early
> boot before the userspace can set the sysfs knob? Or why those users cannot
> simply use the kernel command line parameter.
>
> I also believe that the wait-for-userspace in hyperV should just die. It
> should do the unconditional onlining. Same as Xen. I do not see any
> reason why those should depend on the userspace. This should be just
> fixed regardless of the sysfs/config part. I can separate this out of course.
>
> Thoughts/Concerns?

I don't have anything new to add to the discussion that happened last week
but I'd like to summarize my arguments against this change:

1) This patch doesn't solve any issue. The configuration option is not an
issue by itself; it is an option for distros to decide what they want to
ship: a udev rule with known issues (legacy mode) or the new option
enabled. Distro makers and users building their kernels should be able to
answer the simple question "do you want to automatically online all
newly added memory or not". There are already distros which ship kernels
with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
far as I remember, maybe someone else).

2) This patch creates an imbalance between Xen/Hyper-V on one side and
KVM/VMware on the other. KVM/VMware use pure ACPI memory hotplug and this
memory won't get onlined. I don't understand how this problem is
supposed to be solved by distros. They'll *have to* continue shipping
a udev rule which has, and always will have, issues.

3) The kernel command line is not a viable choice; it is rather a debug
method. Having all newly added memory online as soon as possible is a
major use-case, not something only a couple of users want (and this is
proved by major distros shipping the unconditional 'offline->online'
udev rule).
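
For reference, that unconditional rule boils down to a one-liner along
these lines (the exact file and wording differ between distros, so treat
this as a sketch):

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"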

A couple of other thoughts:
1) Having all newly added memory online ASAP is probably what people
want for all virtual machines. Unfortunately, we have additional
complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some
cases manual intervention is required, especially when further unplug
is expected.

2) Adding new memory can (in some extreme cases) still fail as we need
some *other* memory before we're able to online the newly added
block. This is an issue to be solved and it is doable (IMO) with some
pre-allocation.

I'd also like to note that this patch doesn't re-introduce the issue I
was fixing with in-kernel memory onlining, as all memory added through
the Hyper-V driver will be auto-onlined unconditionally. What I disagree
with here is taking away the choice without fixing any real-world issues.

[snip]
Michal Hocko Feb. 27, 2017, 10:21 a.m. UTC | #2
On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
[...]
> I don't have anything new to add to the discussion happened last week
> but I'd like to summarize my arguments against this change:
> 
> 1) This patch doesn't solve any issue. Configuration option is not an
> issue by itself, it is an option for distros to decide what they want to
> ship: udev rule with known issues (legacy mode) or enable the new
> option. Distro makers and users building their kernels should be able to
> answer this simple question "do you want to automatically online all
> newly added memory or not".

OK, so could you be more specific? Distributions have no clue about
which HW their kernel runs on, so how can they possibly make a sensible
decision here?

> There are distros already which ship kernels
> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
> far as I remember, maybe someone else).
> 
> 2) This patch creates an imbalance between Xen/Hyper-V on one side and
> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
> memory won't get onlined. I don't understand how this problem is
> supposed to be solved by distros. They'll *have to* continue shipping
> a udev rule which has and always will have issues.

They have notifications for udev to online that memory, and AFAICU
neither KVM nor VMware are using memory hotplug for ballooning - unlike
Hyper-V and Xen.

> 3) Kernel command line is not a viable choice, it is rather a debug
> method.

Why?

> Having all newly added memory online as soon as possible is a
> major use-case not something a couple of users wants (and this is
> proved by major distros shipping the unconditional 'offline->online'
> rule with udev).

I would argue that this really depends on the use case: a) somebody
might want to online memory as movable, and that really depends on which
node we are talking about because not all of them can be movable, and b) it
is easier to handle potential errors from userspace than from the kernel.
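
To make a) concrete: onlining a block as movable from userspace is a
one-liner against the existing sysfs interface (memory32 below is just an
example block):

  echo online_movable > /sys/devices/system/memory/memory32/state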

> A couple of other thoughts:
> 1) Having all newly added memory online ASAP is probably what people
> want for all virtual machines. Unfortunately, we have additional
> complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some
> cases manual intervention is required. Especially, when further unplug
> is expected.

and that is why we do not want to hardwire this into the kernel and we
have a notification to handle this in userspace.

> 2) Adding new memory can (in some extreme cases) still fail as we need
> some *other* memory before we're able to online the newly added
> block. This is an issue to be solved and it is doable (IMO) with some
> pre-allocation.

you cannot preallocate for all the possible memory that can be added.
Vitaly Kuznetsov Feb. 27, 2017, 10:49 a.m. UTC | #3
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
> [...]
>> I don't have anything new to add to the discussion happened last week
>> but I'd like to summarize my arguments against this change:
>> 
>> 1) This patch doesn't solve any issue. Configuration option is not an
>> issue by itself, it is an option for distros to decide what they want to
>> ship: udev rule with known issues (legacy mode) or enable the new
>> option. Distro makers and users building their kernels should be able to
>> answer this simple question "do you want to automatically online all
>> newly added memory or not".
>
> OK, so could you be more specific? Distributions have no clue about
> which HW their kernel runs on so how can they possibly make a sensible
> decision here?

They at least have an idea whether they ship a udev rule or not. I can also
imagine different choices for non-x86 architectures, but I don't know
enough about them to have an opinion.

>
>> There are distros already which ship kernels
>> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
>> far as I remember, maybe someone else).
>> 
>> 2) This patch creates an imbalance between Xen/Hyper-V on one side and
>> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
>> memory won't get onlined. I don't understand how this problem is
>> supposed to be solved by distros. They'll *have to* continue shipping
>> a udev rule which has and always will have issues.
>
> They have notifications for udev to online that memory and AFAICU
> neither KVM nor VMware are using memory hotplut for ballooning - unlike
> HyperV and Xen.
>

No, Hyper-V doesn't use memory hotplug for ballooning purposes. It is
just memory hotplug. The fact that the code is located in hv_balloon
is just a coincidence.

The difference with real hardware is how the operation is performed:
with real hardware you need to take a DIMM, go to your server room, open
the box, insert the DIMM, and go back to your seat. Asking for some manual
action to actually enable memory is kinda OK. The beauty of hypervisors
is that everything happens automatically (e.g. when the VM is running
out of memory).

>> 3) Kernel command line is not a viable choice, it is rather a debug
>> method.
>
> Why?
>

Because we usually have just a few things there (root=, console=) and
the rest is used when something goes wrong or for 'special' cases, not
for the majority of users.

>> Having all newly added memory online as soon as possible is a
>> major use-case not something a couple of users wants (and this is
>> proved by major distros shipping the unconditional 'offline->online'
>> rule with udev).
>
> I would argue because this really depends on the usecase. a) somebody
> might want to online memory as movable and that really depends on which
> node we are talking about because not all of them can be movable

This is possible and that's why I introduced the kernel command line option
back then. To simplify, I argue that the major use-case is 'online ASAP,
never offline' and for other use-cases we have options, both for distros
(config) and for users (command line).


> b) it
> is easier to handle potential errors from userspace than the kernel.
>

Yes, probably, but memory hotplug has been around for quite some time and I
didn't see anything but the dumb udev rule (offline->online) without any
handling. And I think that we should rather focus on fixing potential
issues and making failures less probable (e.g. it's really hard to come
up with something different from 'failed->retry').

>> A couple of other thoughts:
>> 1) Having all newly added memory online ASAP is probably what people
>> want for all virtual machines. Unfortunately, we have additional
>> complexity with memory zones (ZONE_NORMAL, ZONE_MOVABLE) and in some
>> cases manual intervention is required. Especially, when further unplug
>> is expected.
>
> and that is why we do not want to hardwire this into the kernel and we
> have a notification to handle this in userspace.

Yes, and I don't know about any plans to remove this notification. In
case some really complex handling is required, just don't turn on the
automatic onlining.

Memory hotplug in real x86 hardware is rare, memory hotplug for VMs is
ubiquitous.

>
>> 2) Adding new memory can (in some extreme cases) still fail as we need
>> some *other* memory before we're able to online the newly added
>> block. This is an issue to be solved and it is doable (IMO) with some
>> pre-allocation.
>
> you cannot preallocate for all the possible memory that can be added.

For all, no, but for the next block, yes - and then I'll preallocate for
the one after that.
Heiko Carstens Feb. 27, 2017, 11:25 a.m. UTC | #4
On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> A couple of other thoughts:
> 1) Having all newly added memory online ASAP is probably what people
> want for all virtual machines.

This is not true for s390. On s390 we have "standby" memory that a guest
sees and potentially may use if it sets it online. Every guest that sets
memory offline contributes to the hypervisor's standby memory pool, while
onlining standby memory takes memory away from the standby pool.

The use-case is that a system administrator in advance knows the maximum
size a guest will ever have and also defines how much memory should be used
at boot time. The difference is standby memory.

Auto-onlining of standby memory is the last thing we want.

> Unfortunately, we have additional complexity with memory zones
> (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> required. Especially, when further unplug is expected.

This is also a reason why auto-onlining doesn't seem to be the best way.
Vitaly Kuznetsov Feb. 27, 2017, 11:50 a.m. UTC | #5
Heiko Carstens <heiko.carstens@de.ibm.com> writes:

> On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
>> A couple of other thoughts:
>> 1) Having all newly added memory online ASAP is probably what people
>> want for all virtual machines.

Sorry, I obviously missed 'x86' in the above statement.

>
> This is not true for s390. On s390 we have "standby" memory that a guest
> sees and potentially may use if it sets it online. Every guest that sets
> memory offline contributes to the hypervisor's standby memory pool, while
> onlining standby memory takes memory away from the standby pool.
>
> The use-case is that a system administrator in advance knows the maximum
> size a guest will ever have and also defines how much memory should be used
> at boot time. The difference is standby memory.
>
> Auto-onlining of standby memory is the last thing we want.
>

This is actually a very good example of why we need the config
option. s390 kernels will have it disabled.
Michal Hocko Feb. 27, 2017, 12:56 p.m. UTC | #6
On Mon 27-02-17 11:49:43, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
> > [...]
> >> I don't have anything new to add to the discussion happened last week
> >> but I'd like to summarize my arguments against this change:
> >> 
> >> 1) This patch doesn't solve any issue. Configuration option is not an
> >> issue by itself, it is an option for distros to decide what they want to
> >> ship: udev rule with known issues (legacy mode) or enable the new
> >> option. Distro makers and users building their kernels should be able to
> >> answer this simple question "do you want to automatically online all
> >> newly added memory or not".
> >
> > OK, so could you be more specific? Distributions have no clue about
> > which HW their kernel runs on so how can they possibly make a sensible
> > decision here?
> 
> They at least have an idea if they ship udev rule or not. I can also
> imagine different choices for non-x86 architectures but I don't know
> enough about them to have an opinion.

I really do not follow. If they know whether they ship the udev rule
then why do they need kernel help at all? Anyway, this global policy
actually breaks some use cases. Say you have the default set to
online. What should a user do if _some_ nodes should be online_movable?
Or say that Hyper-V or another hotplug-based ballooning implementation
really wants to online the movable memory in order to have a reliable
hotremove. Now you have a global policy which goes against it.

> >> There are distros already which ship kernels
> >> with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE enabled (Fedora 24 and 25 as
> >> far as I remember, maybe someone else).
> >> 
> >> 2) This patch creates an imbalance between Xen/Hyper-V on one side and
> >> KVM/Vmware on another. KVM/Vmware use pure ACPI memory hotplug and this
> >> memory won't get onlined. I don't understand how this problem is
> >> supposed to be solved by distros. They'll *have to* continue shipping
> >> a udev rule which has and always will have issues.
> >
> > They have notifications for udev to online that memory and AFAICU
> > neither KVM nor VMware are using memory hotplut for ballooning - unlike
> > HyperV and Xen.
> >
> 
> No, Hyper-V doesn't use memory hotplug for ballooning purposes. It is
> just a memory hotplug. The fact that the code is located in hv_balloon
> is just a coincidence.

OK, I might be wrong here but 1cac8cd4d146 ("Drivers: hv: balloon:
Implement hot-add functionality") suggests otherwise.

> The difference with real hardware is how the operation is performed:
> with real hardware you need to take a DIMM, go to your server room, open
> the box, insert DIMM, go back to your seat. Asking to do some manual
> action to actually enable memory is kinda OK. The beauty of hypervisors
> is that everything happens automatically (e.g. when the VM is running
> out of memory).

I do not see your point. Either you have some (semi)automatic way to
balance memory in the guest based on the memory pressure (let's call it
ballooning) or this is an administration operation (say you buy more
DIMMs or pay more to your virtualization provider) and then it is up to
the guest owner to tell what to do about that memory. In other words, in
the first case you really do not want to wait, as you are under memory
pressure which is _actively_ managed; otherwise this is a much more
relaxed environment.
 
> >> 3) Kernel command line is not a viable choice, it is rather a debug
> >> method.
> >
> > Why?
> >
> 
> Because we usually have just a few things there (root=, console=) and
> the rest is used when something goes wrong or for 'special' cases, not
> for the majority of users.

Auto online or even memory hotplug already seems like something that
requires special HW/configuration, so I fail to see your point. It is
normal to use kernel parameters to override the default. And AFAIU
default offline is a sensible default for standard memory hotplug.

[...]

> >> 2) Adding new memory can (in some extreme cases) still fail as we need
> >> some *other* memory before we're able to online the newly added
> >> block. This is an issue to be solved and it is doable (IMO) with some
> >> pre-allocation.
> >
> > you cannot preallocate for all the possible memory that can be added.
> 
> For all, no, but for 1 next block - yes, and then I'll preallocate for
> the next one.

You are still thinking in the scope of your particular use case, and I
believe the whole thing is shaped around that very same thing, which is
why it should have been rejected in the first place - especially when
that use case can be handled without a user-visible configuration knob.
Vitaly Kuznetsov Feb. 27, 2017, 1:17 p.m. UTC | #7
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 27-02-17 11:49:43, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Mon 27-02-17 11:02:09, Vitaly Kuznetsov wrote:
>> > [...]
>> >> I don't have anything new to add to the discussion happened last week
>> >> but I'd like to summarize my arguments against this change:
>> >> 
>> >> 1) This patch doesn't solve any issue. Configuration option is not an
>> >> issue by itself, it is an option for distros to decide what they want to
>> >> ship: udev rule with known issues (legacy mode) or enable the new
>> >> option. Distro makers and users building their kernels should be able to
>> >> answer this simple question "do you want to automatically online all
>> >> newly added memory or not".
>> >
>> > OK, so could you be more specific? Distributions have no clue about
>> > which HW their kernel runs on so how can they possibly make a sensible
>> > decision here?
>> 
>> They at least have an idea if they ship udev rule or not. I can also
>> imagine different choices for non-x86 architectures but I don't know
>> enough about them to have an opinion.
>
> I really do not follow. If they know whether they ship the udev rule
> then why do they need a kernel help at all? Anyway this global policy
> actually breaks some usecases. Say you would have a default set to
> online. What should user do if _some_ nodes should be online_movable?
> Or, say that HyperV or other hotplug based ballooning implementation
> really want to online the movable memory in order to have a realiable
> hotremove. Now you have a global policy which goes against it.
>

While I think that hotremove is a special case which really requires
manual intervention (at least to decide which memory goes NORMAL and
which MOVABLE), MEMORY_HOTPLUG_DEFAULT_ONLINE is probably not for it.

[snip]

>
>> The difference with real hardware is how the operation is performed:
>> with real hardware you need to take a DIMM, go to your server room, open
>> the box, insert DIMM, go back to your seat. Asking to do some manual
>> action to actually enable memory is kinda OK. The beauty of hypervisors
>> is that everything happens automatically (e.g. when the VM is running
>> out of memory).
>
> I do not see your point. Either you have some (semi)automatic way to
> balance memory in guest based on the memory pressure (let's call it
> ballooning) or this is an administration operation (say you buy more
> DIMs or pay more to your virtualization provider) and then it is up to
> the guest owner to tell what to do about that memory. In other words you
> really do not want to wait in the first case as you are under memory
> pressure which is _actively_ managed or this is much more relaxed
> environment.

I don't see a contradiction between what I say and what you say here :-)
Yes, there are cases when we're not in a hurry and there are cases when
we can't wait.

>
>> >> 3) Kernel command line is not a viable choice, it is rather a debug
>> >> method.
>> >
>> > Why?
>> >
>> 
>> Because we usually have just a few things there (root=, console=) and
>> the rest is used when something goes wrong or for 'special' cases, not
>> for the majority of users.
>
> auto online or even memory hotplug seems something that requires
> a special HW/configuration already so I fail to see your point. It is
> normal to put kernel parameters to override the default. And AFAIU
> default offline is a sensible default for the standard memory hotplug.
>

It depends on how we define 'standard'. The point I'm trying to make is
that it's really common for VMs to use this technique, while in the
hardware (x86) world it is a rare occasion. The 'sensible default' may differ.

> [...]
>
>> >> 2) Adding new memory can (in some extreme cases) still fail as we need
>> >> some *other* memory before we're able to online the newly added
>> >> block. This is an issue to be solved and it is doable (IMO) with some
>> >> pre-allocation.
>> >
>> > you cannot preallocate for all the possible memory that can be added.
>> 
>> For all, no, but for 1 next block - yes, and then I'll preallocate for
>> the next one.
>
> You are still thinking in the scope of your particular use case and I
> believe the whole thing is shaped around that very same thing and that
> is why it should have been rejected in the first place. Especially when
> that use case can be handled without user visible configuration knob.

I think my use case is broad enough. At least it applies to all
virtualization technologies and not only to Hyper-V. But yes, I agree
that adding a parameter to add_memory() solves my particular use case as
well.
Michal Hocko Feb. 27, 2017, 3:43 p.m. UTC | #8
On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> > A couple of other thoughts:
> > 1) Having all newly added memory online ASAP is probably what people
> > want for all virtual machines.
> 
> This is not true for s390. On s390 we have "standby" memory that a guest
> sees and potentially may use if it sets it online. Every guest that sets
> memory offline contributes to the hypervisor's standby memory pool, while
> onlining standby memory takes memory away from the standby pool.
> 
> The use-case is that a system administrator in advance knows the maximum
> size a guest will ever have and also defines how much memory should be used
> at boot time. The difference is standby memory.
> 
> Auto-onlining of standby memory is the last thing we want.
> 
> > Unfortunately, we have additional complexity with memory zones
> > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > required. Especially, when further unplug is expected.
> 
> This also is a reason why auto-onlining doesn't seem be the best way.

Can you imagine any situation when somebody actually might want to have
this knob enabled? From what I understand it doesn't seem to be the
case.
Reza Arbab Feb. 27, 2017, 5:28 p.m. UTC | #9
On Mon, Feb 27, 2017 at 10:28:17AM +0100, Michal Hocko wrote: 
>diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>index 134a2f69c21a..a72f7f64ee26 100644
>--- a/include/linux/memory_hotplug.h
>+++ b/include/linux/memory_hotplug.h
>@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page);
>
> extern int try_online_node(int nid);
>
>-extern bool memhp_auto_online;
>-
> #ifdef CONFIG_MEMORY_HOTREMOVE
> extern bool is_pageblock_removable_nolock(struct page *page);
> extern int arch_remove_memory(u64 start, u64 size);
>@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
>
> extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> 		void *arg, int (*func)(struct memory_block *, void *));
>-extern int add_memory(int nid, u64 start, u64 size);
>+extern int add_memory(int nid, u64 start, u64 size, bool online);
> extern int add_memory_resource(int nid, struct resource *resource, bool online);
> extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
> 		bool for_device);

It would be nice if instead of a 'bool online' argument, add_memory() 
and add_memory_resource() took an 'int online_type', ala online_pages().

That way we could specify offline, online, online+movable, etc.
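
A rough sketch of what that could look like, reusing the existing MMOP_*
online types (the callers below are hypothetical):

  extern int add_memory(int nid, u64 start, u64 size, int online_type);

  add_memory(nid, start, size, MMOP_OFFLINE);         /* leave offline for userspace */
  add_memory(nid, start, size, MMOP_ONLINE_KEEP);     /* online into the default zone */
  add_memory(nid, start, size, MMOP_ONLINE_MOVABLE);  /* online as movable */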
Michal Hocko Feb. 27, 2017, 5:34 p.m. UTC | #10
On Mon 27-02-17 11:28:52, Reza Arbab wrote:
> On Mon, Feb 27, 2017 at 10:28:17AM +0100, Michal Hocko wrote:
> >diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >index 134a2f69c21a..a72f7f64ee26 100644
> >--- a/include/linux/memory_hotplug.h
> >+++ b/include/linux/memory_hotplug.h
> >@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page);
> >
> >extern int try_online_node(int nid);
> >
> >-extern bool memhp_auto_online;
> >-
> >#ifdef CONFIG_MEMORY_HOTREMOVE
> >extern bool is_pageblock_removable_nolock(struct page *page);
> >extern int arch_remove_memory(u64 start, u64 size);
> >@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
> >
> >extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> >		void *arg, int (*func)(struct memory_block *, void *));
> >-extern int add_memory(int nid, u64 start, u64 size);
> >+extern int add_memory(int nid, u64 start, u64 size, bool online);
> >extern int add_memory_resource(int nid, struct resource *resource, bool online);
> >extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
> >		bool for_device);
> 
> It would be nice if instead of a 'bool online' argument, add_memory() and
> add_memory_resource() took an 'int online_type', ala online_pages().
> 
> That way we could specify offline, online, online+movable, etc.

Sure, that would require more changes though, and as such it is out of
the scope of this patch. But you are right, this is a logical follow-up
step.
Heiko Carstens Feb. 28, 2017, 10:21 a.m. UTC | #11
On Mon, Feb 27, 2017 at 04:43:04PM +0100, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.
> > 
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> > 
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> > 
> > Auto-onlining of standby memory is the last thing we want.
> > 
> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.
> > 
> > This also is a reason why auto-onlining doesn't seem be the best way.
> 
> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.

I can only speak for s390, and at least here I think auto-online is always
wrong, especially if you consider the added complexity that you may want to
online memory sometimes to ZONE_NORMAL and sometimes to ZONE_MOVABLE.
Igor Mammedov March 2, 2017, 1:53 p.m. UTC | #12
On Mon 27-02-17 16:43:04, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:  
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.  
> > 
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> > 
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> > 
> > Auto-onlining of standby memory is the last thing we want.
I don't know much about anything other than x86, so all comments
below are from that point of view;
architectures that don't need auto-online can keep the current default.

> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.  
> > 
> > This also is a reason why auto-onlining doesn't seem be the best way.  

When trying to support memory unplug on the guest side in RHEL7,
experience shows otherwise. A simplistic udev rule which onlines an
added block doesn't work in case one wants to online it as movable.

Hotplugged blocks in the current kernel should be onlined in reverse
order to online blocks as movable, depending on the adjacent blocks' zone.
Which means a simple udev rule isn't usable, since it gets events in
first-to-last hotplugged block order. So now we would have
to write a daemon that would
 - watch for all blocks in hotplugged memory to appear (how would it know?)
 - online them in the right order (the order might also be different
   depending on the kernel version)
   -- it becomes even more complicated in the NUMA case, when there are
      multiple zones and the kernel would have to provide user-space
      with information about zone maps

In short, current experience shows that the userspace approach
 - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
   fast and/or under memory pressure, when udev or something else
   might be killed)
 - doesn't reduce overall system complexity; it only gets worse,
   as the user-space handler needs to know a lot about kernel internals
   and implementation details/kernel versions to work properly

It might not be easy, but doing the onlining in the kernel, on the other hand, is:
 - faster
 - more reliable (can't be killed under memory pressure)
 - the kernel has access to all the info needed for onlining and knows how
   its internals work, so it can implement auto-online correctly
 - since there is no need to maintain an ABI for user-space
   (zone layout/ordering/maybe something else), the kernel is
   free to change its internal implementation without breaking userspace
   (currently hotplug+unplug doesn't work reliably and we might
    need something more flexible than zones)
    That's the direction of research in progress, i.e. making the kernel
    implementation better instead of putting the responsibility on
    user-space to deal with kernel shortcomings.

> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.
For x86:
 * this config option is enabled by default in recent Fedora,
 * RHEL6 has shipped similar downstream patches doing the same thing for years
 * RHEL7 has a udev rule (because there wasn't a kernel-side solution at
   fork time) that auto-onlines memory unconditionally; Vitaly might
   backport the kernel option later when he has time.
It's not the Linux kernel, but an auto-online policy is also used by
Windows, both in bare-metal and guest configurations.

That somewhat shows that the current upstream defaults on x86
might not be what end-users wish for.

When auto_online_blocks was introduced, Vitaly was
conservative and left the current upstream defaults where they were,
lest it break someone else's setup, while allowing downstreams to
set their own auto-online policy; eventually we might switch it
upstream too.
Michal Hocko March 2, 2017, 2:28 p.m. UTC | #13
On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]
> When trying to support memory unplug on guest side in RHEL7,
> experience shows otherwise. Simplistic udev rule which onlines
> added block doesn't work in case one wants to online it as movable.
> 
> Hotplugged blocks in current kernel should be onlined in reverse
> order to online blocks as movable depending on adjacent blocks zone.

Could you be more specific please? Setting online_movable from the udev
rule should just work regardless of the ordering or the state of other
memblocks. If that doesn't work I would call it a bug.

> Which means simple udev rule isn't usable since it gets event from
> the first to the last hotplugged block order. So now we would have
> to write a daemon that would
>  - watch for all blocks in hotplugged memory appear (how would it know)
>  - online them in right order (order might also be different depending
>    on kernel version)
>    -- it becomes even more complicated in NUMA case when there are
>       multiple zones and kernel would have to provide user-space
>       with information about zone maps
> 
> In short current experience shows that userspace approach
>  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
>    fast and/or under memory pressure) when udev (or something else
>    might be killed)

yeah and that is why the patch does the onlining from the kernel.
 
> > Can you imagine any situation when somebody actually might want to have
> > this knob enabled? From what I understand it doesn't seem to be the
> > case.
> For x86:
>  * this config option is enabled by default in recent Fedora,

How do you want to support use cases which really want to online memory
as movable? Do you expect those users to disable the option? Because
unless I am missing something, the in-kernel auto onlining only supports
regular onlining.
Igor Mammedov March 2, 2017, 5:03 p.m. UTC | #14
On Thu, 2 Mar 2017 15:28:16 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> [...]
> > When trying to support memory unplug on guest side in RHEL7,
> > experience shows otherwise. Simplistic udev rule which onlines
> > added block doesn't work in case one wants to online it as movable.
> > 
> > Hotplugged blocks in current kernel should be onlined in reverse
> > order to online blocks as movable depending on adjacent blocks zone.  
> 
> Could you be more specific please? Setting online_movable from the udev
> rule should just work regardless of the ordering or the state of other
> memblocks. If that doesn't work I would call it a bug.
It's rather an implementation constraint than a bug;
for details and a workaround patch see
 [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
The patch attached there is limited by another memory hotplug
issue, namely NORMAL/MOVABLE zone balance: if the kernel runs
on a configuration where most of the memory is hot-removable, the
kernel might experience a lack of memory in zone NORMAL.

> 
> > Which means simple udev rule isn't usable since it gets event from
> > the first to the last hotplugged block order. So now we would have
> > to write a daemon that would
> >  - watch for all blocks in hotplugged memory appear (how would it know)
> >  - online them in right order (order might also be different depending
> >    on kernel version)
> >    -- it becomes even more complicated in NUMA case when there are
> >       multiple zones and kernel would have to provide user-space
> >       with information about zone maps
> > 
> > In short current experience shows that userspace approach
> >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> >    fast and/or under memory pressure) when udev (or something else
> >    might be killed)  
> 
> yeah and that is why the patch does the onlining from the kernel.
Onlining in this patch is limited to Hyper-V, and the patch breaks
auto-online on x86 KVM/VMware/bare metal as they reuse the same
hotplug path.

> > > Can you imagine any situation when somebody actually might want to have
> > > this knob enabled? From what I understand it doesn't seem to be the
> > > case.  
> > For x86:
> >  * this config option is enabled by default in recent Fedora,  
> 
> How do you want to support usecases which really want to online memory
> as movable? Do you expect those users to disable the option because
> unless I am missing something the in kernel auto onlining only supporst
> regular onlining.
The current auto-onlining config option does what it's been designed for,
i.e. it onlines hotplugged memory.
It's possible for a non-average Fedora user to override the default
(commit 86dd995d6) if she/he needs non-default behavior
(i.e. the user knows how to online manually and/or can write
a daemon that would handle all the nuances of the kernel in use).

For the rest, when Fedora is used in the cloud and the user increases
memory via the management interface of whatever cloud she/he uses, it
just works.

So it's the distribution's choice to pick its own default that makes the
majority of its user base happy, and this patch removes that choice
without taking this into consideration.

How to online memory is a different issue not related to this patch;
the current default of onlining as ZONE_NORMAL works well for scaling
up VMs.

Memory unplug is rather new and doesn't work reliably so far;
moving onlining to user-space won't really help. Further work
needs to be done so that it works reliably.

Now about the question of onlining removable memory as movable:
the x86 kernel is able to get info from the ACPI subsystem about
whether hot-added memory is removable and can online it as movable
without any intervention from user-space, where doing so is hard,
as patch [1] shows.
The problem is still being researched, and when we figure out how to fix
the hot-remove issues we might enable auto-onlining by default for x86.
Michal Hocko March 3, 2017, 8:27 a.m. UTC | #15
On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> On Thu, 2 Mar 2017 15:28:16 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > [...]
> > > When trying to support memory unplug on guest side in RHEL7,
> > > experience shows otherwise. Simplistic udev rule which onlines
> > > added block doesn't work in case one wants to online it as movable.
> > > 
> > > Hotplugged blocks in current kernel should be onlined in reverse
> > > order to online blocks as movable depending on adjacent blocks zone.  
> > 
> > Could you be more specific please? Setting online_movable from the udev
> > rule should just work regardless of the ordering or the state of other
> > memblocks. If that doesn't work I would call it a bug.
> It's rather an implementation constrain than a bug
> for details and workaround patch see
>  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

"You are not authorized to access bug #1314306"

could you paste the reasoning here please?

> patch attached there is limited by another memory hotplug
> issue, which is NORMAL/MOVABLE zone balance, if kernel runs
> on configuration where the most of memory is hot-removable
> kernel might experience lack of memory in zone NORMAL.

yes and that is an inherent problem of movable memory.

> > > Which means simple udev rule isn't usable since it gets event from
> > > the first to the last hotplugged block order. So now we would have
> > > to write a daemon that would
> > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > >  - online them in right order (order might also be different depending
> > >    on kernel version)
> > >    -- it becomes even more complicated in NUMA case when there are
> > >       multiple zones and kernel would have to provide user-space
> > >       with information about zone maps
> > > 
> > > In short current experience shows that userspace approach
> > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > >    fast and/or under memory pressure) when udev (or something else
> > >    might be killed)  
> > 
> > yeah and that is why the patch does the onlining from the kernel.
> onlining in this patch is limited to hyperv and patch breaks
> auto-online on x86 kvm/vmware/baremetal as they reuse the same
> hotplug path.

Those can use the udev or do you see any reason why they couldn't?

> > > > Can you imagine any situation when somebody actually might want to have
> > > > this knob enabled? From what I understand it doesn't seem to be the
> > > > case.  
> > > For x86:
> > >  * this config option is enabled by default in recent Fedora,  
> > 
> > How do you want to support usecases which really want to online memory
> > as movable? Do you expect those users to disable the option because
> > unless I am missing something the in kernel auto onlining only supporst
> > regular onlining.
>
> current auto onlining config option does what it's been designed for,
> i.e. it onlines hotplugged memory.
> It's possible for non average Fedora user to override default
> (commit 86dd995d6) if she/he needs non default behavior
> (i.e. user knows how to online manually and/or can write
> a daemon that would handle all of nuances of kernel in use).
> 
> For the rest when Fedora is used in cloud and user increases memory
> via management interface of whatever cloud she/he uses, it just works.
> 
> So it's choice of distribution to pick its own default that makes
> majority of user-base happy and this patch removes it without taking
> that in consideration.

You can still have a udev rule to achieve the same thing for
non-ballooning based hotplug.

> How to online memory is different issue not related to this patch,
> current default onlining as ZONE_NORMAL works well for scaling
> up VMs.
> 
> Memory unplug is rather new and it doesn't work reliably so far,
> moving onlining to user-space won't really help. Further work
> is need to be done so that it would work reliably.

The main problem I have with this is that this is a limited,
use-case-driven configuration knob which doesn't work properly for other
use cases (namely movable onlining once your distribution chooses to set
the config option to auto online). There is a userspace solution for this,
so this shouldn't have been merged in the first place! It sneaked past a
proper review process (linux-api wasn't CCed to get broader attention),
which is really sad.

So unless this causes a major regression which would be hard to fix I
will submit the patch for inclusion.
Igor Mammedov March 3, 2017, 5:34 p.m. UTC | #16
On Fri, 3 Mar 2017 09:27:23 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > On Thu, 2 Mar 2017 15:28:16 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> >   
> > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > [...]  
> > > > When trying to support memory unplug on guest side in RHEL7,
> > > > experience shows otherwise. Simplistic udev rule which onlines
> > > > added block doesn't work in case one wants to online it as movable.
> > > > 
> > > > Hotplugged blocks in current kernel should be onlined in reverse
> > > > order to online blocks as movable depending on adjacent blocks zone.    
> > > 
> > > Could you be more specific please? Setting online_movable from the udev
> > > rule should just work regardless of the ordering or the state of other
> > > memblocks. If that doesn't work I would call it a bug.  
> > It's rather an implementation constrain than a bug
> > for details and workaround patch see
> >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> 
> "You are not authorized to access bug #1314306"
Sorry,
I've made it public; related comments and the patch should be accessible now
(code snippets in the BZ are based on an older kernel, but the logic is
still the same upstream).
 
> could you paste the reasoning here please?
Sure, here is a reproducer.
Start a VM with the following CLI:
  qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
  -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
  /path/to/guest_image

Then in the guest, dimm1 blocks are 32-39:

  echo online_movable > /sys/devices/system/memory/memory32/state
-bash: echo: write error: Invalid argument

In the current mainline kernel it triggers the following code path:

online_pages()
  ...
       if (online_type == MMOP_ONLINE_KERNEL) {                                 
                if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
                        return -EINVAL;

  zone_can_shift()
    ...
        if (idx < target) {                                                      
                /* pages must be at end of current zone */                       
                if (pfn + nr_pages != zone_end_pfn(zone))                        
                        return false;            

since we are trying to online as movable not the last section in
ZONE_NORMAL.

Here is what makes hotplugged memory end up in ZONE_NORMAL:
 acpi_memory_enable_device() -> add_memory -> add_memory_resource ->
   -> arch/x86/mm/init_64.c  

     /*
      * Memory is added always to NORMAL zone. This means you will never get
      * additional DMA/DMA32 memory.
      */
     int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
     {
        ...
        struct zone *zone = pgdat->node_zones +
                zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);

i.e. all hot-plugged memory modules always go to ZONE_NORMAL
and only the first/last block in a zone is allowed to be moved
to another zone. Patch [1] tries to fix the issue by assigning
the removable memory resource to the movable zone, so hotplugged+removable
blocks look like:
  movable normal, movable, movable
instead of current:
  normal, normal, normal movable

but then, with this fixed as suggested, auto online by default
should work just fine in a kernel with normal and movable zones,
without any need for user-space.

> > patch attached there is limited by another memory hotplug
> > issue, which is NORMAL/MOVABLE zone balance, if kernel runs
> > on configuration where the most of memory is hot-removable
> > kernel might experience lack of memory in zone NORMAL.  
> 
> yes and that is an inherent problem of movable memory.
> 
> > > > Which means simple udev rule isn't usable since it gets event from
> > > > the first to the last hotplugged block order. So now we would have
> > > > to write a daemon that would
> > > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > > >  - online them in right order (order might also be different depending
> > > >    on kernel version)
> > > >    -- it becomes even more complicated in NUMA case when there are
> > > >       multiple zones and kernel would have to provide user-space
> > > >       with information about zone maps
> > > > 
> > > > In short current experience shows that userspace approach
> > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > >    fast and/or under memory pressure) when udev (or something else
> > > >    might be killed)    
> > > 
> > > yeah and that is why the patch does the onlining from the kernel.  
> > onlining in this patch is limited to hyperv and patch breaks
> > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > hotplug path.  
> 
> Those can use the udev or do you see any reason why they couldn't?
Reasons are above, under the >>>> and >> quotations: the patch breaks
what Vitaly fixed (including the KVM/VMware use cases), i.e. udev/some
user-space process could be killed if hotplugged memory isn't onlined
fast enough, leading to service termination and/or the memory not
being onlined at all (if udev is killed).

Currently the udev rule is not usable and one needs a daemon
which would correctly do the onlining and keep the zone balance
even for the simple use case of 1 normal and 1 movable zone.
And it gets more complicated in the case of multiple NUMA nodes
with multiple zones.

> > > > > Can you imagine any situation when somebody actually might want to have
> > > > > this knob enabled? From what I understand it doesn't seem to be the
> > > > > case.    
> > > > For x86:
> > > >  * this config option is enabled by default in recent Fedora,    
> > > 
> > > How do you want to support usecases which really want to online memory
> > > as movable? Do you expect those users to disable the option because
> > > unless I am missing something the in kernel auto onlining only supporst
> > > regular onlining.  
> >
> > current auto onlining config option does what it's been designed for,
> > i.e. it onlines hotplugged memory.
> > It's possible for non average Fedora user to override default
> > (commit 86dd995d6) if she/he needs non default behavior
> > (i.e. user knows how to online manually and/or can write
> > a daemon that would handle all of nuances of kernel in use).
> > 
> > For the rest when Fedora is used in cloud and user increases memory
> > via management interface of whatever cloud she/he uses, it just works.
> > 
> > So it's choice of distribution to pick its own default that makes
> > majority of user-base happy and this patch removes it without taking
> > that in consideration.  
> 
> You still can have a udev rule to achive the same thing for
> non-ballooning based hotplug.
Not in the case when the system is under load: the udev path might be slow
and udev might be killed by the OOM killer, leading to permanent
disablement of memory onlining.
 
> > How to online memory is different issue not related to this patch,
> > current default onlining as ZONE_NORMAL works well for scaling
> > up VMs.
> > 
> > Memory unplug is rather new and it doesn't work reliably so far,
> > moving onlining to user-space won't really help. Further work
> > is need to be done so that it would work reliably.  
> 
> The main problem I have with this is that this is a limited usecase
> driven configuration knob which doesn't work properly for other usecases
> (namely movable online once your distribution choses to set the config
> option to auto online).
It works for the default use case in Fedora, and the non-default
movable case can be handled with
 1) removable memory auto-onlined as movable in the kernel, like
    patch [1] would make hotplugged memory movable
    (when I have time I'll try to work on it)
 2) (or in the worst case, due to lack of an alternative) auto-online
    explicitly disabled on the kernel CLI + an onlining daemon
    (since udev isn't working in the current kernel due to the ordering issue)

> There is a userspace solution for this so this
> shouldn't have been merged in the first place!
Sorry, currently the user-space udev solution doesn't work, nor
will it work reliably in extreme conditions.

> It sneaked a proper review
> process (linux-api wasn't CC to get a broader attenttion) which is
> really sad.
get_maintainer.pl doesn't list linux-api for 31bc3858ea3e;
MAINTAINERS should be fixed if linux-api were to be CCed.

> So unless this causes a major regression which would be hard to fix I
> will submit the patch for inclusion.
It will be a major regression due to the lack of a daemon that
can online fast and can't be killed on OOM. So this
cleanup patch does break a used feature without providing
a viable alternative.

I wouldn't object to removing the config option as in this patch
if memory were onlined by default for x86, but that's not the case yet.


[1] https://bugzilla.redhat.com/attachment.cgi?id=1146332
Michal Hocko March 6, 2017, 2:54 p.m. UTC | #17
On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> On Fri, 3 Mar 2017 09:27:23 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > Michal Hocko <mhocko@kernel.org> wrote:
> > >   
> > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > > [...]  
> > > > > When trying to support memory unplug on guest side in RHEL7,
> > > > > experience shows otherwise. Simplistic udev rule which onlines
> > > > > added block doesn't work in case one wants to online it as movable.
> > > > > 
> > > > > Hotplugged blocks in current kernel should be onlined in reverse
> > > > > order to online blocks as movable depending on adjacent blocks zone.    
> > > > 
> > > > Could you be more specific please? Setting online_movable from the udev
> > > > rule should just work regardless of the ordering or the state of other
> > > > memblocks. If that doesn't work I would call it a bug.  
> > > It's rather an implementation constraint than a bug;
> > > for details and a workaround patch see
> > >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
> > 
> > "You are not authorized to access bug #1314306"
> Sorry,
> I've made it public, related comments and patch should be accessible now
> (code snippets in BZ are based on older kernel but logic is still the same upstream)
>  
> > could you paste the reasoning here please?
> sure here is reproducer:
> start VM with CLI:
>   qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
>   -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
>   /path/to/guest_image
> 
> then in guest dimm1 blocks are from 32-39
> 
>   echo online_movable > /sys/devices/system/memory/memory32/state
> -bash: echo: write error: Invalid argument
> 
> in current mainline kernel it triggers following code path:
> 
> online_pages()
>   ...
>        if (online_type == MMOP_ONLINE_KERNEL) {                                 
>                 if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
>                         return -EINVAL;

Are you sure? I would expect MMOP_ONLINE_MOVABLE here

>   zone_can_shift()
>     ...
>         if (idx < target) {                                                      
>                 /* pages must be at end of current zone */                       
>                 if (pfn + nr_pages != zone_end_pfn(zone))                        
>                         return false;            
> 
> since we are trying to online as movable not the last section in
> ZONE_NORMAL.
> 
> Here is what makes hotplugged memory end up in ZONE_NORMAL:
>  acpi_memory_enable_device() -> add_memory -> add_memory_resource ->
>    -> arch/x86/mm/init_64.c  
> 
>      /*
>       * Memory is added always to NORMAL zone. This means you will never get
>       * additional DMA/DMA32 memory.
>       */
>      int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>      {
>         ...
>         struct zone *zone = pgdat->node_zones +
>                 zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> 
> i.e. all hot-plugged memory modules always go to ZONE_NORMAL
> and only the first/last block in zone is allowed to be moved
> to another zone. Patch [1] tries to fix issue by assigning
> removable memory resource to movable zone so hotplugged+removable
> blocks look like:
>   movable normal, movable, movable
> instead of current:
>   normal, normal, normal movable

Hmm, this code is confusing and clear as mud. I have to stare at it some
more, but AFAIK zones shouldn't have problems with holes, so the only
thing we have to guarantee is that different zones do not overlap. So
this smells like a bug rather than an inherent implementation
limitation.

[...]
> > > > > Which means simple udev rule isn't usable since it gets event from
> > > > > the first to the last hotplugged block order. So now we would have
> > > > > to write a daemon that would
> > > > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > > > >  - online them in right order (order might also be different depending
> > > > >    on kernel version)
> > > > >    -- it becomes even more complicated in NUMA case when there are
> > > > >       multiple zones and kernel would have to provide user-space
> > > > >       with information about zone maps
> > > > > 
> > > > > In short current experience shows that userspace approach
> > > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > > >    fast and/or under memory pressure) when udev (or something else
> > > > >    might be killed)    
> > > > 
> > > > yeah and that is why the patch does the onlining from the kernel.  
> > > onlining in this patch is limited to hyperv and patch breaks
> > > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > > hotplug path.  
> > 
> > Those can use the udev or do you see any reason why they couldn't?
>
> Reasons are above, under >>>> and >> quotations; the patch breaks
> what Vitaly fixed (including kvm/vmware usecases), i.e. udev/some
> user-space process could be killed if hotplugged memory isn't onlined
> fast enough, leading to service termination and/or memory not
> being onlined at all (if udev is killed)

OK, so from the discussion so far I have learned that this would be a
problem _only_ if we are trying to hotplug a _lot_ of memory at once
(~1.5% of the online memory is needed).  I am kind of skeptical this is
a reasonable usecase. Who is going to hotadd 8G to a 256M machine (which
would eat half of the available memory, which is still quite far from
OOM)? Even if the memory ballooning uses hotplug, such a growth sounds
a bit excessive.

> Currently the udev rule is not usable and one needs a daemon
> which would correctly do the onlining and keep the zone balance
> even for the simple usecase of 1 normal and 1 movable zone.
> And it gets more complicated in the case of multiple NUMA nodes
> with multiple zones.

That sounds more related to the current implementation than
anything else and as such is not a reason to invent a specific user
visible API. Btw. you are talking about movable zones but the auto
onlining doesn't allow auto onlining movable memory. So either I miss
your point or I am utterly confused.

[...]
> > > Memory unplug is rather new and it doesn't work reliably so far,
> > > moving onlining to user-space won't really help. Further work
> > > is need to be done so that it would work reliably.  
> > 
> > The main problem I have with this is that this is a limited usecase
> > driven configuration knob which doesn't work properly for other usecases
> > (namely movable online once your distribution choses to set the config
> > option to auto online).
>
> it works for default usecase in Fedora and non-default
> movable can be used with
>  1) removable memory auto-online as movable in kernel, like
>     patch [1] would make movable hotplugged memory
>     (when I have time I'll try to work on it)
>  2) (or in worst case due to lack of alternative) explicitly
>     disabled auto-online on kernel CLI + onlining daemon 
>     (since udev isn't working in current kernel due to ordering issue)

So I fail to see how this can work. Say the default is to auto online
all the added memory. This will all be in zone Normal because the auto
onlining doesn't do online_movable. Now I would like to online the
memory as movable, but that might fail because some kernel objects might
already be sitting there, so the offline would fail.

> > There is a userspace solution for this so this
> > shouldn't have been merged in the first place!
> Sorry, currently user-space udev solution doesn't work nor
> will it work reliably in extreme conditions.
> 
> > It sneaked a proper review
> > process (linux-api wasn't CC to get a broader attenttion) which is
> > really sad.
>
> get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e,
> MAINTAINERS should be fixed if linux-api were to be CCed.

User visible APIs _should_ be discussed on this mailing list regardless
of what get_maintainer.pl says. This is not about who the maintainer is but
about getting as wide an audience as possible for things that would have to
be maintained basically forever.
 
> > So unless this causes a major regression which would be hard to fix I
> > will submit the patch for inclusion.
> it will be a major regression due to lack of daemon that
> could online fast and can't be killed on OOM. So this
> clean up patch does break used feature without providing
> a viable alternative.

So let's discuss the current memory hotplug shortcomings and get rid of
the crud which developed on top. I will start by splitting up the patch
into 3 parts: do the auto online thing from the HyperV and Xen ballooning
drivers, drop the config option, and finally drop the sysfs knob.
The last patch might be NAKed and I can live with that as long as the
reasoning is proper and there is a general consensus on that.
Igor Mammedov March 7, 2017, 12:40 p.m. UTC | #18
On Mon, 6 Mar 2017 15:54:17 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> > On Fri, 3 Mar 2017 09:27:23 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> >   
> > > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:  
> > > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > >     
> > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]

> > > > > memblocks. If that doesn't work I would call it a bug.    
> > > > It's rather an implementation constrain than a bug
> > > > for details and workaround patch see
> > > >  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7    
> > > 
> > > "You are not authorized to access bug #1314306"  
> > Sorry,
> > I've made it public, related comments and patch should be accessible now
> > (code snippets in BZ are based on older kernel but logic is still the same upstream)
> >    
> > > could you paste the reasoning here please?  
> > sure here is reproducer:
> > start VM with CLI:
> >   qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
> >   -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
> >   /path/to/guest_image
> > 
> > then in guest dimm1 blocks are from 32-39
> > 
> >   echo online_movable > /sys/devices/system/memory/memory32/state
> > -bash: echo: write error: Invalid argument
> > 
> > in current mainline kernel it triggers following code path:
> > 
> > online_pages()
> >   ...
> >        if (online_type == MMOP_ONLINE_KERNEL) {                                 
> >                 if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
> >                         return -EINVAL;  
> 
> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
Pretty much; the reproducer is above so try and see for yourself.

[...]
> [...]
> > > > > > Which means simple udev rule isn't usable since it gets event from
> > > > > > the first to the last hotplugged block order. So now we would have
> > > > > > to write a daemon that would
> > > > > >  - watch for all blocks in hotplugged memory appear (how would it know)
> > > > > >  - online them in right order (order might also be different depending
> > > > > >    on kernel version)
> > > > > >    -- it becomes even more complicated in NUMA case when there are
> > > > > >       multiple zones and kernel would have to provide user-space
> > > > > >       with information about zone maps
> > > > > > 
> > > > > > In short current experience shows that userspace approach
> > > > > >  - doesn't solve issues that Vitaly has been fixing (i.e. onlining
> > > > > >    fast and/or under memory pressure) when udev (or something else
> > > > > >    might be killed)      
> > > > > 
> > > > > yeah and that is why the patch does the onlining from the kernel.    
> > > > onlining in this patch is limited to hyperv and patch breaks
> > > > auto-online on x86 kvm/vmware/baremetal as they reuse the same
> > > > hotplug path.    
> > > 
> > > Those can use the udev or do you see any reason why they couldn't?  
> >
> > Reasons are above, under >>>> and >> quotations, patch breaks
> > what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some
> > user-space process could be killed if hotplugged memory isn't onlined
> > fast enough leading to service termination and/or memory not
> > being onlined at all (if udev is killed)  
> 
> OK, so from the discussion so far I have learned that this would be
> problem _only_ if we are trying to hotplug a _lot_ of memory at once
> (~1.5% of the online memory is needed).  I am kind of skeptical this is
> a reasonable usecase. Who is going to hotadd 8G to 256M machine (which
> would eat half of the available memory which is still quite far from
> OOM)? Even if the memory balloning uses hotplug then such a grow sounds
> a bit excessive.
The slow-and-killable-udev issue doesn't really depend on the
amount of hotplugged memory since it's onlined in blocks
(128M for x64). Considering that it's currently onlined
as zone Normal, the kernel doesn't have any issues adding more
follow-up blocks of memory.
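
(For reference, the memory block granularity can be read from sysfs; a
quick check, with the output one would expect on a typical x86_64 guest:)

  # cat /sys/devices/system/memory/block_size_bytes
  8000000

i.e. 0x8000000 = 128M per memory block.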

> > Currently udev rule is not usable and one needs a daemon
> > which would correctly do onlining and keep zone balance
> > even for simple case usecase of 1 normal and 1 movable zone.
> > And it gets more complicated in case of multiple numa nodes
> > with multiple zones.  
> 
> That sounds to be more related to the current implementation than
> anything else and as such is not a reason to invent specific user
> visible api. Btw. you are talking about movable zones byt the auto
> onlining doesn't allow to auto online movable memory. So either I miss
> your point or I am utterly confused.
In the current state neither does the udev rule, as memory is onlined
as NORMAL (the x64 variant at least), which is the same as auto online
does now.

We are discussing 2 different issues here and the thread got pretty
hard to follow. I'll try to sum up the results at the end.

> [...]
> > > > Memory unplug is rather new and it doesn't work reliably so far,
> > > > moving onlining to user-space won't really help. Further work
> > > > is need to be done so that it would work reliably.    
> > > 
> > > The main problem I have with this is that this is a limited usecase
> > > driven configuration knob which doesn't work properly for other usecases
> > > (namely movable online once your distribution choses to set the config
> > > option to auto online).  
> >
> > it works for default usecase in Fedora and non-default
> > movable can be used with
> >  1) removable memory auto-online as movable in kernel, like
> >     patch [1] would make movable hotplugged memory
> >     (when I have time I'll try to work on it)
> >  2) (or in worst case due to lack of alternative) explicitly
> >     disabled auto-online on kernel CLI + onlining daemon 
> >     (since udev isn't working in current kernel due to ordering issue)  
> 
> So I fail to see how this can work. Say the default will auto online
> all the added memory. This will be all in Zone Normal because the auto
> onlining doesn't do online_movable. Now I would like to online the
> memory as movable but that might fail because some kernel objects might
> be already sitting there so the offline would fail.
The same happens with the udev rule, as currently it can online blocks from
the first to the last hotplugged one only as zone Normal.
See
  arch/x86/mm/init_64.c:
      acpi_memory_enable_device() -> add_memory -> add_memory_resource

Both auto online and udev could be fixed by the patch
https://bugzilla.redhat.com/attachment.cgi?id=1146332
to online removable memory as movable.

> > > There is a userspace solution for this so this
> > > shouldn't have been merged in the first place!  
> > Sorry, currently user-space udev solution doesn't work nor
> > will it work reliably in extreme conditions.
> >   
> > > It sneaked a proper review
> > > process (linux-api wasn't CC to get a broader attenttion) which is
> > > really sad.  
> >
> > get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e,
> > MAINTAINERS should be fixed if linux-api were to be CCed.  
> 
> user visible APIs _should_ be discussed at this mailing list regardless
> what get_maintainer.pl says. This is not about who is the maintainer but
> about getting as wide audience for things that would have to be
> maintained basically for ever.
How would a random contributor know which list to CC?

> > > So unless this causes a major regression which would be hard to fix I
> > > will submit the patch for inclusion.  
> > it will be a major regression due to lack of daemon that
> > could online fast and can't be killed on OOM. So this
> > clean up patch does break used feature without providing
> > a viable alternative.  
> 
> So let's discuss the current memory hotplug shortcomings and get rid of
> the crud which developed on top. I will start by splitting up the patch
> into 3 parts. Do the auto online thing from the HyperV and xen balloning
> drivers and dropping the config option and finally drop the sysfs knob.
> The last patch might be NAKed and I can live with that as long as the
> reasoning is proper and there is a general consensus on that.
PS: CC me on those patches too

It's a major regression if you remove auto online in kernels that
run on top of x86 kvm/vmware hypervisors; making API cleanups
while breaking useful functionality doesn't make sense.

I would ACK the config option removal if auto online keeps working
for all x86 hypervisors (hyperv/xen aren't the only ones who need it)
and the kernel CLI option to override the default is kept.

That doesn't mean that others will agree with flipping the default;
that's why the config option has been added.

Now to sum up what's been discussed in this thread, there were 2
different issues:
  1) memory hotplug: remove in-kernel auto online for all
                     except for hyperv/xen

       - the suggested RFC is not acceptable from the virt point of view
         as it regresses guests on top of x86 kvm/vmware, which
         both use ACPI based memory hotplug.

       - the udev/userspace solution doesn't work in practice as it's
         too slow and unreliable when the system is under load, which
         is quite common in the virt usecase. That's why auto online
         was introduced in the first place.

  2) memory unplug: online memory as movable

       - doesn't currently work with the udev rule due to kernel
         issues: https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

       - could be fixed both for in-kernel auto online and udev
         with the following patch:
         https://bugzilla.redhat.com/attachment.cgi?id=1146332
         but fixing it this way exposes zone imbalance issues,
         which are not present in the current kernel as blocks are
         onlined in zone Normal. So this is an area to work on and
         improve.

       - currently if one wants to use online_movable,
         one has to either
           * disable auto online in the kernel OR
           * remove the udev rule that the distro ships
         AND write a custom daemon that will be able to online
         blocks in the right zone/order. So currently the whole
         online_movable thing isn't working by default
         regardless of who onlines memory.

         I'm in favor of implementing that in the kernel as it keeps
         kernel internals inside the kernel and doesn't need
         the kernel API to be involved (memory blocks in sysfs,
         online_kernel, online_movable).
         There would be no need for userspace which would have to
         deal with the kernel zoo and maintain that as well.
Michal Hocko March 9, 2017, 12:54 p.m. UTC | #19
On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> On Mon, 6 Mar 2017 15:54:17 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
[...]
> > > in current mainline kernel it triggers following code path:
> > > 
> > > online_pages()
> > >   ...
> > >        if (online_type == MMOP_ONLINE_KERNEL) {                                 
> > >                 if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
> > >                         return -EINVAL;  
> > 
> > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> pretty much, reproducer is above so try and see for yourself

I will play with this...
 
[...]
> > > get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e,
> > > MAINTAINERS should be fixed if linux-api were to be CCed.  
> > 
> > user visible APIs _should_ be discussed at this mailing list regardless
> > what get_maintainer.pl says. This is not about who is the maintainer but
> > about getting as wide audience for things that would have to be
> > maintained basically for ever.
>
> How would random contributor know which list to CC?

This should have been brought up during the review process which was
less than sufficient in this case.

> > > > So unless this causes a major regression which would be hard to fix I
> > > > will submit the patch for inclusion.  
> > > it will be a major regression due to lack of daemon that
> > > could online fast and can't be killed on OOM. So this
> > > clean up patch does break used feature without providing
> > > a viable alternative.  
> > 
> > So let's discuss the current memory hotplug shortcomings and get rid of
> > the crud which developed on top. I will start by splitting up the patch
> > into 3 parts. Do the auto online thing from the HyperV and xen balloning
> > drivers and dropping the config option and finally drop the sysfs knob.
> > The last patch might be NAKed and I can live with that as long as the
> > reasoning is proper and there is a general consensus on that.
> PS: CC me on that patches too
> 
> It's major regression if you remove auto online in kernels that
> run on top of x86 kvm/vmware hypervisors, making API cleanups
> while breaking useful functionality doesn't make sense.
> 
> I would ACK config option removal if auto online keeps working
> for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> and keep kernel CLI option to override default.
> 
> That doesn't mean that others will agree with flipping default,
> that's why config option has been added.
> 
> Now to sum up what's been discussed on this thread, there were 2
> different issues discussed:
>   1) memory hotplug: remove in kernel auto online for all
>                      except of hyperv/xen
> 
>        - suggested RFC is not acceptable from virt point of view
>          as it regresses guests on top of x86 kvm/vmware which
>          both use ACPI based memory hotplug.
> 
>        - udev/userspace solution doesn't work in practice as it's
>          too slow and unreliable when system is under load which
>          is quite common in virt usecase. That's why auto online
>          has been introduced in the first place.

Please try to be more specific about why "too slow" is a problem. Also how
much slower are we talking about?
 
>   2) memory unplug: online memory as movable
> 
>        - doesn't work currently with udev rule due to kernel
>          issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

These should be fixed
 
>        - could be fixed both for in kernel auto online and udev
>          with following patch:
>          https://bugzilla.redhat.com/attachment.cgi?id=1146332
>          but fixing it this way exposes zone disbalance issues,
>          which are not present in current kernel as blocks are
>          onlined in Zone Normal. So this is area to work and
>          improve on.
> 
>        - currently if one wants to use online_movable,
>          one has to either
>            * disable auto online in kernel OR

which might just not work because an unmovable allocation could have
made the memblock pinned.

>            * remove udev rule that distro ships
>          AND write custom daemon that will be able to online
>          block in right zone/order. So currently whole
>          online_movable thing isn't working by default
>          regardless of who onlines memory.

My experience with onlining full nodes as movable shows this works just
fine (with all the limitations of the movable zones, but that is a
separate thing). I haven't played with configurations where movable
zones are sharing the node with other zones.

>          I'm in favor of implementing that in kernel as it keeps
>          kernel internals inside kernel and doesn't need
>          kernel API to be involved (memory blocks in sysfs,
>          online_kernel, online_movable)
>          There would be no need in userspace which would have to
>          deal with kernel zoo and maintain that as well.

The kernel is supposed to provide a proper API and that is sysfs
currently. I am not entirely happy about it either but pulling a lot of
code into the kernel is not the right thing to do. Especially when
different usecases require different treatment.
Michal Hocko March 10, 2017, 1:58 p.m. UTC | #20
Let's CC people touching this logic. A short summary is that onlining
memory via udev is currently unusable for online_movable because blocks
are added from lower addresses while movable blocks are only allowed from
the last blocks. More below.

On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> > On Mon, 6 Mar 2017 15:54:17 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> [...]
> > > > in current mainline kernel it triggers following code path:
> > > > 
> > > > online_pages()
> > > >   ...
> > > >        if (online_type == MMOP_ONLINE_KERNEL) {                                 
> > > >                 if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
> > > >                         return -EINVAL;  
> > > 
> > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> > pretty much, reproducer is above so try and see for yourself
> 
> I will play with this...

OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
[...]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
[    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
[    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
[    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
[    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
[    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]

so there is neither a normal zone nor a movable one at boot time.
Then I hotplugged a 1G slot
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

unfortunately the memory didn't show up automatically and I got
[  116.375781] acpi PNP0C80:00: Enumeration failure

so I had to probe it manually (probably the BIOS my qemu uses doesn't
support auto probing - I haven't really dug further). Anyway the SRAT
table printed during boot told us that we should start at 0x100000000

# echo 0x100000000 > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

which looks reasonably right? Both Normal and Movable zones are allowed

# echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

Huh, so our valid_zones have changed under our feet...

# echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

and again. So only the last memblock is considered movable. Let's try to
online them now.

# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This would explain why onlining from the last block actually works, but
to me this sounds like completely crappy behavior. All we need to
guarantee AFAICS is that the Normal and Movable zones do not overlap. I
believe there is not even a real requirement about the ordering of the
physical memory in Normal vs. Movable zones as long as they do not overlap. But
let's keep it simple for the start and always enforce the current status
quo that the Normal zone physically precedes the Movable zone.
Can somebody explain why we cannot have a simple rule for Normal vs.
Movable which would be:
	- block [pfn, pfn+block_size] can be Normal if
	  !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
	- block [pfn, pfn+block_size] can be Movable if
	  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

I haven't fully grokked all the restrictions on the movable zone size
based on the kernel parameters (find_zone_movable_pfns_for_nodes), but
this shouldn't make the situation much more complicated, I
believe, because those parameters should be mostly about static
initialization rather than hotplug, but I might easily be missing
something.
Michal Hocko March 10, 2017, 3:53 p.m. UTC | #21
On Fri 10-03-17 14:58:07, Michal Hocko wrote:
[...]
> This would explain why onlining from the last block actually works but
> to me this sounds like a completely crappy behavior. All we need to
> guarantee AFAICS is that Normal and Movable zones do not overlap. I
> believe there is even no real requirement about ordering of the physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that Normal zone is physically preceeding Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
> 	- block [pfn, pfn+block_size] can be Normal if
> 	  !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> 	- block [pfn, pfn+block_size] can be Movable if
> 	  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

OK, so while I was playing with this setup some more I probably got why
this is done this way. All new memblocks are added to zone Normal
where they are accounted as spanned but not present. When we do
online_movable we just cut from the end of the Normal zone and move it
to the Movable zone. This sounds really awkward. What was the reason to go
this way? Why can't we simply add those pages to the zone at online
time?
Yasuaki Ishimatsu March 10, 2017, 5:39 p.m. UTC | #22
On 03/10/2017 08:58 AM, Michal Hocko wrote:
> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while movable blocks are allowed from
> last blocks. More below.
>
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
>> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
>>> On Mon, 6 Mar 2017 15:54:17 +0100
>>> Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>>> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
>> [...]
>>>>> in current mainline kernel it triggers following code path:
>>>>>
>>>>> online_pages()
>>>>>   ...
>>>>>        if (online_type == MMOP_ONLINE_KERNEL) {
>>>>>                 if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
>>>>>                         return -EINVAL;
>>>>
>>>> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
>>> pretty much, reproducer is above so try and see for yourself
>>
>> I will play with this...
>
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
> [...]
> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> [    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> [    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> [    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> [    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> [    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
>
> so there is neither any normal zone nor movable one at the boot time.
> Then I hotplugged 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
>
> unfortunatelly the memory didn't show up automatically and I got
> [  116.375781] acpi PNP0C80:00: Enumeration failure
>
> so I had to probe it manually (prbably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x100000000
>
> # echo 0x100000000 > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory32/valid_zones
> Normal Movable
>
> which looks reasonably right? Both Normal and Movable zones are allowed
>
> # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>
> Huh, so our valid_zones have changed under our feet...
>
> # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal
> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>
> and again. So only the last memblock is considered movable. Let's try to
> online them now.
>
> # echo online_movable > /sys/devices/system/memory/memory34/state
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>

I think there is no strong reason why the kernel has the restriction.
Setting the restrictions seems to have made management of
these zone structs simpler.

Thanks,
Yasuaki Ishimatsu

> This would explain why onlining from the last block actually works but
> to me this sounds like a completely crappy behavior. All we need to
> guarantee AFAICS is that Normal and Movable zones do not overlap. I
> believe there is even no real requirement about ordering of the physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that Normal zone is physically preceeding Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
> 	- block [pfn, pfn+block_size] can be Normal if
> 	  !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> 	- block [pfn, pfn+block_size] can be Movable if
> 	  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
>
> I haven't fully grokked all the restrictions on the movable zone size
> based on the kernel parameters (find_zone_movable_pfns_for_nodes) but
> this shouldn't really make the situation really much more complicated I
> believe because those parameters should be mostly about static
> initialization rather than hotplug but I might be easily missing
> something.
>
Reza Arbab March 10, 2017, 7 p.m. UTC | #23
On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
>OK, so while I was playing with this setup some more I probably got why
>this is done this way. All new memblocks are added to the zone Normal
>where they are accounted as spanned but not present.

It's not always zone Normal. See zone_for_memory(). This leads to a 
workaround for having to do online_movable in descending block order. 
Instead of this:

1. probe block 34, probe block 33, probe block 32, ...
2. online_movable 34, online_movable 33, online_movable 32, ...

you can online_movable the first block before adding the rest:

1. probe block 32, online_movable 32
2. probe block 33, probe block 34, ...
	- zone_for_memory() will cause these to start Movable
3. online 33, online 34, ...
	- they're already in Movable, so online_movable is equivalent
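
(Spelled out with the probe/state sysfs files used earlier in the thread --
a rough sketch, assuming hotplug starts at 0x100000000 with 128M blocks so
the block numbers are 32/33/34 as in the reproducer:)

  # probe and immediately online the first block as movable
  echo 0x100000000 > /sys/devices/system/memory/probe
  echo online_movable > /sys/devices/system/memory/memory32/state
  # per the workaround above, the following blocks then start out Movable
  echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
  echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
  echo online > /sys/devices/system/memory/memory33/state
  echo online > /sys/devices/system/memory/memory34/state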

I agree with your general sentiment that this stuff is very 
nonintuitive.
Daniel Kiper March 10, 2017, 10 p.m. UTC | #24
Hey,

On Mon, Mar 06, 2017 at 03:54:17PM +0100, Michal Hocko wrote:

[...]

> So let's discuss the current memory hotplug shortcomings and get rid of
> the crud which developed on top. I will start by splitting up the patch
> into 3 parts. Do the auto online thing from the HyperV and xen balloning
> drivers and dropping the config option and finally drop the sysfs knob.
> The last patch might be NAKed and I can live with that as long as the
> reasoning is proper and there is a general consensus on that.

If it is not too late please CC me on the patches and relevant threads.

Daniel
Michal Hocko March 13, 2017, 9:19 a.m. UTC | #25
On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote:
> On 03/10/2017 08:58 AM, Michal Hocko wrote:
[...]
> >OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
> >[...]
> >[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> >[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> >[    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> >[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> >[    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> >[    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> >[    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> >[    0.000000] Zone ranges:
> >[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> >[    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> >[    0.000000]   Normal   empty
> >[    0.000000] Movable zone start for each node
> >[    0.000000] Early memory node ranges
> >[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> >[    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> >[    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
> >
> >so there is neither any normal zone nor movable one at the boot time.
> >Then I hotplugged 1G slot
> >(qemu) object_add memory-backend-ram,id=mem1,size=1G
> >(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> >
> >unfortunatelly the memory didn't show up automatically and I got
> >[  116.375781] acpi PNP0C80:00: Enumeration failure
> >
> >so I had to probe it manually (prbably the BIOS my qemu uses doesn't
> >support auto probing - I haven't really dug further). Anyway the SRAT
> >table printed during the boot told that we should start at 0x100000000
> >
> ># echo 0x100000000 > /sys/devices/system/memory/probe
> ># grep . /sys/devices/system/memory/memory32/valid_zones
> >Normal Movable
> >
> >which looks reasonably right? Both Normal and Movable zones are allowed
> >
> ># echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> ># grep . /sys/devices/system/memory/memory3?/valid_zones
> >/sys/devices/system/memory/memory32/valid_zones:Normal
> >/sys/devices/system/memory/memory33/valid_zones:Normal Movable
> >
> >Huh, so our valid_zones have changed under our feet...
> >
> ># echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> ># grep . /sys/devices/system/memory/memory3?/valid_zones
> >/sys/devices/system/memory/memory32/valid_zones:Normal
> >/sys/devices/system/memory/memory33/valid_zones:Normal
> >/sys/devices/system/memory/memory34/valid_zones:Normal Movable
> >
> >and again. So only the last memblock is considered movable. Let's try to
> >online them now.
> >
> ># echo online_movable > /sys/devices/system/memory/memory34/state
> ># grep . /sys/devices/system/memory/memory3?/valid_zones
> >/sys/devices/system/memory/memory32/valid_zones:Normal
> >/sys/devices/system/memory/memory33/valid_zones:Normal Movable
> >/sys/devices/system/memory/memory34/valid_zones:Movable Normal
> >
> 
> I think there is no strong reason which kernel has the restriction.
> By setting the restrictions, it seems to have made management of
> these zone structs simple.

Could you be more specific please? How could this make management any
easier when udev is basically racing with the physical hotplug and the
result is basically undefined?
Michal Hocko March 13, 2017, 9:21 a.m. UTC | #26
On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> >OK, so while I was playing with this setup some more I probably got why
> >this is done this way. All new memblocks are added to the zone Normal
> >where they are accounted as spanned but not present.
> 
> It's not always zone Normal. See zone_for_memory(). This leads to a
> workaround for having to do online_movable in descending block order.
> Instead of this:
> 
> 1. probe block 34, probe block 33, probe block 32, ...
> 2. online_movable 34, online_movable 33, online_movable 32, ...
> 
> you can online_movable the first block before adding the rest:

How do I enforce that behavior when the probe happens automagically?

> 1. probe block 32, online_movable 32
> 2. probe block 33, probe block 34, ...
> 	- zone_for_memory() will cause these to start Movable
> 3. online 33, online 34, ...
> 	- they're already in Movable, so online_movable is equivalentr
> 
> I agree with your general sentiment that this stuff is very nonintuitive.

My criterion for nonintuitive is probably different because I would call
this _completely_unusable_. Sorry for being so loud about this but the
more I look into this area the more WTF code I see. This has seen close
to zero review and seems to be building up more single-usecase code on
top of the previous. We need to change this, seriously!
Igor Mammedov March 13, 2017, 10:31 a.m. UTC | #27
On Fri, 10 Mar 2017 14:58:07 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while movable blocks are allowed from
> last blocks. More below.
> 
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> > On Tue 07-03-17 13:40:04, Igor Mammedov wrote:  
> > > On Mon, 6 Mar 2017 15:54:17 +0100
> > > Michal Hocko <mhocko@kernel.org> wrote:
> > >   
> > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:  
> > [...]  
> > > > > in current mainline kernel it triggers following code path:
> > > > > 
> > > > > online_pages()
> > > > >   ...
> > > > >        if (online_type == MMOP_ONLINE_KERNEL) {                                 
> > > > >                 if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
> > > > >                         return -EINVAL;    
> > > > 
> > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here  
> > > pretty much, reproducer is above so try and see for yourself  
> > 
> > I will play with this...  
> 
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
'mem' here distributes the boot memory specified by "-m 2G" and does not
include memory specified by -device pc-dimm.

> [...]
> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> [    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> [    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> [    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> [    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> [    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
> 
> so there is neither any normal zone nor movable one at the boot time.
It could be if hotpluggable memory were present at boot time in the E820 table
(if I remember right, when running on hyperv there is a movable zone at boot time),

but in qemu hotpluggable memory isn't put into E820,
so the zone is allocated later when the memory is enumerated
by the ACPI subsystem and onlined.
It causes fewer issues wrt the movable zone and works for
different versions of linux/windows as well.

That's where in-kernel auto-onlining could also be useful,
since the user would be able to start up with a small amount of
non-removable memory plus several removable DIMMs
and have all the memory onlined/available by the time
the initrd is loaded. (The missing piece here is onlining
removable memory as movable by default.)


> Then I hotplugged 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
You can also specify the node a pc-dimm goes to with the 'node' property
if it should go to a node other than node 0.

device_add pc-dimm,id=dimm1,memdev=mem1,node=1


> unfortunatelly the memory didn't show up automatically and I got
> [  116.375781] acpi PNP0C80:00: Enumeration failure
it should work,
do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?
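
(A quick way to check in the running guest, assuming the usual config file
locations:)

  grep CONFIG_ACPI_HOTPLUG_MEMORY /boot/config-$(uname -r)
  # or, when the kernel exposes /proc/config.gz:
  zgrep CONFIG_ACPI_HOTPLUG_MEMORY /proc/config.gz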
 
> so I had to probe it manually (prbably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x100000000
Michal Hocko March 13, 2017, 10:43 a.m. UTC | #28
On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> On Fri, 10 Mar 2017 14:58:07 +0100
[...]
> > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > [    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > [    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > [    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > [    0.000000] Zone ranges:
> > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> > [    0.000000]   Normal   empty
> > [    0.000000] Movable zone start for each node
> > [    0.000000] Early memory node ranges
> > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > [    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> > [    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
> > 
> > so there is neither any normal zone nor movable one at the boot time.
> it could be if hotpluggable memory were present at boot time in E802 table
> (if I remember right when running on hyperv there is movable zone at boot time),
> 
> but in qemu hotpluggable memory isn't put into E820,
> so zone is allocated later when memory is enumerated
> by ACPI subsystem and onlined.
> It causes less issues wrt movable zone and works for
> different versions of linux/windows as well.
> 
> That's where in kernel auto-onlining could be also useful,
> since user would be able to start-up with with small
> non removable memory plus several removable DIMMs
> and have all the memory onlined/available by the time
> initrd is loaded. (missing piece here is onling
> removable memory as movable by default).

Why should we even care to online that memory that early rather than
making it available via e820?
 
> > Then I hotplugged 1G slot
> > (qemu) object_add memory-backend-ram,id=mem1,size=1G
> > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> You can also specify node a pc-dimm goes to with 'node' property
> if it should go to other then node 0.
> 
> device_add pc-dimm,id=dimm1,memdev=mem1,node=1

thanks for the tip

> > unfortunatelly the memory didn't show up automatically and I got
> > [  116.375781] acpi PNP0C80:00: Enumeration failure
> it should work,
> do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?

No I didn't. Thanks, good to know!
Igor Mammedov March 13, 2017, 10:55 a.m. UTC | #29
On Thu, 9 Mar 2017 13:54:00 +0100
Michal Hocko <mhocko@kernel.org> wrote:

[...]
> > It's major regression if you remove auto online in kernels that
> > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > while breaking useful functionality doesn't make sense.
> > 
> > I would ACK config option removal if auto online keeps working
> > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > and keep kernel CLI option to override default.
> > 
> > That doesn't mean that others will agree with flipping default,
> > that's why config option has been added.
> > 
> > Now to sum up what's been discussed on this thread, there were 2
> > different issues discussed:
> >   1) memory hotplug: remove in kernel auto online for all
> >                      except of hyperv/xen
> > 
> >        - suggested RFC is not acceptable from virt point of view
> >          as it regresses guests on top of x86 kvm/vmware which
> >          both use ACPI based memory hotplug.
> > 
> >        - udev/userspace solution doesn't work in practice as it's
> >          too slow and unreliable when system is under load which
> >          is quite common in virt usecase. That's why auto online
> >          has been introduced in the first place.  
> 
> Please try to be more specific why "too slow" is a problem. Also how
> much slower are we talking about?
In the virt case, on a host with lots of VMs, userspace handler
processing could be scheduled late enough to trigger a race
between (guest memory going away/the OOM handler) and memory
coming online.

>  
> >   2) memory unplug: online memory as movable
> > 
> >        - doesn't work currently with udev rule due to kernel
> >          issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
> 
> These should be fixed
>  
> >        - could be fixed both for in kernel auto online and udev
> >          with following patch:
> >          https://bugzilla.redhat.com/attachment.cgi?id=1146332
> >          but fixing it this way exposes zone disbalance issues,
> >          which are not present in current kernel as blocks are
> >          onlined in Zone Normal. So this is area to work and
> >          improve on.
> > 
> >        - currently if one wants to use online_movable,
> >          one has to either
> >            * disable auto online in kernel OR  
> 
> which might not just work because an unmovable allocation could have
> made the memblock pinned.
With memhp_default_state=offline on the kernel CLI there won't be any
unmovable allocation as hotplugged memory won't be onlined and
the user can online it manually. So it works for the non-default usecase
of playing with memory hot-unplug.
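
(For illustration -- the boot parameter only sets the default for the
existing sysfs knob, so the policy can also be inspected or flipped at
runtime; a minimal sketch:)

  # boot with: memhp_default_state=offline  (kernel command line)
  cat /sys/devices/system/memory/auto_online_blocks
  # -> offline; writing "online" re-enables auto-onlining at runtime
  echo online > /sys/devices/system/memory/auto_online_blocks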
 
> >            * remove udev rule that distro ships
> >          AND write custom daemon that will be able to online
> >          block in right zone/order. So currently whole
> >          online_movable thing isn't working by default
> >          regardless of who onlines memory.  
> 
> my epxperience with onlining full nodes as movable shows this works just
> fine (with all the limitations of the movable zones but that is a
> separate thing). I haven't played with configurations where movable
> zones are sharing the node with other zones.
I don't have access to such a baremetal configuration to play
with anymore.


> >          I'm in favor of implementing that in kernel as it keeps
> >          kernel internals inside kernel and doesn't need
> >          kernel API to be involved (memory blocks in sysfs,
> >          online_kernel, online_movable)
> >          There would be no need in userspace which would have to
> >          deal with kernel zoo and maintain that as well.  
> 
> The kernel is supposed to provide a proper API and that is sysfs
> currently. I am not entirely happy about it either but pulling a lot of
> code into the kernel is not the rigth thing to do. Especially when
> different usecases require different treatment.
If it could be done from the kernel side alone, it looks like a better way
to me not to involve userspace at all. And for ACPI based x86/ARM it's
possible to implement without adding a lot of kernel code.
That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
so we could continue improving kernel-only auto-onlining
and fixing the current memory hot(un)plug issues without affecting
other platforms/users that are not interested in it.
(PS: I don't care much about the sysfs knob for setting auto-onlining,
as the kernel CLI override with memhp_default_state seems
sufficient to me)
Michal Hocko March 13, 2017, 12:28 p.m. UTC | #30
On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> On Thu, 9 Mar 2017 13:54:00 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> [...]
> > > It's major regression if you remove auto online in kernels that
> > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > while breaking useful functionality doesn't make sense.
> > > 
> > > I would ACK config option removal if auto online keeps working
> > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > and keep kernel CLI option to override default.
> > > 
> > > That doesn't mean that others will agree with flipping default,
> > > that's why config option has been added.
> > > 
> > > Now to sum up what's been discussed on this thread, there were 2
> > > different issues discussed:
> > >   1) memory hotplug: remove in kernel auto online for all
> > >                      except of hyperv/xen
> > > 
> > >        - suggested RFC is not acceptable from virt point of view
> > >          as it regresses guests on top of x86 kvm/vmware which
> > >          both use ACPI based memory hotplug.
> > > 
> > >        - udev/userspace solution doesn't work in practice as it's
> > >          too slow and unreliable when system is under load which
> > >          is quite common in virt usecase. That's why auto online
> > >          has been introduced in the first place.  
> > 
> > Please try to be more specific why "too slow" is a problem. Also how
> > much slower are we talking about?
>
> In virt case on host with lots VMs, userspace handler
> processing could be scheduled late enough to trigger a race
> between (guest memory going away/OOM handler) and memory
> coming online.

Either you are mixing two things together or this doesn't really make
much sense. So is this ballooning based on memory hotplug (aka active
memory hotadd initiated between guest and host automatically) or a guest
asking for additional memory by other means (pay more for memory etc.)?
Because if this is an administrative operation then I seriously question
this reasoning.

[...]
> > >        - currently if one wants to use online_movable,
> > >          one has to either
> > >            * disable auto online in kernel OR  
> > 
> > which might not just work because an unmovable allocation could have
> > made the memblock pinned.
>
> With memhp_default_state=offline on kernel CLI there won't be any
> unmovable allocation as hotplugged memory won't be onlined and
> user can online it manually. So it works for non default usecase
> of playing with memory hot-unplug.

I was talking about the case when the auto_online was true, of course,
e.g. depending on the config option which you've said is enabled in
Fedora kernels.
  
[...] 
> > >          I'm in favor of implementing that in kernel as it keeps
> > >          kernel internals inside kernel and doesn't need
> > >          kernel API to be involved (memory blocks in sysfs,
> > >          online_kernel, online_movable)
> > >          There would be no need in userspace which would have to
> > >          deal with kernel zoo and maintain that as well.  
> > 
> > The kernel is supposed to provide a proper API and that is sysfs
> > currently. I am not entirely happy about it either but pulling a lot of
> > code into the kernel is not the rigth thing to do. Especially when
> > different usecases require different treatment.
>
> If it could be done from kernel side alone, it looks like a better way
> to me not to involve userspace at all. And for ACPI based x86/ARM it's
> possible to implement without adding a lot of kernel code.

But this is not how we do the kernel development. We provide the API so
that userspace can implement the appropriate policy on top. We do not
add random knobs to implement the same thing in the kernel. Different
users might want to implement different onlining strategies and that is
hardly describable by a single global knob. Just look at the s390
example provided earlier. Please try to think out of your usecase scope.

> That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> so we could continue on improving kernel only auto-onlining
> and fixing current memory hot(un)plug issues without affecting
> other platforms/users that are no interested in it.

I really do not see any reason to keep the config option. Setting this
to enabled is the _wrong_ thing to do in a general purpose
(distribution) kernel, and a kernel for a specific usecase can achieve
the same thing via the boot command line.

> (PS: I don't care much about sysfs knob for setting auto-onlining,
> as kernel CLI override with memhp_default_state seems
> sufficient to me)

That is good to hear! I would be OK with keeping the kernel command line
option until we resolve all the current issues with the hotplug.
Vitaly Kuznetsov March 13, 2017, 12:54 p.m. UTC | #31
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
>> > > 
>> > >        - suggested RFC is not acceptable from virt point of view
>> > >          as it regresses guests on top of x86 kvm/vmware which
>> > >          both use ACPI based memory hotplug.
>> > > 
>> > >        - udev/userspace solution doesn't work in practice as it's
>> > >          too slow and unreliable when system is under load which
>> > >          is quite common in virt usecase. That's why auto online
>> > >          has been introduced in the first place.  
>> > 
>> > Please try to be more specific why "too slow" is a problem. Also how
>> > much slower are we talking about?
>>
>> In virt case on host with lots VMs, userspace handler
>> processing could be scheduled late enough to trigger a race
>> between (guest memory going away/OOM handler) and memory
>> coming online.
>
> Either you are mixing two things together or this doesn't really make
> much sense. So is this a balloning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.

I'm probably repeating myself but it seems this point was lost:

This is not really 'ballooning', it is just pure memory
hotplug. People may have arbitrary tools monitoring their VM memory usage, and
when a VM is running low on memory they may want to hotplug more memory
to it. With udev-style memory onlining they have to be aware of page
tables and other in-kernel structures which require allocation, so they
need to add memory slowly and gradually or they risk running into OOM
(at least getting some processes killed, and these processes may be
important). With in-kernel memory hotplug everything happens
synchronously and no 'slowly and gradually' algorithm is required in
all the tools which may trigger memory hotplug.

It's not about slowness, it's about being synchronous vs.
asynchronous. This is not related to the virtualization technology used;
the use-case is the same for all of those that support memory hotplug.
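
(For reference, the udev-based onlining discussed here typically amounts
to a single rule; a sketch, assuming the common distribution rule text
and an illustrative file name:

  # install the usual auto-online rule and reload udev
  echo 'SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"' \
      > /etc/udev/rules.d/40-memory-hotplug.rules
  udevadm control --reload

The rule fires per memory block when the add uevent is delivered, which
is exactly the asynchronous step being debated above.)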
Michal Hocko March 13, 2017, 1:19 p.m. UTC | #32
On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> >> > > 
> >> > >        - suggested RFC is not acceptable from virt point of view
> >> > >          as it regresses guests on top of x86 kvm/vmware which
> >> > >          both use ACPI based memory hotplug.
> >> > > 
> >> > >        - udev/userspace solution doesn't work in practice as it's
> >> > >          too slow and unreliable when system is under load which
> >> > >          is quite common in virt usecase. That's why auto online
> >> > >          has been introduced in the first place.  
> >> > 
> >> > Please try to be more specific why "too slow" is a problem. Also how
> >> > much slower are we talking about?
> >>
> >> In virt case on host with lots VMs, userspace handler
> >> processing could be scheduled late enough to trigger a race
> >> between (guest memory going away/OOM handler) and memory
> >> coming online.
> >
> > Either you are mixing two things together or this doesn't really make
> > much sense. So is this a balloning based on memory hotplug (aka active
> > memory hotadd initiated between guest and host automatically) or a guest
> > asking for additional memory by other means (pay more for memory etc.)?
> > Because if this is an administrative operation then I seriously question
> > this reasoning.
> 
> I'm probably repeating myself but it seems this point was lost:
> 
> This is not really a 'ballooning', it is just a pure memory
> hotplug. People may have any tools monitoring their VM memory usage and
> when a VM is running low on memory they may want to hotplug more memory
> to it.

What API do those guests use to ask for the memory? And who is actually
responsible for asking for that memory? Is it a kernel or a userspace
solution?

> With udev-style memory onlining they should be aware of page
> tables and other in-kernel structures which require allocation so they
> need to add memory slowly and gradually or they risk running into OOM
> (at least getting some processes killed and these processes may be
> important). With in-kernel memory hotplug everything happens
> synchronously and no 'slowly and gradually' algorithm is required in
> all tools which may trigger memory hotplug.

What prevents those APIs from being used reasonably, asking only for as
much memory as they can afford? I mean, the ~1.5% of available memory
necessary for the hotplug is not all that much. Or, more precisely, what
prevents asking for this additional memory in a synchronous way?

And just to prevent further confusion, I can see why _in-kernel_
hotplug based 'ballooning' (or whatever you call an on-demand memory hotplug
based on memory pressure) wants to be synchronous, and that is why
my patch changed those users to online the memory explicitly. I am questioning
whether memory hotplug requested by an admin/userspace component needs any
special assistance from the kernel, because it is only a shortcut which can
be implemented from userspace. I hope I've made myself clear
finally.
Vitaly Kuznetsov March 13, 2017, 1:42 p.m. UTC | #33
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
>> >> > > 
>> >> > >        - suggested RFC is not acceptable from virt point of view
>> >> > >          as it regresses guests on top of x86 kvm/vmware which
>> >> > >          both use ACPI based memory hotplug.
>> >> > > 
>> >> > >        - udev/userspace solution doesn't work in practice as it's
>> >> > >          too slow and unreliable when system is under load which
>> >> > >          is quite common in virt usecase. That's why auto online
>> >> > >          has been introduced in the first place.  
>> >> > 
>> >> > Please try to be more specific why "too slow" is a problem. Also how
>> >> > much slower are we talking about?
>> >>
>> >> In virt case on host with lots VMs, userspace handler
>> >> processing could be scheduled late enough to trigger a race
>> >> between (guest memory going away/OOM handler) and memory
>> >> coming online.
>> >
>> > Either you are mixing two things together or this doesn't really make
>> > much sense. So is this a balloning based on memory hotplug (aka active
>> > memory hotadd initiated between guest and host automatically) or a guest
>> > asking for additional memory by other means (pay more for memory etc.)?
>> > Because if this is an administrative operation then I seriously question
>> > this reasoning.
>> 
>> I'm probably repeating myself but it seems this point was lost:
>> 
>> This is not really a 'ballooning', it is just a pure memory
>> hotplug. People may have any tools monitoring their VM memory usage and
>> when a VM is running low on memory they may want to hotplug more memory
>> to it.
>
> What is the API those guests ask for the memory? And who is actually
> responsible to ask for that memory? Is it a kernel or userspace
> solution?

Whatever, this can even be a system administrator running
'free'. The Hyper-V driver sends the si_mem_available() and
vm_memory_committed() metrics to the host every second and these can be
later queried by any tool (e.g. a PowerShell script).
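
(On the Linux side these two counters are roughly what /proc/meminfo
already exposes, so a guest-side sketch of the same data could be as
simple as:

  # MemAvailable is backed by si_mem_available(), Committed_AS by vm_memory_committed()
  grep -E '^(MemAvailable|Committed_AS):' /proc/meminfo
)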

>
>> With udev-style memory onlining they should be aware of page
>> tables and other in-kernel structures which require allocation so they
>> need to add memory slowly and gradually or they risk running into OOM
>> (at least getting some processes killed and these processes may be
>> important). With in-kernel memory hotplug everything happens
>> synchronously and no 'slowly and gradually' algorithm is required in
>> all tools which may trigger memory hotplug.
>
> What prevents those APIs being used reasonably and only asks so much
> memory as they can afford? I mean 1.5% available memory necessary for
> the hotplug is not all that much. Or more precisely what prevents to ask
> for this additional memory in a synchronous way?

The knowledge that we need to add memory slowly and
wait till it gets onlined is not obvious. AFAIR when you hotplug memory
into Windows VMs there is no such thing as 'onlining', and no brain is
required; a simple script 'low memory -> add more memory' always
works. Asking all these script writers to think twice before issuing a
memory add command sounds like too much (to me).
Igor Mammedov March 13, 2017, 1:57 p.m. UTC | #34
On Mon, 13 Mar 2017 11:43:02 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > On Fri, 10 Mar 2017 14:58:07 +0100  
> [...]
> > > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > > [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > > [    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > > [    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > > [    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > > [    0.000000] Zone ranges:
> > > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > > [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> > > [    0.000000]   Normal   empty
> > > [    0.000000] Movable zone start for each node
> > > [    0.000000] Early memory node ranges
> > > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > > [    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> > > [    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
> > > 
> > > so there is neither any normal zone nor movable one at the boot time.  
> > it could be if hotpluggable memory were present at boot time in E802 table
> > (if I remember right when running on hyperv there is movable zone at boot time),
> > 
> > but in qemu hotpluggable memory isn't put into E820,
> > so zone is allocated later when memory is enumerated
> > by ACPI subsystem and onlined.
> > It causes less issues wrt movable zone and works for
> > different versions of linux/windows as well.
> > 
> > That's where in kernel auto-onlining could be also useful,
> > since user would be able to start-up with with small
> > non removable memory plus several removable DIMMs
> > and have all the memory onlined/available by the time
> > initrd is loaded. (missing piece here is onling
> > removable memory as movable by default).  
> 
> Why we should even care to online that memory that early rather than
> making it available via e820?

It's not forbidden by the spec and causes fewer complications
when it comes to removable memory. Declaring it in E820
would add the following limitations/drawbacks:
 - firmware would have to be able to exclude removable memory
   from its own usage (currently neither SeaBIOS nor EFI has to
   know/care about it) => less qemu-guest ABI to maintain.
 - the OS would have to be taught to avoid/move (early) nonmovable
   allocations from removable address ranges.
   There were patches targeting that in recent kernels,
   but it won't work with older kernels that don't have them,
   limiting the range of OSes that could run on QEMU
   and do memory removal.

The E820-less approach works reasonably well with a wide range
of guest OSes and is less complex than if removable memory
were present in E820. Hence I don't have a compelling
reason to introduce removable memory in E820 as it
would only add to the hot(un)plug issues.

I have an off-tree QEMU hack that puts hotremovable
memory added with "-device pc-dimm" on CLI into E820
to experiment with. It could be useful to play
with zone layouts at boot time, so if you are
interested I can rebase it on top of current master
and post it here to play with.
Michal Hocko March 13, 2017, 2:32 p.m. UTC | #35
On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> >> >> > > 
> >> >> > >        - suggested RFC is not acceptable from virt point of view
> >> >> > >          as it regresses guests on top of x86 kvm/vmware which
> >> >> > >          both use ACPI based memory hotplug.
> >> >> > > 
> >> >> > >        - udev/userspace solution doesn't work in practice as it's
> >> >> > >          too slow and unreliable when system is under load which
> >> >> > >          is quite common in virt usecase. That's why auto online
> >> >> > >          has been introduced in the first place.  
> >> >> > 
> >> >> > Please try to be more specific why "too slow" is a problem. Also how
> >> >> > much slower are we talking about?
> >> >>
> >> >> In virt case on host with lots VMs, userspace handler
> >> >> processing could be scheduled late enough to trigger a race
> >> >> between (guest memory going away/OOM handler) and memory
> >> >> coming online.
> >> >
> >> > Either you are mixing two things together or this doesn't really make
> >> > much sense. So is this a balloning based on memory hotplug (aka active
> >> > memory hotadd initiated between guest and host automatically) or a guest
> >> > asking for additional memory by other means (pay more for memory etc.)?
> >> > Because if this is an administrative operation then I seriously question
> >> > this reasoning.
> >> 
> >> I'm probably repeating myself but it seems this point was lost:
> >> 
> >> This is not really a 'ballooning', it is just a pure memory
> >> hotplug. People may have any tools monitoring their VM memory usage and
> >> when a VM is running low on memory they may want to hotplug more memory
> >> to it.
> >
> > What is the API those guests ask for the memory? And who is actually
> > responsible to ask for that memory? Is it a kernel or userspace
> > solution?
> 
> Whatever, this can even be a system administrator running
> 'free'.

I am pretty sure that 'free' will not give you additional memory but
let's be serious here... If this is solely about monitoring from
userspace and requesting more memory from userspace then I would
consider the argument about a timely hotplug operation void, because there is
absolutely no guarantee that the request itself is made in a timely fashion.

> Hyper-V driver sends si_mem_available() and
> vm_memory_committed() metrics to the host every second and this can be
> later queried by any tool (e.g. powershell script).

And how exactly is this related to the acpi hotplug which you were
arguing needs the timely handling as well?
 
> >> With udev-style memory onlining they should be aware of page
> >> tables and other in-kernel structures which require allocation so they
> >> need to add memory slowly and gradually or they risk running into OOM
> >> (at least getting some processes killed and these processes may be
> >> important). With in-kernel memory hotplug everything happens
> >> synchronously and no 'slowly and gradually' algorithm is required in
> >> all tools which may trigger memory hotplug.
> >
> > What prevents those APIs being used reasonably and only asks so much
> > memory as they can afford? I mean 1.5% available memory necessary for
> > the hotplug is not all that much. Or more precisely what prevents to ask
> > for this additional memory in a synchronous way?
> 
> The knowledge about the fact that we need to add memory slowly and
> wait till it gets onlined is not obvious.

Yes it is, and we cannot give a better experience with an
implementation that requires memory in order to online memory.

> AFAIR when you hotplug memory
> to Windows VMs there is no such thing as 'onlining', and no brain is
> required, a simple script 'low memory -> add mory memory' always
> works. Asking all these script writers to think twice before issuing a
> memory add command memory sounds like too much (to me).

Pardon me, but not requiring a brain while doing something on Windows
VMs is not really an argument...
Michal Hocko March 13, 2017, 2:36 p.m. UTC | #36
On Mon 13-03-17 14:57:12, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 11:43:02 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > > On Fri, 10 Mar 2017 14:58:07 +0100  
> > [...]
> > > > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > > > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > > > [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > > > [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > > > [    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > > > [    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > > > [    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > > > [    0.000000] Zone ranges:
> > > > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > > > [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> > > > [    0.000000]   Normal   empty
> > > > [    0.000000] Movable zone start for each node
> > > > [    0.000000] Early memory node ranges
> > > > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > > > [    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> > > > [    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
> > > > 
> > > > so there is neither any normal zone nor movable one at the boot time.  
> > > it could be if hotpluggable memory were present at boot time in E802 table
> > > (if I remember right when running on hyperv there is movable zone at boot time),
> > > 
> > > but in qemu hotpluggable memory isn't put into E820,
> > > so zone is allocated later when memory is enumerated
> > > by ACPI subsystem and onlined.
> > > It causes less issues wrt movable zone and works for
> > > different versions of linux/windows as well.
> > > 
> > > That's where in kernel auto-onlining could be also useful,
> > > since user would be able to start-up with with small
> > > non removable memory plus several removable DIMMs
> > > and have all the memory onlined/available by the time
> > > initrd is loaded. (missing piece here is onling
> > > removable memory as movable by default).  
> > 
> > Why we should even care to online that memory that early rather than
> > making it available via e820?
> 
> It's not forbidden by spec and has less complications
> when it comes to removable memory. Declaring it in E820
> would add following limitations/drawbacks:
>  - firmware should be able to exclude removable memory
>    from its usage (currently SeaBIOS nor EFI have to
>    know/care about it) => less qemu-guest ABI to maintain.
>  - OS should be taught to avoid/move (early) nonmovable
>    allocations from removable address ranges.
>    There were patches targeting that in recent kernels,
>    but it won't work with older kernels that don't have it.
>    So limiting a range of OSes that could run on QEMU
>    and do memory removal.
> 
> E820 less approach works reasonably well with wide range
> of guest OSes and less complex that if removable memory
> were present it E820. Hence I don't have a compelling
> reason to introduce removable memory in E820 as it
> only adds to hot(un)plug issues.

OK, I see, and that sounds like an argument for not putting those ranges into
E820. I still fail to see why we have to online the memory early during
boot and cannot wait for userspace to run.
Reza Arbab March 13, 2017, 2:58 p.m. UTC | #37
On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
>> I agree with your general sentiment that this stuff is very 
>> nonintuitive.
>
>My criterion for nonintuitive is probably different because I would 
>call this _completely_unusable_. Sorry for being so loud about this but 
>the more I look into this area the more WTF code I see. This has seen 
>close to zero review and seems to be building up more single usecase 
>code on top of previous. We need to change this, seriously!

No argument here. I'm happy to help however I can.
Vitaly Kuznetsov March 13, 2017, 3:10 p.m. UTC | #38
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote:
>> >
>> > What is the API those guests ask for the memory? And who is actually
>> > responsible to ask for that memory? Is it a kernel or userspace
>> > solution?
>> 
>> Whatever, this can even be a system administrator running
>> 'free'.
>
> I am pretty sure that 'free' will not give you additional memory but
> let's be serious here... 
> If this is solely about monitoring from
> userspace and requesting more memory from the userspace then I would
> consider arguing about timely hotplug operation as void because there is
> absolutely no guarantee to do the request itself in a timely fashion.
>
>> Hyper-V driver sends si_mem_available() and
>> vm_memory_committed() metrics to the host every second and this can be
>> later queried by any tool (e.g. powershell script).
>
> And how exactly is this related to the acpi hotplug which you were
> arguing needs the timely handling as well?
>

What I meant to say is that there is no single 'right' way to get memory
usage from a VM, make a decision somewhere (in the hypervisor, on some
other host or even in someone's head) and issue a command to add more
memory. I don't know what particular tools people use with ESX/KVM VMs
but I think that multiple options are available.

>> >> With udev-style memory onlining they should be aware of page
>> >> tables and other in-kernel structures which require allocation so they
>> >> need to add memory slowly and gradually or they risk running into OOM
>> >> (at least getting some processes killed and these processes may be
>> >> important). With in-kernel memory hotplug everything happens
>> >> synchronously and no 'slowly and gradually' algorithm is required in
>> >> all tools which may trigger memory hotplug.
>> >
>> > What prevents those APIs being used reasonably and only asks so much
>> > memory as they can afford? I mean 1.5% available memory necessary for
>> > the hotplug is not all that much. Or more precisely what prevents to ask
>> > for this additional memory in a synchronous way?
>> 
>> The knowledge about the fact that we need to add memory slowly and
>> wait till it gets onlined is not obvious.
>
> yes it is and we cannot afford to give a better experience with the
> implementation that requires to have memory to online a memory.

Actually, we need memory to add memory, not to online it. And as I said
before, this is a real issue which requires addressing: it should always
be possible to add more memory (and to online already added memory if
these events are separated).

>
>> AFAIR when you hotplug memory
>> to Windows VMs there is no such thing as 'onlining', and no brain is
>> required, a simple script 'low memory -> add mory memory' always
>> works. Asking all these script writers to think twice before issuing a
>> memory add command memory sounds like too much (to me).
>
> Pardon me, but not requiring a brain while doing something on Windows
> VMs is not really an argument...

Why? Windows (or any other OS) is just an example that things can be
done in a different way, otherwise we'll end up with arguments like "it
was always like that in Linux so it's good".
Michal Hocko March 13, 2017, 3:11 p.m. UTC | #39
Let's add Andi

On Fri 10-03-17 16:53:33, Michal Hocko wrote:
> On Fri 10-03-17 14:58:07, Michal Hocko wrote:
> [...]
> > This would explain why onlining from the last block actually works but
> > to me this sounds like a completely crappy behavior. All we need to
> > guarantee AFAICS is that Normal and Movable zones do not overlap. I
> > believe there is even no real requirement about ordering of the physical
> > memory in Normal vs. Movable zones as long as they do not overlap. But
> > let's keep it simple for the start and always enforce the current status
> > quo that Normal zone is physically preceeding Movable zone.
> > Can somebody explain why we cannot have a simple rule for Normal vs.
> > Movable which would be:
> > 	- block [pfn, pfn+block_size] can be Normal if
> > 	  !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> > 	- block [pfn, pfn+block_size] can be Movable if
> > 	  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
> 
> OK, so while I was playing with this setup some more I probably got why
> this is done this way. All new memblocks are added to the zone Normal
> where they are accounted as spanned but not present. When we do
> online_movable we just cut from the end of the Normal zone and move it
> to Movable zone. This sounds really awkward. What was the reason to go
> this way? Why cannot we simply add those pages to the zone at the online
> time?

Answering myself. So the reason seems to be 9d99aaa31f59 ("[PATCH]
x86_64: Support memory hotadd without sparsemem"), whose premise is no
longer true because
config MEMORY_HOTPLUG
        bool "Allow for memory hot-add"
        depends on SPARSEMEM || X86_64_ACPI_NUMA
        depends on ARCH_ENABLE_MEMORY_HOTPLUG
        depends on COMPILE_TEST || !KASAN

so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled.
config X86_64_ACPI_NUMA
        def_bool y
        prompt "ACPI NUMA detection"
        depends on X86_64 && NUMA && ACPI && PCI
        select ACPI_NUMA

But I do not see any way to enable anything but SPARSEMEM for x86_64
choice
        prompt "Memory model"
        depends on SELECT_MEMORY_MODEL
        default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
        default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
        default FLATMEM_MANUAL

ARCH_DISCONTIGMEM_DEFAULT is 32b only
config ARCH_DISCONTIGMEM_DEFAULT
        def_bool y
        depends on NUMA && X86_32

and ARCH_SPARSEMEM_DEFAULT is enabled on 64b. So I guess whatever was
the reason to add this code back in 2006 is not true anymore. So I am
really wondering: do we absolutely need to assign pages which are not
onlined yet to ZONE_NORMAL unconditionally? Why can't we keep them
outside of any zone and wait for the memory online operation to put them
where requested?
Andi Kleen March 13, 2017, 11:16 p.m. UTC | #40
> and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was
> the reason to add this code back in 2006 is not true anymore. So I am
> really wondering. Do we absolutely need to assign pages which are not
> onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
> out of any zone and wait for memory online operation to put them where
> requested?

I don't remember any specific reason. May have just been simplification.
Should be fine to use any zone.

-Andi
Igor Mammedov March 14, 2017, 1:20 p.m. UTC | #41
On Mon, 13 Mar 2017 13:28:25 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > On Thu, 9 Mar 2017 13:54:00 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > [...]  
> > > > It's major regression if you remove auto online in kernels that
> > > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > > while breaking useful functionality doesn't make sense.
> > > > 
> > > > I would ACK config option removal if auto online keeps working
> > > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > > and keep kernel CLI option to override default.
> > > > 
> > > > That doesn't mean that others will agree with flipping default,
> > > > that's why config option has been added.
> > > > 
> > > > Now to sum up what's been discussed on this thread, there were 2
> > > > different issues discussed:
> > > >   1) memory hotplug: remove in kernel auto online for all
> > > >                      except of hyperv/xen
> > > > 
> > > >        - suggested RFC is not acceptable from virt point of view
> > > >          as it regresses guests on top of x86 kvm/vmware which
> > > >          both use ACPI based memory hotplug.
> > > > 
> > > >        - udev/userspace solution doesn't work in practice as it's
> > > >          too slow and unreliable when system is under load which
> > > >          is quite common in virt usecase. That's why auto online
> > > >          has been introduced in the first place.    
> > > 
> > > Please try to be more specific why "too slow" is a problem. Also how
> > > much slower are we talking about?  
> >
> > In virt case on host with lots VMs, userspace handler
> > processing could be scheduled late enough to trigger a race
> > between (guest memory going away/OOM handler) and memory
> > coming online.  
> 
> Either you are mixing two things together or this doesn't really make
> much sense. So is this a balloning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.
It doesn't have to be a user-initiated action. Have you heard about
pay-as-you-go phone plans? The same use case applies to cloud environments,
where what is typically a user-initiated hotplug action on bare metal
can easily be automated into hotplug on demand.


> [...]
> > > >        - currently if one wants to use online_movable,
> > > >          one has to either
> > > >            * disable auto online in kernel OR    
> > > 
> > > which might not just work because an unmovable allocation could have
> > > made the memblock pinned.  
> >
> > With memhp_default_state=offline on kernel CLI there won't be any
> > unmovable allocation as hotplugged memory won't be onlined and
> > user can online it manually. So it works for non default usecase
> > of playing with memory hot-unplug.  
> 
> I was talking about the case when the auto_online was true, of course,
> e.g. depending on the config option which you've said is enabled in
> Fedora kernels.
>
> [...] 
> > > >          I'm in favor of implementing that in kernel as it keeps
> > > >          kernel internals inside kernel and doesn't need
> > > >          kernel API to be involved (memory blocks in sysfs,
> > > >          online_kernel, online_movable)
> > > >          There would be no need in userspace which would have to
> > > >          deal with kernel zoo and maintain that as well.    
> > > 
> > > The kernel is supposed to provide a proper API and that is sysfs
> > > currently. I am not entirely happy about it either but pulling a lot of
> > > code into the kernel is not the rigth thing to do. Especially when
> > > different usecases require different treatment.  
> >
> > If it could be done from kernel side alone, it looks like a better way
> > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > possible to implement without adding a lot of kernel code.  
> 
> But this is not how we do the kernel development. We provide the API so
> that userspace can implement the appropriate policy on top. We do not
> add random knobs to implement the same thing in the kernel. Different
> users might want to implement different onlining strategies and that is
> hardly describable by a single global knob. Just look at the s390
> example provided earlier. Please try to think out of your usecase scope.
And could you think outside the scope of the legacy sysfs-based onlining usecase?

I don't think that comparing s390 with x86 is correct, as the platforms
and hardware implementations of memory hotplug are different, with
correspondingly different requirements; hence CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
was introduced to allow a platform to specify the behavior.

For x86/ARM(+ACPI) it's possible to implement hotplug in a race-free
way inside the kernel without userspace intervention, onlining memory
using a hardware-vendor-defined policy (the ACPI SRAT/Memory device describes
memory sufficiently to do it) so the user won't have to do it manually;
the config option is a convenient way to enable the new feature for platforms
that can support it.

It's good to maintain a uniform API to userspace as long as the API does
the job, but being stuck with the legacy way isn't good when
there is a way (even though it's limited to a limited set of platforms)
to improve it by removing the need for the API, making the overall system
less complex and race-free (more reliable).

> > That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> > so we could continue on improving kernel only auto-onlining
> > and fixing current memory hot(un)plug issues without affecting
> > other platforms/users that are no interested in it.  
> 
> I really do not see any reason to keep the config option. Setting up
> this to enabled is _wrong_ thing to do in general purpose
> (distribution) kernel and a kernel for the specific usecase can achieve
> the same thing via boot command line.
I have to disagree with you that setting the policy 'not online by default'
in the kernel is more valid than the opposite policy 'online by default'.
It may work for your usecases but that doesn't mean that it suits the
needs of others.

As an example, the RHEL distribution (x86) is shipped with the memory
auto-online policy enabled by default, as it's what customers ask for.

And onlining memory as removable is considered a specific usecase,
since arguably the number of users where physical memory removal is
supported is smaller than the number of users where just hot-add is
supported; plus the single virt usecase adds a huge userbase to
the latter, as it's easily available/accessible versus baremetal
hotplug.

So the default depends on the target audience, and distributions need
a config option to pick the default that suits their customers' needs.
If we don't provide a reliably working memory hot-add solution,
customers will just move to an OS that does (Windows, or with your
patch a hyperv/xen based cloud instead of KVM/VMware).

> > (PS: I don't care much about sysfs knob for setting auto-onlining,
> > as kernel CLI override with memhp_default_state seems
> > sufficient to me)  
> 
> That is good to hear! I would be OK with keeping the kernel command line
> option until we resolve all the current issues with the hotplug.
Your RFC doesn't fix anything except cleaning up the config option,
and even at that it does so inconsistently, breaking both userspaces:
 - one that does expect auto-online:
    a kernel update on Fedora will break memory hot-add
    (on KVM/VMware hosts) since userspace doesn't ship any
    scripts that would do it, but it will continue to work on
    hyperv/xen hosts.
 - another that doesn't expect auto-online:
    no change for KVM/VMware, but suddenly hyperv/xen would
    start auto-onlining memory.
So users would have broken VMs until the regression is noticed
and would have to manually fix the userspace that they use to
accommodate the 'improved' kernel.

This RFC, under the guise of a clean up, is removing the default choice
from distributions and actually crippling linux guests on top
of KVM/VMware hosts while favoring xen/hyperv hosts.
IMHO it doesn't make any sense.
Yasuaki Ishimatsu March 14, 2017, 4:05 p.m. UTC | #42
On 03/13/2017 05:19 AM, Michal Hocko wrote:
> On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote:
>> On 03/10/2017 08:58 AM, Michal Hocko wrote:
> [...]
>>> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
>>> [...]
>>> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
>>> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
>>> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
>>> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
>>> [    0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
>>> [    0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
>>> [    0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
>>> [    0.000000] Zone ranges:
>>> [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
>>> [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
>>> [    0.000000]   Normal   empty
>>> [    0.000000] Movable zone start for each node
>>> [    0.000000] Early memory node ranges
>>> [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
>>> [    0.000000]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
>>> [    0.000000]   node   1: [mem 0x0000000040000000-0x000000007ffdffff]
>>>
>>> so there is neither any normal zone nor movable one at the boot time.
>>> Then I hotplugged 1G slot
>>> (qemu) object_add memory-backend-ram,id=mem1,size=1G
>>> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
>>>
>>> unfortunatelly the memory didn't show up automatically and I got
>>> [  116.375781] acpi PNP0C80:00: Enumeration failure
>>>
>>> so I had to probe it manually (prbably the BIOS my qemu uses doesn't
>>> support auto probing - I haven't really dug further). Anyway the SRAT
>>> table printed during the boot told that we should start at 0x100000000
>>>
>>> # echo 0x100000000 > /sys/devices/system/memory/probe
>>> # grep . /sys/devices/system/memory/memory32/valid_zones
>>> Normal Movable
>>>
>>> which looks reasonably right? Both Normal and Movable zones are allowed
>>>
>>> # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
>>> # grep . /sys/devices/system/memory/memory3?/valid_zones
>>> /sys/devices/system/memory/memory32/valid_zones:Normal
>>> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>>>
>>> Huh, so our valid_zones have changed under our feet...
>>>
>>> # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
>>> # grep . /sys/devices/system/memory/memory3?/valid_zones
>>> /sys/devices/system/memory/memory32/valid_zones:Normal
>>> /sys/devices/system/memory/memory33/valid_zones:Normal
>>> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>>>
>>> and again. So only the last memblock is considered movable. Let's try to
>>> online them now.
>>>
>>> # echo online_movable > /sys/devices/system/memory/memory34/state
>>> # grep . /sys/devices/system/memory/memory3?/valid_zones
>>> /sys/devices/system/memory/memory32/valid_zones:Normal
>>> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>>> /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>>>
>>
>> I think there is no strong reason which kernel has the restriction.
>> By setting the restrictions, it seems to have made management of
>> these zone structs simple.
>
> Could you be more specific please? How could this make management any
> easier when udev is basically racing with the physical hotplug and the
> result is basically undefined?
>

When changing a zone from NORMAL(N) to MOVABLE(M), we must resize both zones,
i.e. zone->zone_start_pfn and zone->spanned_pages. Currently there is the
restriction.

So we just simply change:
   zone(N)->spanned_pages -= nr_pages
   zone(M)->zone_start_pfn -= nr_pages

But if any memory block can change its zone with no restriction, we must
recalculate these zones' spanned_pages and zone_start_pfn as follows:

   memory section #
    1 2 3 4 5 6 7
   |N|M|N|N|N|M|M|
      |
   |N|N|N|N|N|M|M|
  * change memory section #2 from MOVABLE to NORMAL.
    then we must find next movable memory section (#6) and resize these zones.

I think that when movable memory was implemented there was no requirement
for this, so the kernel has the current restriction.

Thanks,
Yasuaki Ishimatsu
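
(The zone resizing described above can be observed from userspace; a
sketch, reusing memory34 from the earlier example as an illustrative
last-Normal block:

  # zone start_pfn/spanned before and after moving the last Normal block
  grep -E '^Node|spanned|start_pfn' /proc/zoneinfo
  echo online_movable > /sys/devices/system/memory/memory34/state
  grep -E '^Node|spanned|start_pfn' /proc/zoneinfo
)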
Michal Hocko March 14, 2017, 4:20 p.m. UTC | #43
On Tue 14-03-17 12:05:59, YASUAKI ISHIMATSU wrote:
> 
> 
> On 03/13/2017 05:19 AM, Michal Hocko wrote:
> >On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote:
> >>On 03/10/2017 08:58 AM, Michal Hocko wrote:
[...]
> >>># echo online_movable > /sys/devices/system/memory/memory34/state
> >>># grep . /sys/devices/system/memory/memory3?/valid_zones
> >>>/sys/devices/system/memory/memory32/valid_zones:Normal
> >>>/sys/devices/system/memory/memory33/valid_zones:Normal Movable
> >>>/sys/devices/system/memory/memory34/valid_zones:Movable Normal
> >>>
> >>
> >>I think there is no strong reason which kernel has the restriction.
> >>By setting the restrictions, it seems to have made management of
> >>these zone structs simple.
> >
> >Could you be more specific please? How could this make management any
> >easier when udev is basically racing with the physical hotplug and the
> >result is basically undefined?
> >
> 
> When changing zone from NORMAL(N) to MOVALBE(M), we must resize both zones,
> zone->zone_start_pfn and zone->spanned_pages. Currently there is the
> restriction.
> 
> So we just simply change:
>   zone(N)->spanned_pages -= nr_pages
>   zone(M)->zone_start_pfn -= nr_pages

Yes, I understand how this made the implementation simpler. I was
questioning how this made user management any easier. Changing
valid zones in a way that races with the hotplug consumer (e.g. udev) sounds
like a terrible idea to me.

Anyway, it seems that the initial assumption/restriction that all
pages have to start in zone Normal is not really needed. I have a
preliminary patch which removes that and associates newly added pages
with a zone at online time, and it seems to be working reasonably
well. I have to iron out some corners before I post it.
Andrea Arcangeli March 14, 2017, 7:35 p.m. UTC | #44
Hello,

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > >OK, so while I was playing with this setup some more I probably got why
> > >this is done this way. All new memblocks are added to the zone Normal
> > >where they are accounted as spanned but not present.
> > 
> > It's not always zone Normal. See zone_for_memory(). This leads to a
> > workaround for having to do online_movable in descending block order.
> > Instead of this:
> > 
> > 1. probe block 34, probe block 33, probe block 32, ...
> > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > 
> > you can online_movable the first block before adding the rest:
> 
> I do I enforce that behavior when the probe happens automagically?

What I provided as a guide to online memory as movable in current and older
kernels is:

1) Remove udev rule
2) After adding memory with qemu/libvirt API run in the guest

------- workaround start ----
#!/bin/bash
# online every still-offline memory block as movable, highest block number first
for i in $(ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5); do
	if [ "$(cat $i/state)" == "offline" ]; then
		echo online_movable > $i/state
	fi
done
------- workaround end ----

That's how bad onlining as movable for memory hotunplug currently is.

> > 1. probe block 32, online_movable 32
> > 2. probe block 33, probe block 34, ...
> > 	- zone_for_memory() will cause these to start Movable
> > 3. online 33, online 34, ...
> > 	- they're already in Movable, so online_movable is equivalentr
> > 
> > I agree with your general sentiment that this stuff is very nonintuitive.
> 
> My criterion for nonintuitive is probably different because I would call
> this _completely_unusable_. Sorry for being so loud about this but the
> more I look into this area the more WTF code I see. This has seen close
> to zero review and seems to be building up more single usecase code on
> top of previous. We need to change this, seriously!

It's not a bug, but when I found it I called it a "constraint" and
when I debugged it (IIRC) it originated here:

	} else if (online_type == MMOP_ONLINE_MOVABLE) {
		if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
			return -EINVAL;
	}

Fixing it so you could online as movable even if it wasn't the last
block was in my todo list but then we had other plans.

Unfortunately, unless pfn+nr_pages of the newly created Movable zone
matches the end of the Normal zone (or whatever was there that has to
be converted to Movable), onlining as movable will fail.

To fix it, it is enough to check that everything from pfn to the end of
whatever zone existed there (or the end of the node, which is perhaps safer)
can be converted to movable, and just convert the whole thing to movable
instead of stopping at pfn+nr_pages.

Also note that onlining highmem physical ranges as movable requires yet
another config option to be set for the below check to pass
(CONFIG_MOVABLE_NODE=y). I'm not exactly sure why anybody would
want to set it =n, and it would perhaps be a candidate for dropping well
before considering dropping _DEFAULT_ONLINE; at best it should be
replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y
behavior.

	if ((zone_idx(zone) > ZONE_NORMAL ||
	    online_type == MMOP_ONLINE_MOVABLE) &&
	    !can_online_high_movable(zone))
		return -EINVAL;

I would suggest dropping the constraints in online_pages, and perhaps
also the CONFIG_MOVABLE_NODE option, before going on to drop the automatic
onlining in the kernel.

Because of the above constraint the udev rule is unusable anyway for
onlining memory as movable so that it can be hotunplugged reliably
(well, not so reliably: page migration does a loop and tries many times,
but it occurred to me that it once failed to offline and worked at the
next try; temporary page pins with O_DIRECT interfere with page migration,
rightfully so, and there is no easy fix).

After the above constraint is fixed I suggest looking into fixing the
async onlining by making the udev rule run synchronously. Adding 4T to
an 8G guest is a perfectly reasonable and normal operation; there is no
excuse for it to fail as long as you don't pretend such 4T to be
unpluggable too later (which we won't).

As I understand it, the whole point of _DEFAULT_ONLINE=y is precisely
that it's easier to fix this in the kernel than to fix the udev
rule. Furthermore, the above zone-shifting constraint which
breaks online_movable in udev is not an issue for _DEFAULT_ONLINE=y,
because kernel onlining is synchronous by design (no special
synchronous udev rule fix is required), so it can cope fine with the
existing constraint by onlining as movable from the end of the
last zone in each node, so that the constraint never gets in the way.

Extending _DEFAULT_ONLINE=y so that it can also online as movable has
been worked on.

On a side note, kernelcore=xxx passed to a kernel with
_DEFAULT_ONLINE=y should already make it possible to create lots of
hot-unpluggable memory onlined automatically as movable (never tested,
but I would expect it to work).
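
(A sketch of what that untested combination might look like on the boot
command line; the size is an arbitrary example:

  # kernelcore= caps the memory usable for unmovable kernel allocations,
  # the remainder is placed into ZONE_MOVABLE at boot
  GRUB_CMDLINE_LINUX="... kernelcore=4G memhp_default_state=online"
)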

After the udev rule can handle adding 4T to an 8G guest and it can
handle onlining memory as movable reliably by just doing
s/online/online_movable/, then I think we can restart the discussion
about the removal of _DEFAULT_ONLINE=y. As far as I can tell, there
are higher priority and more interesting things to fix in this area
before _DEFAULT_ONLINE=y can be removed. Either udev gets fixed, and
it's reasonably simple to fix (it will remain slower), or we should
eventually obsolete the udev rule instead, which is why the focus
hasn't been on fixing the udev rule but on replacing it instead.

To be clear, I'm not necessarily against eventually removing
_DEFAULT_ONLINE, but an equally reliable and comparably fast
alternative should be provided first, and there's no such alternative
right now.

If s390 has special issues requiring an admin or a software hotplug
manager app to enable blocks of memory by hand, then _DEFAULT_ONLINE
could be an option selected or not selected by
arch/*/Kconfig. The udev rule is still an automatic action, so it's 1:1
with the in-kernel feature. If the in-kernel automatic onlining is not
workable on s390 I would assume the udev rule is also not workable.

Thanks,
Andrea
Michal Hocko March 15, 2017, 7:53 a.m. UTC | #45
On Tue 14-03-17 14:20:14, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 13:28:25 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > > On Thu, 9 Mar 2017 13:54:00 +0100
> > > Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > > The kernel is supposed to provide a proper API and that is sysfs
> > > > currently. I am not entirely happy about it either but pulling a lot of
> > > > code into the kernel is not the rigth thing to do. Especially when
> > > > different usecases require different treatment.  
> > >
> > > If it could be done from kernel side alone, it looks like a better way
> > > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > > possible to implement without adding a lot of kernel code.  
> > 
> > But this is not how we do the kernel development. We provide the API so
> > that userspace can implement the appropriate policy on top. We do not
> > add random knobs to implement the same thing in the kernel. Different
> > users might want to implement different onlining strategies and that is
> > hardly describable by a single global knob. Just look at the s390
> > example provided earlier. Please try to think out of your usecase scope.
>
> And could you think outside of legacy sysfs based onlining usecase scope?

Well, I always prefer a more generic solution which supports more
usecases, and I am trying really hard to understand the usecases you are
coming up with. So far I have heard that the current sysfs behavior is
broken (which is true!) and some very vague arguments about why we need
to online as quickly as possible, to the point that userspace handling is
an absolute no-go.

To be honest I still consider the latter a non-issue. If the only thing
you care about is the memory footprint of the first phase then I believe
this is fixable. Memblock and section descriptors should be the only
things necessary to allocate, and that is not much.

As an aside, the more I think about the way the original authors
separated the physical hotadd from onlining, the more I appreciate that
decision, because the way memory can be onlined is definitely not
carved in stone and evolves with usecases. I believe nobody expected
back then that memory could be onlined as movable, and I am pretty sure
new ways will emerge over time.
 
> I don't think that S390 comparing with x86 is correct as platforms
> and hardware implementations of memory hotplug are different with
> correspondingly different requirements, hence CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> were introduced to allows platform specify behavior.

There are different usecases which are arch agnostic, e.g. deciding the
movability based on some criterion (a specific node, a physical address
range and what not). Global auto-onlining cannot handle those for obvious
reasons, and a config option will not achieve that either, for the same
reason.

> For x86/ARM(+ACPI) it's possible to implement hotplug in race free
> way inside kernel without userspace intervention, onlining memory
> using hardware vendor defined policy (ACPI SRAT/Memory device describe
> memory sufficiently to do it) so user won't have to do it manually,
> config option is a convenient way to enable new feature for platforms
> that could support it.

Sigh. Can you see the actual difference between the global kernel policy
and the policy coming from the specific hardware (ACPI etc...)? I am not
opposing auto onlining based on the ACPI attributes. But what we have
now is a policy _in_the_kernel_. This is almost always a bad idea and
I do not see any strong argument why it would be any different in this
particular case. Actually your current default in Fedora makes it harder
for anybody to use movable zones/nodes.

> It's good to maintain uniform API to userspace as far as API does
> the job, but being stuck to legacy way isn't good when
> there is a way (even though it's limited to limited set of platforms)
> to improve it by removing need for API, making overall less complex
> and race-less (more reliable) system.

then convince your virtualization platform to provide necessary data
for the memory auto onlining via ACPI etc...

> > > That's one more of a reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> > > so we could continue on improving kernel only auto-onlining
> > > and fixing current memory hot(un)plug issues without affecting
> > > other platforms/users that are no interested in it.  
> > 
> > I really do not see any reason to keep the config option. Setting up
> > this to enabled is _wrong_ thing to do in general purpose
> > (distribution) kernel and a kernel for the specific usecase can achieve
> > the same thing via boot command line.
>
> I have to disagree with you that setting policy 'not online by default'
> in kernel is more valid than opposite policy 'online by default'.
> It maybe works for your usecases but it doesn't mean that it suits
> needs of others.

Well, as described above there are good reasons not to hardwire any
policy into the kernel, because things tend to evolve and come with many
surprising usecases the original authors haven't anticipated at all.

On the other hand we have your auto_online policy which handles _one_
particular class of usecases, which I believe could have been addressed
by enhancing the implementation of the current interface, e.g. allocate
less memory in the initial phase, preemptively fail the first phase
when there is too much memory waiting for onlining, or even help udev to
react faster by having preallocated workers to handle events. Instead, I
suspect, you have chosen the path of least resistance/effort and now
we've ended up with a global policy with known limitations. I cannot say
I would be happy about that.

> As example RHEL distribution (x86) are shipped with memory
> autoonline enabled by default policy as it's what customers ask for.
> 
> And onlining memory as removable considered as a specific usecase,
> since arguably a number of users where physical memory removal is
> supported is less than a number of users where just hot add is
> supported, plus single virt usecase adds huge userbase to
> the later as it's easily available/accessible versus baremetal
> hotplug.

this might be the case now but might turn out to be a completely wrong
thing to do in a few years when overhyped^Wcloud workloads won't be all
that interesting anymore.

> So default depends on target audience and distributions need
> a config option to pick default that suits its customers needs.

Well, I would hope that such a thing could be achieved by more flexible
means than the kernel config... e.g. pre-defined defaults that I can
install as a package rather than enforcing a particular policy on
everybody.

> If we don't provide reliably working memory hot-add solution
> customers will just move to OS that does (Windows or with your
> patch hyperv/xen based cloud instead of KVM/VMware.
>
> > > (PS: I don't care much about sysfs knob for setting auto-onlining,
> > > as kernel CLI override with memhp_default_state seems
> > > sufficient to me)  
> > 
> > That is good to hear! I would be OK with keeping the kernel command line
> > option until we resolve all the current issues with the hotplug.
>
> Your RFC doesn't fix anything except cleaning up the config option,
> and even at that it does it inconsistently, breaking both userspaces:
>  - one that does expect auto-online:
>     a kernel update on Fedora will break memory hot-add
>     (on KVM/VMware hosts) since userspace doesn't ship any
>     scripts that would do it, but it will continue to work on
>     hyperv/xen hosts.

that is actually trivial to fix: provide a userspace fix while the
kernel still offers the functionality and remove the kernel side
later. Nobody is talking about removing the whole thing at once. API
changes are not that simple at all.

>  - another that doesn't expect auto-online:
>     no change for KVM/VMware but suddenly hyperv/xen would
>     start auto-onlining memory.

I would argue that removing a policy which covers only some usecases
is a fix, but whatever. We obviously disagree here...

Anyway, I consider "never break the userspace" to be a hard rule and I
do not want to break any usecase of course. I thought this RFC would
help to trigger a constructive discussion with some reasonable outcome
where we would get rid of the cruft eventually. It seems this will not
be the case because getting an immediate half-solution is preferred
much more than exhausting all the potential options these days.

I am sorry, but I have to say I really hate the way this all sneaked in
without a wider review. If this went through a proper review
process it would get a straight NAK, from me at least, I believe.
Michal Hocko March 15, 2017, 7:57 a.m. UTC | #46
On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> > On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > > >OK, so while I was playing with this setup some more I probably got why
> > > >this is done this way. All new memblocks are added to the zone Normal
> > > >where they are accounted as spanned but not present.
> > > 
> > > It's not always zone Normal. See zone_for_memory(). This leads to a
> > > workaround for having to do online_movable in descending block order.
> > > Instead of this:
> > > 
> > > 1. probe block 34, probe block 33, probe block 32, ...
> > > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > > 
> > > you can online_movable the first block before adding the rest:
> > 
> > How do I enforce that behavior when the probe happens automagically?
> 
> What I provided as a guide to online as movable in current and older
> kernels is:
> 
> 1) Remove udev rule
> 2) After adding memory with qemu/libvirt API run in the guest
> 
> ------- workaround start ----
> #!/bin/bash
> for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; done
> ------- workaround end ----
> 
> That's how bad onlining as movable is for memory hotunplug.

Yeah, this is really yucky. Fortunately, I already have a working prototype
which removes this restriction altogether.
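
For anybody trying to parse the one-liner above, here is an expanded
form of the same workaround (it should behave identically; the sort key
just extracts the numeric block id from the sysfs path):

	# walk memory blocks from the highest block id down; splitting
	# the path on 'y' makes field 5 the numeric block id
	for blk in $(ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5); do
		if [ "$(cat $blk/state)" = "offline" ]; then
			echo online_movable > $blk/state
		fi
	done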

> > > 1. probe block 32, online_movable 32
> > > 2. probe block 33, probe block 34, ...
> > > 	- zone_for_memory() will cause these to start Movable
> > > 3. online 33, online 34, ...
> > > 	- they're already in Movable, so online_movable is equivalent
> > > 
> > > I agree with your general sentiment that this stuff is very nonintuitive.
> > 
> > My criterion for nonintuitive is probably different because I would call
> > this _completely_unusable_. Sorry for being so loud about this but the
> > more I look into this area the more WTF code I see. This has seen close
> > to zero review and seems to be building up more single-usecase code on
> > top of what was there before. We need to change this, seriously!
> 
> It's not a bug, but when I found it I called it a "constraint" and
> when I debugged it (IIRC) it originated here:
> 
> 	} else if (online_type == MMOP_ONLINE_MOVABLE) {
> 		if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
> 			return -EINVAL;
> 	}
> 
> Fixing it so you could online as movable even if it wasn't the last
> block was on my todo list but then we had other plans.
> 
> Unfortunately unless pfn+nr_pages of the newly created Movable zone
> matches the end of the normal zone (or whatever was there that has to
> be converted to Movable), it will fail onlining as movable.

Well, I think the main problem is that we associate pages added in the
first phase (probe) with the Normal zone. Everything else is just
fallout and fiddling to make it work somehow.
 
[...]

> To be clear, I'm not necessarily against eventually removing
> _DEFAULT_ONLINE, but an equally reliable and comparably fast
> alternative should be provided first and there's no such alternative
> right now.

As pointed out in the other reply, that is not the intention here. I
primarily wanted to understand the scope of the problem. I believe this
config option was rushed into the kernel without considering other
alternatives which would make hotplug more usable by others as well.
 
> If s390 has special issues requiring an admin or a software hotplug
> manager app to enable blocks of memory by hand, _DEFAULT_ONLINE
> could then be an option selected or not selected by
> arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1
> with the in-kernel feature. If the in-kernel automatic onlining is not
> workable on s390 I would assume the udev rule is also not workable.

But this is not about s390. It is about different usecases which require
different onlining policies, and that is the main problem with the one
hardcoded in the kernel. See the other reply.
diff mbox

Patch

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309c..2b1c35fb36d1 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@  static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = add_memory(node, info->start_addr, info->length);
+		result = add_memory(node, info->start_addr, info->length, false);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fa26ffd25fa6..476c2c02f938 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -446,37 +446,6 @@  print_block_size(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(block_size_bytes, 0444, print_block_size, NULL);
 
 /*
- * Memory auto online policy.
- */
-
-static ssize_t
-show_auto_online_blocks(struct device *dev, struct device_attribute *attr,
-			char *buf)
-{
-	if (memhp_auto_online)
-		return sprintf(buf, "online\n");
-	else
-		return sprintf(buf, "offline\n");
-}
-
-static ssize_t
-store_auto_online_blocks(struct device *dev, struct device_attribute *attr,
-			 const char *buf, size_t count)
-{
-	if (sysfs_streq(buf, "online"))
-		memhp_auto_online = true;
-	else if (sysfs_streq(buf, "offline"))
-		memhp_auto_online = false;
-	else
-		return -EINVAL;
-
-	return count;
-}
-
-static DEVICE_ATTR(auto_online_blocks, 0644, show_auto_online_blocks,
-		   store_auto_online_blocks);
-
-/*
  * Some architectures will have custom drivers to do this, and
  * will not need to do it from userspace.  The fake hot-add code
  * as well as ppc64 will do all of their discovery in userspace
@@ -500,7 +469,7 @@  memory_probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = add_memory(nid, phys_addr,
-			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			 MIN_MEMORY_BLOCK_SIZE * sections_per_block, false);
 
 	if (ret)
 		goto out;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 14c3dc4bd23c..3e052bedade5 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -535,11 +535,6 @@  struct hv_dynmem_device {
 	bool host_specified_ha_region;
 
 	/*
-	 * State to synchronize hot-add.
-	 */
-	struct completion  ol_waitevent;
-	bool ha_waiting;
-	/*
 	 * This thread handles hot-add
 	 * requests from the host as well as notifying
 	 * the host with regards to memory pressure in
@@ -587,11 +582,6 @@  static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 		spin_lock_irqsave(&dm_device.ha_lock, flags);
 		dm_device.num_pages_onlined += mem->nr_pages;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
-	case MEM_CANCEL_ONLINE:
-		if (dm_device.ha_waiting) {
-			dm_device.ha_waiting = false;
-			complete(&dm_device.ol_waitevent);
-		}
 		break;
 
 	case MEM_OFFLINE:
@@ -683,12 +673,9 @@  static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		has->covered_end_pfn +=  processed_pfn;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 
-		init_completion(&dm_device.ol_waitevent);
-		dm_device.ha_waiting = !memhp_auto_online;
-
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				(HA_CHUNK << PAGE_SHIFT), true);
 
 		if (ret) {
 			pr_warn("hot_add memory failed error is %d\n", ret);
@@ -708,17 +695,6 @@  static void hv_mem_hot_add(unsigned long start, unsigned long size,
 			spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 			break;
 		}
-
-		/*
-		 * Wait for the memory block to be onlined when memory onlining
-		 * is done outside of kernel (memhp_auto_online). Since the hot
-		 * add has succeeded, it is ok to proceed even if the pages in
-		 * the hot added region have not been "onlined" within the
-		 * allowed time.
-		 */
-		if (dm_device.ha_waiting)
-			wait_for_completion_timeout(&dm_device.ol_waitevent,
-						    5*HZ);
 		post_status(&dm_device);
 	}
 
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index b9c5522b8a68..f54c621195b6 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -404,7 +404,7 @@  static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, false);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index db107fa50ca1..fce961de8771 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -355,7 +355,7 @@  static enum bp_state reserve_additional_memory(void)
 	 * callers drop the mutex before trying again.
 	 */
 	mutex_unlock(&balloon_mutex);
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, true);
 	mutex_lock(&balloon_mutex);
 
 	if (rc) {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f69c21a..a72f7f64ee26 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -100,8 +100,6 @@  extern void __online_page_free(struct page *page);
 
 extern int try_online_node(int nid);
 
-extern bool memhp_auto_online;
-
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
 extern int arch_remove_memory(u64 start, u64 size);
@@ -272,7 +270,7 @@  static inline void remove_memory(int nid, u64 start, u64 size) {}
 
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int add_memory(int nid, u64 start, u64 size);
+extern int add_memory(int nid, u64 start, u64 size, bool online);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb969dc..a64a3bca43d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,22 +193,6 @@  config MEMORY_HOTPLUG_SPARSE
 	def_bool y
 	depends on SPARSEMEM && MEMORY_HOTPLUG
 
-config MEMORY_HOTPLUG_DEFAULT_ONLINE
-        bool "Online the newly added memory blocks by default"
-        default n
-        depends on MEMORY_HOTPLUG
-        help
-	  This option sets the default policy setting for memory hotplug
-	  onlining policy (/sys/devices/system/memory/auto_online_blocks) which
-	  determines what happens to newly added memory regions. Policy setting
-	  can always be changed at runtime.
-	  See Documentation/memory-hotplug.txt for more information.
-
-	  Say Y here if you want all hot-plugged memory blocks to appear in
-	  'online' state by default.
-	  Say N here if you want the default policy to keep all hot-plugged
-	  memory blocks in 'offline' state.
-
 config MEMORY_HOTREMOVE
 	bool "Allow for memory hot remove"
 	select MEMORY_ISOLATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c35dd1976574..8520c9166f47 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -78,24 +78,6 @@  static struct {
 #define memhp_lock_acquire()      lock_map_acquire(&mem_hotplug.dep_map)
 #define memhp_lock_release()      lock_map_release(&mem_hotplug.dep_map)
 
-#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
-bool memhp_auto_online;
-#else
-bool memhp_auto_online = true;
-#endif
-EXPORT_SYMBOL_GPL(memhp_auto_online);
-
-static int __init setup_memhp_default_state(char *str)
-{
-	if (!strcmp(str, "online"))
-		memhp_auto_online = true;
-	else if (!strcmp(str, "offline"))
-		memhp_auto_online = false;
-
-	return 1;
-}
-__setup("memhp_default_state=", setup_memhp_default_state);
-
 void get_online_mems(void)
 {
 	might_sleep();
@@ -1420,7 +1402,7 @@  int __ref add_memory_resource(int nid, struct resource *res, bool online)
 }
 EXPORT_SYMBOL_GPL(add_memory_resource);
 
-int __ref add_memory(int nid, u64 start, u64 size)
+int __ref add_memory(int nid, u64 start, u64 size, bool online)
 {
 	struct resource *res;
 	int ret;
@@ -1429,7 +1411,7 @@  int __ref add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, online);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;