
[RFCv2,0/4] mm/memory_hotplug: Introduce memory block types

Message ID 20181130175922.10425-1-david@redhat.com
Series mm/memory_hotplug: Introduce memory block types

Message

David Hildenbrand Nov. 30, 2018, 5:59 p.m. UTC
This is the second approach, introducing more meaningful memory block
types and not changing online behavior in the kernel. It is based on
latest linux-next.

As we found out during discussion, user space should always handle onlining
of memory, in any case. However, in order to make smart decisions in user
space about if and how to online memory, we have to export more information
about memory blocks. This way, we can formulate rules in user space.
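As a minimal sketch of the user-space side, the helper below reads one attribute of a memory block from sysfs. It assumes the per-block `type` file proposed in this series, which only exists with these patches applied; the helper name and buffer sizes are illustrative only.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* sysfs directory holding one memoryN device per memory block */
#define MEMORY_SYSFS_DIR "/sys/devices/system/memory"

/*
 * Hypothetical helper: read a one-line attribute of a memory block, e.g.
 * read_memblock_attr("memory90", "type", buf, sizeof(buf)).
 * Returns 0 on success and -1 if the attribute cannot be read,
 * e.g. on a kernel without the proposed "type" file.
 */
int read_memblock_attr(const char *block, const char *attr,
		       char *buf, size_t len)
{
	char path[256];
	FILE *f;
	int n;

	if (len == 0 || len > INT_MAX)
		return -1;

	n = snprintf(path, sizeof(path), "%s/%s/%s",
		     MEMORY_SYSFS_DIR, block, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;		/* path would be truncated */

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';	/* drop the trailing newline */
	return 0;
}
```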

One such piece of information is the type of memory block we are talking about.
This helps to answer some questions like:
- Does this memory block belong to a DIMM?
- Can this DIMM theoretically ever be unplugged again?
- Was this memory added by a balloon driver that will rely on balloon
  inflation to remove chunks of that memory again? Which zone is advised?
- Is this special standby memory on s390x that is usually not automatically
  onlined?

In short, it helps to answer, to some extent (excluding zone imbalances):
- Should I online this memory block?
- To which zone should I online this memory block?
... of course special use cases will result in different answers. But that's
why user space has control of onlining memory.
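A rule of this kind could look like the following sketch. Only `boot` and `dimm` appear in the example output of this cover letter; `balloon` and `standby` are hypothetical placeholders for the balloon and s390x standby cases, and the mapping is just one possible policy, not a recommendation from this series. The returned strings are values accepted by a memory block's `state` file.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical policy: map a memory block type to the string to write
 * into /sys/devices/system/memory/memoryN/state.
 * NULL means "leave the block alone".
 */
const char *suggest_online_action(const char *type)
{
	if (strcmp(type, "boot") == 0)
		return NULL;			/* already onlined during boot */
	if (strcmp(type, "dimm") == 0)
		return "online_movable";	/* keep the DIMM unpluggable */
	if (strcmp(type, "balloon") == 0)	/* hypothetical value */
		return "online_kernel";		/* removed via inflation, not offlining */
	if (strcmp(type, "standby") == 0)	/* hypothetical value */
		return NULL;			/* s390x standby: not onlined automatically */
	return "online";			/* unknown type: kernel's default zone */
}
```

Returning `online_movable` for DIMMs trades kernel-usable memory for the ability to offline the block again later; this is exactly the zone-imbalance question that such a rule leaves to user space.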

More details can be found in Patch 1 and Patch 3.
Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.


Example:
$ udevadm info -q all -a /sys/devices/system/memory/memory0
	KERNEL=="memory0"
	SUBSYSTEM=="memory"
	DRIVER==""
	ATTR{online}=="1"
	ATTR{phys_device}=="0"
	ATTR{phys_index}=="00000000"
	ATTR{removable}=="0"
	ATTR{state}=="online"
	ATTR{type}=="boot"
	ATTR{valid_zones}=="none"
$ udevadm info -q all -a /sys/devices/system/memory/memory90
	KERNEL=="memory90"
	SUBSYSTEM=="memory"
	DRIVER==""
	ATTR{online}=="1"
	ATTR{phys_device}=="0"
	ATTR{phys_index}=="0000005a"
	ATTR{removable}=="1"
	ATTR{state}=="online"
	ATTR{type}=="dimm"
	ATTR{valid_zones}=="Normal"
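To script against output like the above, each `ATTR{name}=="value"` line can be split into a name/value pair. This is a minimal sketch that assumes exactly the line shape shown here; udev rules themselves would simply match `ATTR{type}=="dimm"`. `parse_udev_attr()` is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split one ATTR{name}=="value" line, as printed by `udevadm info -a`,
 * into name and value. Leading whitespace is skipped.
 * Returns 0 on success, -1 if the line has another shape or a buffer
 * is too small.
 */
int parse_udev_attr(const char *line, char *name, size_t nlen,
		    char *value, size_t vlen)
{
	const char *p = line + strspn(line, " \t");
	const char *name_end, *val, *val_end;

	if (strncmp(p, "ATTR{", 5) != 0)
		return -1;
	p += 5;

	name_end = strchr(p, '}');
	if (!name_end || strncmp(name_end, "}==\"", 4) != 0)
		return -1;
	val = name_end + 4;
	val_end = strchr(val, '"');
	if (!val_end)
		return -1;

	/* leave room for the terminating NUL in both buffers */
	if ((size_t)(name_end - p) >= nlen || (size_t)(val_end - val) >= vlen)
		return -1;

	memcpy(name, p, name_end - p);
	name[name_end - p] = '\0';
	memcpy(value, val, val_end - val);
	value[val_end - val] = '\0';
	return 0;
}
```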


RFC -> RFCv2:
- Now also taking care of PPC (somehow missed it :/ )
- Split the series up to some degree (some ideas on how to split up patch 3
  would be very welcome)
- Introduce more memory block types. It turns out that abstracting too much was
  rather confusing and not helpful. Properly document them.

Notes:
- I wanted to convert the enum of types into a named enum but this
  provoked all kinds of different errors. For now, I am doing it just like
  the other types (e.g. online_type) we are using in that context.
- The "removable" property should never have been named like that. It
  should have been "offlinable". Can we still rename that? E.g. boot memory
  is sometimes marked as removable ...
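For illustration, the approach from the first note (constants from an anonymous enum stored in a plain `int`, in the style of `online_type`) might look like this. All `MEMORY_BLOCK_*` names, the struct and the show helper are hypothetical and are not the definitions from the patches; a named enum would instead let the field be declared with its own enum type.

```c
#include <stddef.h>
#include <stdio.h>

/* Anonymous enum: the constants are plain ints (hypothetical names). */
enum {
	MEMORY_BLOCK_BOOT,
	MEMORY_BLOCK_DIMM,
	MEMORY_BLOCK_BALLOON,
	MEMORY_BLOCK_STANDBY,
	MEMORY_BLOCK_NR_TYPES,
};

static const char *const memory_block_type_names[MEMORY_BLOCK_NR_TYPES] = {
	[MEMORY_BLOCK_BOOT]    = "boot",
	[MEMORY_BLOCK_DIMM]    = "dimm",
	[MEMORY_BLOCK_BALLOON] = "balloon",
	[MEMORY_BLOCK_STANDBY] = "standby",
};

/* Hypothetical stand-in for a memory block device. */
struct memory_block_example {
	int type;	/* one of MEMORY_BLOCK_*, stored as a plain int */
};

/* Format the type the way a sysfs show() callback would: "<name>\n". */
int type_show(const struct memory_block_example *mem, char *buf, size_t len)
{
	const char *name = "unknown";

	/* an int field can hold anything, so range-check before indexing */
	if (mem->type >= 0 && mem->type < MEMORY_BLOCK_NR_TYPES)
		name = memory_block_type_names[mem->type];
	return snprintf(buf, len, "%s\n", name);
}
```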

David Hildenbrand (4):
  mm/memory_hotplug: Introduce memory block types
  mm/memory_hotplug: Replace "bool want_memblock" by "int type"
  mm/memory_hotplug: Introduce and use more memory types
  mm/memory_hotplug: Drop MEMORY_TYPE_UNSPECIFIED

 arch/ia64/mm/init.c                           |  4 +-
 arch/powerpc/mm/mem.c                         |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c     |  9 +--
 .../platforms/pseries/hotplug-memory.c        |  7 +-
 arch/s390/mm/init.c                           |  4 +-
 arch/sh/mm/init.c                             |  4 +-
 arch/x86/mm/init_32.c                         |  4 +-
 arch/x86/mm/init_64.c                         |  8 +--
 drivers/acpi/acpi_memhotplug.c                | 16 ++++-
 drivers/base/memory.c                         | 60 ++++++++++++++--
 drivers/hv/hv_balloon.c                       |  3 +-
 drivers/s390/char/sclp_cmd.c                  |  3 +-
 drivers/xen/balloon.c                         |  2 +-
 include/linux/memory.h                        | 69 ++++++++++++++++++-
 include/linux/memory_hotplug.h                | 18 ++---
 kernel/memremap.c                             |  6 +-
 mm/memory_hotplug.c                           | 29 ++++----
 17 files changed, 194 insertions(+), 56 deletions(-)

Comments

Wei Yang Dec. 1, 2018, 12:48 a.m. UTC | #1
On Fri, Nov 30, 2018 at 06:59:18PM +0100, David Hildenbrand wrote:
>[...]
>- The "removable" property should never have been named like that. It
>  should have been "offlinable". Can we still rename that? E.g. boot memory
>  is sometimes marked as removable ...
>

This makes sense to me. "Remove" usually describes the physical hotplug
phase, if I am correct.
David Hildenbrand Dec. 20, 2018, 12:58 p.m. UTC | #2
On 30.11.18 18:59, David Hildenbrand wrote:
> [...]


Any feedback regarding the suggested block types would be very much
appreciated!
Michal Hocko Dec. 20, 2018, 1:08 p.m. UTC | #3
On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
> On 30.11.18 18:59, David Hildenbrand wrote:
> > [...]
> 
> 
> Any feedback regarding the suggested block types would be very much
> appreciated!

I still do not like this much to be honest. I just didn't get to think
through this properly. My fear is that this is conflating an actual API
with the current implementation and as such will cause problems in
future. But I haven't really looked into your patches closely so I might
be wrong. Anyway, I won't be able to look into it by the end of the year.
David Hildenbrand Dec. 20, 2018, 1:16 p.m. UTC | #4
On 20.12.18 14:08, Michal Hocko wrote:
> On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
>> On 30.11.18 18:59, David Hildenbrand wrote:
>>> [...]
>>
>>
>> Any feedback regarding the suggested block types would be very much
>> appreciated!
> 
> I still do not like this much to be honest. I just didn't get to think
> through this properly. My fear is that this is conflating an actual API
> with the current implementation and as such will cause problems in
> future. But I haven't really looked into your patches closely so I might
> be wrong. Anyway I won't be able to look into it by the end of year.
> 

I guess as long as we have memory block devices and we expect user space
to make a decision, we will have this API and the problems involved.

I am open for alternatives, and as I said, any feedback on how to sort
this out will be highly appreciated.

I'll be on vacation for the next two weeks, so this can wait. Just
wanted to note that I am still interested in feedback :)
David Hildenbrand March 27, 2019, 4:03 p.m. UTC | #5
On 20.12.18 14:08, Michal Hocko wrote:
> On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
>> On 30.11.18 18:59, David Hildenbrand wrote:
>>> [...]
>>
>>
>> Any feedback regarding the suggested block types would be very much
>> appreciated!
> 
> I still do not like this much to be honest. I just didn't get to think
> through this properly. My fear is that this is conflating an actual API
> with the current implementation and as such will cause problems in
> future. But I haven't really looked into your patches closely so I might
> be wrong. Anyway I won't be able to look into it by the end of year.
> 

So I started to think about this again, and I guess somehow exposing the
identity of the device driver that added the memory section could be
sufficient.

E.g. "hyperv", "xen", "acpi", "sclp", "virtio-mem" ...

Via separate device driver interfaces, other information about the
memory could be exposed (e.g. for ACPI: which memory devices belong to
one physical device). So stuff would not have to be centered around
/sys/devices/system/memory/, uglifying it for special cases.

We would have to write udev rules to deal with these values, which should
be easy. If no DRIVER is given, it is simply memory that was detected
during boot. ACPI changing the DRIVER (from no DRIVER -> ACPI) might be
tricky, but I guess it could be done.
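User space can already see which driver, if any, is bound to a device through the standard sysfs `driver` symlink, which is what udev's `DRIVER` key reflects. A minimal sketch, assuming memory block devices would expose a bound driver the same way as other devices; `memblock_driver()` is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: look up the driver bound to a memory block via the
 * sysfs "driver" symlink (.../memoryN/driver -> .../drivers/<name>).
 * Stores the driver name in @name, or "" if no driver is bound (today's
 * situation; under this proposal it would mean memory detected during
 * boot). Returns 0 on success, -1 on error or if the block does not exist.
 */
int memblock_driver(const char *block, char *name, size_t len)
{
	char dir[256], link[300], target[256];
	const char *base;
	ssize_t n;
	int ret;

	if (len == 0)
		return -1;

	ret = snprintf(dir, sizeof(dir), "/sys/devices/system/memory/%s", block);
	if (ret < 0 || (size_t)ret >= sizeof(dir) || access(dir, F_OK) != 0)
		return -1;			/* no such memory block */

	/* dir is shorter than 256 bytes, so this always fits */
	snprintf(link, sizeof(link), "%s/driver", dir);
	n = readlink(link, target, sizeof(target) - 1);
	if (n < 0) {
		if (errno != ENOENT)
			return -1;
		name[0] = '\0';			/* no "driver" link: nothing bound */
		return 0;
	}
	target[n] = '\0';			/* readlink() does not NUL-terminate */

	base = strrchr(target, '/');
	base = base ? base + 1 : target;	/* last path component */
	if (strlen(base) >= len)
		return -1;
	memcpy(name, base, strlen(base) + 1);
	return 0;
}
```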

Now, the question would be how to get the DRIVER value in there. Adding
a bunch of fake device drivers would work; however, this might get a
little messy ... and then there is unbinding and rebinding, which can be
triggered by user space. Things to care about? Most probably not.