Message ID | 20220913195508.3511038-1-opendmb@gmail.com (mailing list archive)
---|---
Series | mm: introduce Designated Movable Blocks
On Tue, Sep 13, 2022 at 2:57 PM Doug Berger <opendmb@gmail.com> wrote: > > MOTIVATION: > Some Broadcom devices (e.g. 7445, 7278) contain multiple memory > controllers with each mapped in a different address range within > a Uniform Memory Architecture. Some users of these systems have > expressed the desire to locate ZONE_MOVABLE memory on each > memory controller to allow user space intensive processing to > make better use of the additional memory bandwidth. > Unfortunately, the historical monotonic layout of zones would > mean that if the lowest addressed memory controller contains > ZONE_MOVABLE memory then all of the memory available from > memory controllers at higher addresses must also be in the > ZONE_MOVABLE zone. This would force all kernel memory accesses > onto the lowest addressed memory controller and significantly > reduce the amount of memory available for non-movable > allocations. Why are you sending kernel patches to the Devicetree specification list? Rob
On 9/14/2022 6:21 AM, Rob Herring wrote: > On Tue, Sep 13, 2022 at 2:57 PM Doug Berger <opendmb@gmail.com> wrote: >> >> MOTIVATION: >> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory >> controllers with each mapped in a different address range within >> a Uniform Memory Architecture. Some users of these systems have >> expressed the desire to locate ZONE_MOVABLE memory on each >> memory controller to allow user space intensive processing to >> make better use of the additional memory bandwidth. >> Unfortunately, the historical monotonic layout of zones would >> mean that if the lowest addressed memory controller contains >> ZONE_MOVABLE memory then all of the memory available from >> memory controllers at higher addresses must also be in the >> ZONE_MOVABLE zone. This would force all kernel memory accesses >> onto the lowest addressed memory controller and significantly >> reduce the amount of memory available for non-movable >> allocations. > > Why are you sending kernel patches to the Devicetree specification list? > > Rob My apologies if this is a problem. No offense was intended. My process has been to run my patches through get_maintainers.pl to get the list of addresses to copy on submissions and my 0016-dt-bindings-reserved-memory-introduce-designated-mov.patch solicited the '- <devicetree-spec@vger.kernel.org>' address. My preference when reviewing is to receive an entire patch set to understand the context of an individual commit, but I can certainly understand that others may have different preferences. It was my understanding that the Devicetree specification list was part of the kernel (e.g. @vger.kernel.org) and would be willing to receive patches that might be of relevance to it. I am inexperienced with yaml and devicetree processes in general so I have tried to lean on the examples of other reserved-memory node bindings for help. There is much to learn and I am happy to modify my process to better accommodate your needs. Regards, Doug
On Wed, Sep 14, 2022 at 11:57 AM Doug Berger <opendmb@gmail.com> wrote: > > On 9/14/2022 6:21 AM, Rob Herring wrote: > > On Tue, Sep 13, 2022 at 2:57 PM Doug Berger <opendmb@gmail.com> wrote: > >> > >> MOTIVATION: > >> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory > >> controllers with each mapped in a different address range within > >> a Uniform Memory Architecture. Some users of these systems have > >> expressed the desire to locate ZONE_MOVABLE memory on each > >> memory controller to allow user space intensive processing to > >> make better use of the additional memory bandwidth. > >> Unfortunately, the historical monotonic layout of zones would > >> mean that if the lowest addressed memory controller contains > >> ZONE_MOVABLE memory then all of the memory available from > >> memory controllers at higher addresses must also be in the > >> ZONE_MOVABLE zone. This would force all kernel memory accesses > >> onto the lowest addressed memory controller and significantly > >> reduce the amount of memory available for non-movable > >> allocations. > > > > Why are you sending kernel patches to the Devicetree specification list? > > > > Rob > My apologies if this is a problem. No offense was intended. None taken. Just trying to keep a low traffic list low traffic. > My process has been to run my patches through get_maintainers.pl to get > the list of addresses to copy on submissions and my > 0016-dt-bindings-reserved-memory-introduce-designated-mov.patch > solicited the > '- <devicetree-spec@vger.kernel.org>' address. Yeah, I see that now. That needs to be a person for a specific binding. The only bindings using the list should be targeting the dtschema repo. (And even those are a person ideally.) Rob
Hi Dough, I have some high-level questions. > MOTIVATION: > Some Broadcom devices (e.g. 7445, 7278) contain multiple memory > controllers with each mapped in a different address range within > a Uniform Memory Architecture. Some users of these systems have How large are these areas typically? How large are they in comparison to other memory in the system? How is this memory currently presented to the system? > expressed the desire to locate ZONE_MOVABLE memory on each > memory controller to allow user space intensive processing to > make better use of the additional memory bandwidth. Can you share some more how exactly ZONE_MOVABLE would help here to make better use of the memory bandwidth? > Unfortunately, the historical monotonic layout of zones would > mean that if the lowest addressed memory controller contains > ZONE_MOVABLE memory then all of the memory available from > memory controllers at higher addresses must also be in the > ZONE_MOVABLE zone. This would force all kernel memory accesses > onto the lowest addressed memory controller and significantly > reduce the amount of memory available for non-movable > allocations. We do have code that relies on zones during boot to not overlap within a single node. > > The main objective of this patch set is therefore to allow a > block of memory to be designated as part of the ZONE_MOVABLE > zone where it will always only be used by the kernel page > allocator to satisfy requests for movable pages. The term > Designated Movable Block is introduced here to represent such a > block. The favored implementation allows modification of the Sorry to say, but that term is rather suboptimal to describe what you are doing here. You simply have some system RAM you'd want to have managed by ZONE_MOVABLE, no? > 'movablecore' kernel parameter to allow specification of a base > address and support for multiple blocks. The existing > 'movablecore' mechanisms are retained. Other mechanisms based on > device tree are also included in this set. > > BACKGROUND: > NUMA architectures support distributing movablecore memory > across each node, but it is undesirable to introduce the > overhead and complexities of NUMA on systems that don't have a > Non-Uniform Memory Architecture. How exactly would that look like? I think I am missing something :) > > Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option") > also depends on zone overlap to support sytems with multiple > mirrored ranges. IIRC, zones will not overlap within a single node. > > Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions") > embraced overlapped zones for memory hotplug. Yes, after boot. > > This commit set follows their lead to allow the ZONE_MOVABLE > zone to overlap other zones while spanning the pages from the > lowest Designated Movable Block to the end of the node. > Designated Movable Blocks are made absent from overlapping zones > and present within the ZONE_MOVABLE zone. > > I initially investigated an implementation using a Designated > Movable migrate type in line with comments[1] made by Mel Gorman > regarding a "sticky" MIGRATE_MOVABLE type to avoid using > ZONE_MOVABLE. However, this approach was riskier since it was > much more instrusive on the allocation paths. Ultimately, the > progress made by the memory hotplug folks to expand the > ZONE_MOVABLE functionality convinced me to follow this approach. > > OPPORTUNITIES: > There have been many attempts to modify the behavior of the > kernel page allocators use of CMA regions. 
This implementation > of Designated Movable Blocks creates an opportunity to repurpose > the CMA allocator to operate on ZONE_MOVABLE memory that the > kernel page allocator can use more agressively, without > affecting the existing CMA implementation. It is hoped that the > "shared-dmb-pool" approach included here will be useful in cases > where memory sharing is more important than allocation latency. > > CMA introduced a paradigm where multiple allocators could > operate on the same region of memory, and that paradigm can be > extended to Designated Movable Blocks as well. I was interested > in using kernel resource management as a mechanism for exposing > Designated Movable Block resources (e.g. /proc/iomem) that would > be used by the kernel page allocator like any other ZONE_MOVABLE > memory, but could be claimed by an alternative allocator (e.g. > CMA). Unfortunately, this becomes complicated because the kernel > resource implementation varies materially across different > architectures and I do not require this capability so I have > deferred that. Why can't we simply designate these regions as CMA regions? Why do we have to start using ZONE_MOVABLE for them?
On 9/19/2022 2:00 AM, David Hildenbrand wrote: > Hi Dough, > > I have some high-level questions. Thanks for your interest. I will attempt to answer them. > >> MOTIVATION: >> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory >> controllers with each mapped in a different address range within >> a Uniform Memory Architecture. Some users of these systems have > > How large are these areas typically? > > How large are they in comparison to other memory in the system? > > How is this memory currently presented to the system? I'm not certain what is typical because these systems are highly configurable and Broadcom's customers have different ideas about application processing. The 7278 device has four ARMv8 CPU cores in an SMP cluster and two memory controllers (MEMCs). Each MEMC is capable of controlling up to 8GB of DRAM. An example 7278 system might have 1GB on each controller, so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and 1GB on MEMC1 at 0x300000000-0x33FFFFFFF. The Designated Movable Block concept introduced here has the potential to offer useful services to different constituencies. I tried to highlight this in my V1 patch set with the hope of attracting some interest, but it can complicate the overall discussion, so I would like to maybe narrow the discussion here. It may be good to keep them in mind when assessing the overall value, but perhaps the "other opportunities" can be covered as a follow on discussion. The base capability described in commits 7-15 of this V1 patch set is to allow a 'movablecore' block to be created at a particular base address rather than solely at the end of addressable memory. > >> expressed the desire to locate ZONE_MOVABLE memory on each >> memory controller to allow user space intensive processing to >> make better use of the additional memory bandwidth. > > Can you share some more how exactly ZONE_MOVABLE would help here to make > better use of the memory bandwidth? ZONE_MOVABLE memory is effectively unusable by the kernel. It can be used by user space applications through both the page allocator and the Hugetlbfs. If a large 'movablecore' allocation is defined and it can only be located at the end of addressable memory then it will always be located on MEMC1 of a 7278 system. This will create a tendency for user space accesses to consume more bandwidth on the MEMC1 memory controller and kernel space accesses to consume more bandwidth on MEMC0. A more even distribution of ZONE_MOVABLE memory between the available memory controllers in theory makes more memory bandwidth available to user space intensive loads. > >> Unfortunately, the historical monotonic layout of zones would >> mean that if the lowest addressed memory controller contains >> ZONE_MOVABLE memory then all of the memory available from >> memory controllers at higher addresses must also be in the >> ZONE_MOVABLE zone. This would force all kernel memory accesses >> onto the lowest addressed memory controller and significantly >> reduce the amount of memory available for non-movable >> allocations. > > We do have code that relies on zones during boot to not overlap within a > single node. I believe my changes address all such reliance, but if you are aware of something I missed please let me know. > >> >> The main objective of this patch set is therefore to allow a >> block of memory to be designated as part of the ZONE_MOVABLE >> zone where it will always only be used by the kernel page >> allocator to satisfy requests for movable pages. 
The term >> Designated Movable Block is introduced here to represent such a >> block. The favored implementation allows modification of the > > Sorry to say, but that term is rather suboptimal to describe what you > are doing here. You simply have some system RAM you'd want to have > managed by ZONE_MOVABLE, no? That may be true, but I found it superior to the 'sticky' movable terminology put forth by Mel Gorman ;). I'm happy to entertain alternatives, but they may not be as easy to find as you think. > >> 'movablecore' kernel parameter to allow specification of a base >> address and support for multiple blocks. The existing >> 'movablecore' mechanisms are retained. Other mechanisms based on >> device tree are also included in this set. >> >> BACKGROUND: >> NUMA architectures support distributing movablecore memory >> across each node, but it is undesirable to introduce the >> overhead and complexities of NUMA on systems that don't have a >> Non-Uniform Memory Architecture. > > How exactly would that look like? I think I am missing something :) The notion would be to consider each memory controller as a separate node, but as stated it is not desirable. > >> >> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror >> option") >> also depends on zone overlap to support sytems with multiple >> mirrored ranges. > > IIRC, zones will not overlap within a single node. I believe the implementation for kernelcore=mirror allows for the possibility of multiple non-adjacent mirrored ranges in a single node and accommodates the zone overlap. > >> >> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions") >> embraced overlapped zones for memory hotplug. > > Yes, after boot. > >> >> This commit set follows their lead to allow the ZONE_MOVABLE >> zone to overlap other zones while spanning the pages from the >> lowest Designated Movable Block to the end of the node. >> Designated Movable Blocks are made absent from overlapping zones >> and present within the ZONE_MOVABLE zone. >> >> I initially investigated an implementation using a Designated >> Movable migrate type in line with comments[1] made by Mel Gorman >> regarding a "sticky" MIGRATE_MOVABLE type to avoid using >> ZONE_MOVABLE. However, this approach was riskier since it was >> much more instrusive on the allocation paths. Ultimately, the >> progress made by the memory hotplug folks to expand the >> ZONE_MOVABLE functionality convinced me to follow this approach. >> >> OPPORTUNITIES: >> There have been many attempts to modify the behavior of the >> kernel page allocators use of CMA regions. This implementation >> of Designated Movable Blocks creates an opportunity to repurpose >> the CMA allocator to operate on ZONE_MOVABLE memory that the >> kernel page allocator can use more agressively, without >> affecting the existing CMA implementation. It is hoped that the >> "shared-dmb-pool" approach included here will be useful in cases >> where memory sharing is more important than allocation latency. >> >> CMA introduced a paradigm where multiple allocators could >> operate on the same region of memory, and that paradigm can be >> extended to Designated Movable Blocks as well. I was interested >> in using kernel resource management as a mechanism for exposing >> Designated Movable Block resources (e.g. /proc/iomem) that would >> be used by the kernel page allocator like any other ZONE_MOVABLE >> memory, but could be claimed by an alternative allocator (e.g. >> CMA). 
Unfortunately, this becomes complicated because the kernel >> resource implementation varies materially across different >> architectures and I do not require this capability so I have >> deferred that. > > Why can't we simply designate these regions as CMA regions? We and others have encountered significant performance issues when large CMA regions are used. There are significant restrictions on the page allocator's use of MIGRATE_CMA pages and the memory subsystem works very hard to keep about half of the memory in the CMA region free. There have been attempts to patch the CMA implementation to alter this behavior (for example the set I referenced Mel's response to in [1]), but there are users that desire the current behavior. > > Why do we have to start using ZONE_MOVABLE for them? One of the "other opportunities" for Designated Movable Blocks is to allow CMA to allocate from a DMB as an alternative. This would allow current users to continue using CMA as they want, but would allow users (e.g. hugetlb_cma) that are not sensitive to the allocation latency to let the kernel page allocator make more complete use (i.e. waste less) of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a DMB. > Thanks for your consideration, Dough Baker ... I mean Doug Berger :).
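For concreteness, the cover letter quoted above describes extending the 'movablecore' kernel parameter to accept a base address and multiple blocks. A minimal sketch of what that could look like for the example 7278 layout follows; the comma-separated size@base form is assumed from that description, and the exact syntax is defined by the series itself rather than reproduced here:

    # Illustrative only: one 256MB Designated Movable Block on each
    # memory controller of the example 7278 system, so ZONE_MOVABLE
    # memory is spread across MEMC0 (0x40000000-0x7FFFFFFF) and
    # MEMC1 (0x300000000-0x33FFFFFFF) instead of sitting solely at
    # the end of addressable memory.
    movablecore=256M@0x60000000,256M@0x320000000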
Hi Doug, I only had time to skim through the patches and before diving in I'd like to clarify a few things. On Mon, Sep 19, 2022 at 06:03:55PM -0700, Doug Berger wrote: > On 9/19/2022 2:00 AM, David Hildenbrand wrote: > > > > How is this memory currently presented to the system? > > The 7278 device has four ARMv8 CPU cores in an SMP cluster and two memory > controllers (MEMCs). Each MEMC is capable of controlling up to 8GB of DRAM. > An example 7278 system might have 1GB on each controller, so an arm64 kernel > might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and 1GB on MEMC1 at > 0x300000000-0x33FFFFFFF. > > The base capability described in commits 7-15 of this V1 patch set is to > allow a 'movablecore' block to be created at a particular base address > rather than solely at the end of addressable memory. I think this capability is only useful when there is non-uniform access to different memory ranges. Otherwise it wouldn't matter where the movable pages reside. The system you describe looks quite NUMA to me, with two memory controllers, each for accessing a partial range of the available memory. > > > expressed the desire to locate ZONE_MOVABLE memory on each > > > memory controller to allow user space intensive processing to > > > make better use of the additional memory bandwidth. > > > > Can you share some more how exactly ZONE_MOVABLE would help here to make > > better use of the memory bandwidth? > > ZONE_MOVABLE memory is effectively unusable by the kernel. It can be used by > user space applications through both the page allocator and the Hugetlbfs. > If a large 'movablecore' allocation is defined and it can only be located at > the end of addressable memory then it will always be located on MEMC1 of a > 7278 system. This will create a tendency for user space accesses to consume > more bandwidth on the MEMC1 memory controller and kernel space accesses to > consume more bandwidth on MEMC0. A more even distribution of ZONE_MOVABLE > memory between the available memory controllers in theory makes more memory > bandwidth available to user space intensive loads. The theory makes perfect sense, but is there any practical evidence of improvement? Some benchmark results that illustrate the difference would be nice. > > > BACKGROUND: > > > NUMA architectures support distributing movablecore memory > > > across each node, but it is undesirable to introduce the > > > overhead and complexities of NUMA on systems that don't have a > > > Non-Uniform Memory Architecture. > > > > How exactly would that look like? I think I am missing something :) > > The notion would be to consider each memory controller as a separate node, > but as stated it is not desirable. Why? > Thanks for your consideration, > Dough Baker ... I mean Doug Berger :).
On 9/23/2022 4:19 AM, Mike Rapoport wrote: > Hi Doug, > > I only had time to skim through the patches and before diving in I'd like > to clarify a few things. Thanks for taking the time. Any input is appreciated. > > On Mon, Sep 19, 2022 at 06:03:55PM -0700, Doug Berger wrote: >> On 9/19/2022 2:00 AM, David Hildenbrand wrote: >>> >>> How is this memory currently presented to the system? >> >> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two memory >> controllers (MEMCs). Each MEMC is capable of controlling up to 8GB of DRAM. >> An example 7278 system might have 1GB on each controller, so an arm64 kernel >> might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and 1GB on MEMC1 at >> 0x300000000-0x33FFFFFFF. >> >> The base capability described in commits 7-15 of this V1 patch set is to >> allow a 'movablecore' block to be created at a particular base address >> rather than solely at the end of addressable memory. > > I think this capability is only useful when there is non-uniform access to > different memory ranges. Otherwise it wouldn't matter where the movable > pages reside. I think that is a fair assessment of the described capability. However, the non-uniform access is a result of the current Linux architecture rather than the hardware architecture. > The system you describe looks quite NUMA to me, with two > memory controllers, each for accessing a partial range of the available > memory. NUMA was created to deal with non-uniformity in the hardware architecture where a CPU and/or other hardware device can make more efficient use of some nodes than other nodes. NUMA attempts to allocate from "closer" nodes to improve the operational efficiency of the system. If we consider how an arm64 architecture Linux kernel will apply zones to the above example system we find that Linux will place MEMC0 in ZONE_DMA and MEMC1 in ZONE_NORMAL. This allows both kernel and user space to compete for bandwidth on MEMC1, but largely excludes user space from MEMC0. It is possible for user space to get memory from ZONE_DMA through fallback when ZONE_NORMAL has been consumed, but there is a pretty clear bias against user space use of MEMC0. This non-uniformity doesn't come from the bus architecture since each CPU has equal costs to access MEMC0 and MEMC1. They compete for bandwidth, but there is no hardware bias for one node over another. Creating ZONE_MOVABLE memory on MEMC0 can help correct for the Linux bias. > >>>> expressed the desire to locate ZONE_MOVABLE memory on each >>>> memory controller to allow user space intensive processing to >>>> make better use of the additional memory bandwidth. >>> >>> Can you share some more how exactly ZONE_MOVABLE would help here to make >>> better use of the memory bandwidth? >> >> ZONE_MOVABLE memory is effectively unusable by the kernel. It can be used by >> user space applications through both the page allocator and the Hugetlbfs. >> If a large 'movablecore' allocation is defined and it can only be located at >> the end of addressable memory then it will always be located on MEMC1 of a >> 7278 system. This will create a tendency for user space accesses to consume >> more bandwidth on the MEMC1 memory controller and kernel space accesses to >> consume more bandwidth on MEMC0. A more even distribution of ZONE_MOVABLE >> memory between the available memory controllers in theory makes more memory >> bandwidth available to user space intensive loads. > > The theory makes perfect sense, but is there any practical evidence of > improvement? 
> Some benchmark results that illustrate the difference would be nice. I agree that benchmark results would be nice. Unfortunately, I am not part of the constituency that uses these Linux features so I have no representative user space work loads to measure. I can only say that I was asked to implement this capability, this is the approach I took, and customers of Broadcom are making use of it. I am submitting it upstream with the hope that: its/my sanity can be better reviewed, it will not get broken by future changes in the kernel, and it will be useful to others. This "narrow" capability may have limited value to others, but it should not create issues for those that do not actively wish to use it. I would hope that makes it easier to review and get accepted. However, I believe "other opportunities" exist that may have broader appeal so I have suggested some along with the "narrow" capability to hopefully give others motivation to consider accepting the narrow capability and to help shape how these "other capabilities" should be implemented. One "other opportunity" that I have realized may be more interesting than I originally anticipated comes from the recognition that the Devicetree Specification includes support for Reserved Memory regions that can contain the 'reusable' property to allow the OS to make use of the memory. Currently, Linux only takes advantage of that capability for reserved memory nodes that are compatible with 'shared-dma-pool' where CMA is used to allow the memory to be used by the OS and by device drivers. CMA is a great concept, but we have observed shortcomings that become more apparent as the size of the CMA region grows. Specifically, the Linux memory management works very hard to keep half of the CMA memory free. A number of submissions have been made over the years to alter the CMA implementation to allow more aggressive use of the memory by the OS, but there are users that desire the current behavior so the submissions have been rejected. No other types of reserved memory nodes can take advantage of sharing the memory with the Linux operating system because there is insufficient specification of how device drivers can reclaim the reserved memory when it is needed. The introduction of Designated Movable Block support provides a mechanism that would allow this capability to be realized. Because DMBs are in ZONE_MOVABLE their pages are reclaimable, and because they can be located anywhere they can satisfy DMA constraints of owning devices. In the simplest case, device drivers can use the dmb_intersects() function to determine whether their reserved memory range is within a DMB and can use the alloc_contig_range() function to reclaim the pages. This simple API could certainly be improved upon (e.g. the CMA allocator seems like an obvious choice), but it doesn't need to be defined by me so I would be happy to hear other people's ideas. > >>>> BACKGROUND: >>>> NUMA architectures support distributing movablecore memory >>>> across each node, but it is undesirable to introduce the >>>> overhead and complexities of NUMA on systems that don't have a >>>> Non-Uniform Memory Architecture. >>> >>> How exactly would that look like? I think I am missing something :) >> >> The notion would be to consider each memory controller as a separate node, >> but as stated it is not desirable. > > Why? In my opinion this is an inappropriate application of NUMA because the hardware does not impose any access non-uniformity to justify the complexity and overhead associated with NUMA. 
It would only be shoe-horned into the implementation to add some logical notion of memory nodes being associated with memory controllers. I would expect such an approach to receive a lot of push back from the Android Common Kernel users which may not be relevant to everyone, but is to many. Thanks for your consideration, -Doug
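The reclaim flow Doug sketches above (dmb_intersects() followed by alloc_contig_range()) might look roughly like the following from a driver's point of view. This is only a sketch: dmb_intersects() is introduced by this series and its header, signature, and return value are assumed here by analogy with zone_intersects(), while alloc_contig_range() and free_contig_range() are the existing page allocator interfaces.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/pfn.h>
    /* #include <linux/dmb.h>  -- assumed header for dmb_intersects() */

    /* Sketch: reclaim a driver-owned reserved range that sits inside a DMB. */
    static int example_reclaim_dmb_range(phys_addr_t base, phys_addr_t size)
    {
            unsigned long start_pfn = PHYS_PFN(base);
            unsigned long end_pfn = PHYS_PFN(base + size);
            int ret;

            /* Assumed helper from this series: does the range lie in a DMB? */
            if (!dmb_intersects(start_pfn, end_pfn))
                    return -EINVAL;

            /*
             * Migrate any movable data the page allocator has placed in the
             * range and hand the pages to the caller.
             */
            ret = alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
                                     GFP_KERNEL);
            if (ret)
                    return ret;

            /*
             * ... the driver now owns the range; it can be returned later
             * with free_contig_range(start_pfn, end_pfn - start_pfn).
             */
            return 0;
    }

Note that alloc_contig_range() is the same primitive CMA is built on, which is why pointing the CMA allocator at a DMB (the "shared-dmb-pool" idea from the cover letter) is a natural refinement of this raw interface.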
On 20.09.22 03:03, Doug Berger wrote: > On 9/19/2022 2:00 AM, David Hildenbrand wrote: >> Hi Dough, >> >> I have some high-level questions. > Thanks for your interest. I will attempt to answer them. > Hi Doug, sorry for the late reply, slowly catching up on mails. >> >>> MOTIVATION: >>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory >>> controllers with each mapped in a different address range within >>> a Uniform Memory Architecture. Some users of these systems have >> >> How large are these areas typically? >> >> How large are they in comparison to other memory in the system? >> >> How is this memory currently presented to the system? > I'm not certain what is typical because these systems are highly > configurable and Broadcom's customers have different ideas about > application processing. > > The 7278 device has four ARMv8 CPU cores in an SMP cluster and two > memory controllers (MEMCs). Each MEMC is capable of controlling up to > 8GB of DRAM. An example 7278 system might have 1GB on each controller, > so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and > 1GB on MEMC1 at 0x300000000-0x33FFFFFFF. > > The Designated Movable Block concept introduced here has the potential > to offer useful services to different constituencies. I tried to > highlight this in my V1 patch set with the hope of attracting some > interest, but it can complicate the overall discussion, so I would like > to maybe narrow the discussion here. It may be good to keep them in mind > when assessing the overall value, but perhaps the "other opportunities" > can be covered as a follow on discussion. > > The base capability described in commits 7-15 of this V1 patch set is to > allow a 'movablecore' block to be created at a particular base address > rather than solely at the end of addressable memory. > Just so we're on the same page: Having too much ZONE_MOVABLE memory (ratio compared to !ZONE_MOVABLE memory) is dangerous. Acceptable ratios highly depend on the target workload. An extreme example is memory-hungry applications that end up long-term pinning a lot of memory (e.g., VMs with SR-IO): we can run easily out of free memory in the !ZONE_MOVABLE zones and might not want ZONE_MOVABLE at all. So whatever we do, this should in general not be the kernel sole decision to make this memory any special and let ZONE_MOVABLE manage it. It's the same with CMA. "Heavy" CMA users require special configuration: hugetlb_cma is one prime example. >> >>> expressed the desire to locate ZONE_MOVABLE memory on each >>> memory controller to allow user space intensive processing to >>> make better use of the additional memory bandwidth. >> >> Can you share some more how exactly ZONE_MOVABLE would help here to make >> better use of the memory bandwidth? > ZONE_MOVABLE memory is effectively unusable by the kernel. It can be > used by user space applications through both the page allocator and the > Hugetlbfs. If a large 'movablecore' allocation is defined and it can Hugetlbfs not necessarily by all architectures. Some architectures don't support placing hugetlb pages on ZONE_MOVABLE (not migratable) and gigantic pages are special either way. > only be located at the end of addressable memory then it will always be > located on MEMC1 of a 7278 system. This will create a tendency for user > space accesses to consume more bandwidth on the MEMC1 memory controller > and kernel space accesses to consume more bandwidth on MEMC0. 
A more > even distribution of ZONE_MOVABLE memory between the available memory > controllers in theory makes more memory bandwidth available to user > space intensive loads. > Sorry to be dense, is this also about different memory access latency or just memory bandwidth? Do these memory areas have special/different performance characteristics? Using dedicated/fake NUMA nodes might be more in line with what CXL and PMEM are up to. Using ZONE_MOVABLE for that purpose feels a little bit like an abuse of the mechanism. To be clearer what I mean: We can place any movable allocations on ZONE_MOVABLE, including kernel allocations. User space allocations are just one example, and int he future we'll turn more and more allocations movable to be able to cope with bigger ZONE_MOVABLE demands due to DAX/CXL. I once looked into migrating user space page tables, just to give an example. >> >>> Unfortunately, the historical monotonic layout of zones would >>> mean that if the lowest addressed memory controller contains >>> ZONE_MOVABLE memory then all of the memory available from >>> memory controllers at higher addresses must also be in the >>> ZONE_MOVABLE zone. This would force all kernel memory accesses >>> onto the lowest addressed memory controller and significantly >>> reduce the amount of memory available for non-movable >>> allocations. >> >> We do have code that relies on zones during boot to not overlap within a >> single node. > I believe my changes address all such reliance, but if you are aware of > something I missed please let me know. > One example I'm aware of is drivers/base/memory.c:memory_block_add_nid() / early_node_zone_for_memory_block(). If we get it wrong, or actually have memory blocks that span multiple zones, we can no longer offline these memory blocks. We really wanted to avoid scanning the memmap for now and it seems to get the job done in environments we care about. >> >>> >>> The main objective of this patch set is therefore to allow a >>> block of memory to be designated as part of the ZONE_MOVABLE >>> zone where it will always only be used by the kernel page >>> allocator to satisfy requests for movable pages. The term >>> Designated Movable Block is introduced here to represent such a >>> block. The favored implementation allows modification of the >> >> Sorry to say, but that term is rather suboptimal to describe what you >> are doing here. You simply have some system RAM you'd want to have >> managed by ZONE_MOVABLE, no? > That may be true, but I found it superior to the 'sticky' movable > terminology put forth by Mel Gorman ;). I'm happy to entertain > alternatives, but they may not be as easy to find as you think. Especially the "blocks" part is confusing. Movable pageblocks? Movable Linux memory blocks? Note that the sticky movable *pageblocks* were a completely different concept than simply reusing ZONE_MOVABLE for some memory ranges. > >> >>> 'movablecore' kernel parameter to allow specification of a base >>> address and support for multiple blocks. The existing >>> 'movablecore' mechanisms are retained. Other mechanisms based on >>> device tree are also included in this set. >>> >>> BACKGROUND: >>> NUMA architectures support distributing movablecore memory >>> across each node, but it is undesirable to introduce the >>> overhead and complexities of NUMA on systems that don't have a >>> Non-Uniform Memory Architecture. >> >> How exactly would that look like? 
I think I am missing something :) > The notion would be to consider each memory controller as a separate > node, but as stated it is not desirable. > Doing it the DAX/CXL way would be to expose these memory ranges as daxdev instead, and letting the admin decide how to online these memory ranges when adding them to the buddy via the dax/kmem kernel module. That could mean that your booting with memory on MC0 only, and expose memory of MC1 via a daxdev, giving the admin the possibility do decide to which zone the memory should be onlined too. That would avoid most kernel code changes. >> >> Why can't we simply designate these regions as CMA regions? > We and others have encountered significant performance issues when large > CMA regions are used. There are significant restrictions on the page > allocator's use of MIGRATE_CMA pages and the memory subsystem works very > hard to keep about half of the memory in the CMA region free. There have > been attempts to patch the CMA implementation to alter this behavior > (for example the set I referenced Mel's response to in [1]), but there > are users that desire the current behavior. Optimizing that would be great, eventually making it configurable or selecting the behavior based on the actual CMA area sizes. > >> >> Why do we have to start using ZONE_MOVABLE for them? > One of the "other opportunities" for Designated Movable Blocks is to > allow CMA to allocate from a DMB as an alternative. This would allow > current users to continue using CMA as they want, but would allow users > (e.g. hugetlb_cma) that are not sensitive to the allocation latency to > let the kernel page allocator make more complete use (i.e. waste less) > of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE > so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a > DMB. The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end up on it. The biggest difference to CMA is that the CMA *owner* is able to place unmovable allocations on it. Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not acceptable as is. Using ZONE_MOVABLE in different context and calling it DMB is very confusing TBH. Just a note that I described the idea of a "PREFER_MOVABLE" zone in the past. In contrast to ZONE_MOVABLE, we cannot run into weird OOM situations in a ZONE misconfiguration, and we'd end up placing only movable allocations on it as long as we can. However, especially gigantic pages could be allocated from it. It sounds kind-of more like what you want -- and maybe in combination of daxctl to let the user decide how to online memory ranges. And just to make it clear again: depending on ZONE_MOVABLE == only user space allocations is not future proof. > >> > Thanks for your consideration, > Dough Baker ... I mean Doug Berger :). :) Thanks Doug!
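To make the dax/kmem suggestion above concrete, the admin flow might look roughly like this, assuming the MEMC1 range were exposed as a device-dax instance (the device name dax0.0 and the memory block number are purely illustrative):

    # Hand the device-dax range to the kernel as hotplugged System RAM
    # via the dax/kmem driver.
    daxctl reconfigure-device dax0.0 --mode=system-ram

    # Online the resulting memory block(s) into ZONE_MOVABLE; recent
    # daxctl versions can also perform the onlining themselves.
    echo online_movable > /sys/devices/system/memory/memory96/state

This keeps the zone placement an administrative decision rather than a kernel one, which is the main point of the suggestion.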
On 9/29/2022 2:00 AM, David Hildenbrand wrote: > On 20.09.22 03:03, Doug Berger wrote: >> On 9/19/2022 2:00 AM, David Hildenbrand wrote: >>> Hi Dough, >>> >>> I have some high-level questions. >> Thanks for your interest. I will attempt to answer them. >> > > Hi Doug, > > sorry for the late reply, slowly catching up on mails. Thanks for finding the time, and for the thoughtful feedback. > >>> >>>> MOTIVATION: >>>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory >>>> controllers with each mapped in a different address range within >>>> a Uniform Memory Architecture. Some users of these systems have >>> >>> How large are these areas typically? >>> >>> How large are they in comparison to other memory in the system? >>> >>> How is this memory currently presented to the system? >> I'm not certain what is typical because these systems are highly >> configurable and Broadcom's customers have different ideas about >> application processing. >> >> The 7278 device has four ARMv8 CPU cores in an SMP cluster and two >> memory controllers (MEMCs). Each MEMC is capable of controlling up to >> 8GB of DRAM. An example 7278 system might have 1GB on each controller, >> so an arm64 kernel might see 1GB on MEMC0 at 0x40000000-0x7FFFFFFF and >> 1GB on MEMC1 at 0x300000000-0x33FFFFFFF. >> >> The Designated Movable Block concept introduced here has the potential >> to offer useful services to different constituencies. I tried to >> highlight this in my V1 patch set with the hope of attracting some >> interest, but it can complicate the overall discussion, so I would like >> to maybe narrow the discussion here. It may be good to keep them in mind >> when assessing the overall value, but perhaps the "other opportunities" >> can be covered as a follow on discussion. >> >> The base capability described in commits 7-15 of this V1 patch set is to >> allow a 'movablecore' block to be created at a particular base address >> rather than solely at the end of addressable memory. >> > > Just so we're on the same page: > > Having too much ZONE_MOVABLE memory (ratio compared to !ZONE_MOVABLE > memory) is dangerous. Acceptable ratios highly depend on the target > workload. An extreme example is memory-hungry applications that end up > long-term pinning a lot of memory (e.g., VMs with SR-IO): we can run > easily out of free memory in the !ZONE_MOVABLE zones and might not want > ZONE_MOVABLE at all. Definitely. I've had to explain this to application developers myself :). This is fundamentally why the existing 'movablecore' implementation is insufficient for multiple memory controllers. Placing any ZONE_MOVABLE memory on the lower addressed memory controller forces all of the higher addressed memory controller(s) to only contain ZONE_MOVABLE memory, which is generally unacceptable for any workload. > > So whatever we do, this should in general not be the kernel sole > decision to make this memory any special and let ZONE_MOVABLE manage it. I believe you are stating that Designated Movable Blocks should only be created as a result of special configuration (e.g. kernel parameters, devicetree, ...). I would agree with that. Is that what you intended by this statement, or am I missing something? > > It's the same with CMA. "Heavy" CMA users require special configuration: > hugetlb_cma is one prime example. > >>> >>>> expressed the desire to locate ZONE_MOVABLE memory on each >>>> memory controller to allow user space intensive processing to >>>> make better use of the additional memory bandwidth. 
>>> >>> Can you share some more how exactly ZONE_MOVABLE would help here to make >>> better use of the memory bandwidth? >> ZONE_MOVABLE memory is effectively unusable by the kernel. It can be >> used by user space applications through both the page allocator and the >> Hugetlbfs. If a large 'movablecore' allocation is defined and it can > > Hugetlbfs not necessarily by all architectures. Some architectures don't > support placing hugetlb pages on ZONE_MOVABLE (not migratable) and > gigantic pages are special either way. That's true. > >> only be located at the end of addressable memory then it will always be >> located on MEMC1 of a 7278 system. This will create a tendency for user >> space accesses to consume more bandwidth on the MEMC1 memory controller >> and kernel space accesses to consume more bandwidth on MEMC0. A more >> even distribution of ZONE_MOVABLE memory between the available memory >> controllers in theory makes more memory bandwidth available to user >> space intensive loads. >> > > Sorry to be dense, is this also about different memory access latency or > just memory bandwidth? Broadcom memory controllers do support configurable real-time scheduling with bandwidth guarantees for different memory clients so I suppose this is a fair question. However, the expectation here is that the CPUs would have equivalent access latencies, so it is really just about memory bandwidth for the CPUs. > > Do these memory areas have special/different performance > characteristics? Using dedicated/fake NUMA nodes might be more in line > with what CXL and PMEM are up to. > > Using ZONE_MOVABLE for that purpose feels a little bit like an abuse of > the mechanism. Current usage intends to have equivalent performance from a CPU perspective. God forbid any Broadcom customers read your questions and start asking for such capabilities :), but if they do I agree that ZONE_MOVABLE for that purpose would be harebrained. > To be clearer what I mean: > > We can place any movable allocations on ZONE_MOVABLE, including kernel > allocations. User space allocations are just one example, and int he > future we'll turn more and more allocations movable to be able to cope > with bigger ZONE_MOVABLE demands due to DAX/CXL. I once looked into > migrating user space page tables, just to give an example. That's good to know. > > >>> >>>> Unfortunately, the historical monotonic layout of zones would >>>> mean that if the lowest addressed memory controller contains >>>> ZONE_MOVABLE memory then all of the memory available from >>>> memory controllers at higher addresses must also be in the >>>> ZONE_MOVABLE zone. This would force all kernel memory accesses >>>> onto the lowest addressed memory controller and significantly >>>> reduce the amount of memory available for non-movable >>>> allocations. >>> >>> We do have code that relies on zones during boot to not overlap within a >>> single node. >> I believe my changes address all such reliance, but if you are aware of >> something I missed please let me know. >> > > One example I'm aware of is drivers/base/memory.c:memory_block_add_nid() > / early_node_zone_for_memory_block(). > > If we get it wrong, or actually have memory blocks that span multiple > zones, we can no longer offline these memory blocks. We really wanted to > avoid scanning the memmap for now and it seems to get the job done in > environments we care about. 
To the extent that this implementation only supports creating Designated Movable Blocks in boot memory and boot memory does not generally support offlining, I wouldn't expect this to be an issue. However, if for some reason offlining boot memory becomes desirable then we should use dmb_intersects() along with zone_intersects() to take the appropriate action. Based on the current usage of zone_intersects() I'm not entirely sure what the correct action should be. > >>> >>>> >>>> The main objective of this patch set is therefore to allow a >>>> block of memory to be designated as part of the ZONE_MOVABLE >>>> zone where it will always only be used by the kernel page >>>> allocator to satisfy requests for movable pages. The term >>>> Designated Movable Block is introduced here to represent such a >>>> block. The favored implementation allows modification of the >>> >>> Sorry to say, but that term is rather suboptimal to describe what you >>> are doing here. You simply have some system RAM you'd want to have >>> managed by ZONE_MOVABLE, no? >> That may be true, but I found it superior to the 'sticky' movable >> terminology put forth by Mel Gorman ;). I'm happy to entertain >> alternatives, but they may not be as easy to find as you think. > > Especially the "blocks" part is confusing. Movable pageblocks? Movable > Linux memory blocks? > > Note that the sticky movable *pageblocks* were a completely different > concept than simply reusing ZONE_MOVABLE for some memory ranges. I would say that is open for debate. The implementations would be "completely different" but the objectives could be quite similar. There appear to be a number of people that are interested in the concept of memory that can only contain data that tolerates relocation for various potentially non-competing reasons. Fundamentally, the concept of MIGRATE_MOVABLE memory is useful to allow competing user space processes to share limited physical memory supplied by the kernel. The data in that memory can be relocated elsewhere by the kernel when the process that owns it is not executing. This movement is typically not observable to the owning process which has its own address space. The kernel uses MIGRATE_UNMOVABLE memory to protect the integrity of its address space, but of course what the kernel considers unmovable could in fact be moved by a hypervisor in a way that is analogous to what the kernel does for user space. For maximum flexibility the Linux memory management allows for converting the migratetype of free memory to help satisfy requests to allocate pages of memory through a mechanism I will call "fallback". The concepts of sticky movable pageblocks and ZONE_MOVABLE have the common objective of preventing the migratetype of pageblocks from getting converted to anything other than MIGRATE_MOVABLE, and this is what makes the memory special. I agree with Mel Gorman that zones are meant to be about address induced limitations, so using a zone for the purpose of breaking the fallback mechanism of the page allocator is a misuse of the concept. A new migratetype would be more appropriate for representing this change in how fallback should apply to the pageblock because the desired behavior has nothing to do with the address at which the memory is located. It is entirely reasonable to desire "sticky" movable behavior for memory in any zone. Such a solution would be directly applicable to our multiple memory controller use case, and is really how Designated Movable Blocks should be imagined. 
However, I also recognize the efficiency benefits of using a ZONE_MOVABLE zone to manage the pages that have this "sticky" movable behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a new free_list to every free_area which increases the search space and associated work when trying to allocate a page for all callers. Introducing ZONE_MOVABLE reduces the search space by providing an early separation between searches for movable and non-movable allocations. The classic zone restrictions weren't a good fit for multiple memory controllers, but those restrictions were lifted to overcome similar issues with memory_hotplug. It is not that Designated Movable Blocks want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a convenience for managing the page allocators use of "sticky" movable memory just like it does for memory hotplug. Dumping the memory in Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing mechanism to be reused, reducing the risk of negatively impacting the page allocator behavior. There are some subtle distinctions between Designated Movable Blocks and the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are reserved when created they are protected against any early boot time kernel reservations that might place unmovable allocations in them. The implementation continues to track the zone_movable_pfn as the start of the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block can overlap any other zone including the "classic" ZONE_MOVABLE zone. > >> >>> >>>> 'movablecore' kernel parameter to allow specification of a base >>>> address and support for multiple blocks. The existing >>>> 'movablecore' mechanisms are retained. Other mechanisms based on >>>> device tree are also included in this set. >>>> >>>> BACKGROUND: >>>> NUMA architectures support distributing movablecore memory >>>> across each node, but it is undesirable to introduce the >>>> overhead and complexities of NUMA on systems that don't have a >>>> Non-Uniform Memory Architecture. >>> >>> How exactly would that look like? I think I am missing something :) >> The notion would be to consider each memory controller as a separate >> node, but as stated it is not desirable. >> > > Doing it the DAX/CXL way would be to expose these memory ranges as > daxdev instead, and letting the admin decide how to online these memory > ranges when adding them to the buddy via the dax/kmem kernel module. > > That could mean that your booting with memory on MC0 only, and expose > memory of MC1 via a daxdev, giving the admin the possibility do decide > to which zone the memory should be onlined too. > > That would avoid most kernel code changes. I wasn't familiar with these kernel mechanisms and did enjoy reading about the somewhat oxymoronic "volatile-use of persistent memory" that is dax/kmem, but this isn't performance differentiated RAM. It really is just System RAM so this degree of complexity seems unwarranted. > >>> >>> Why can't we simply designate these regions as CMA regions? >> We and others have encountered significant performance issues when large >> CMA regions are used. There are significant restrictions on the page >> allocator's use of MIGRATE_CMA pages and the memory subsystem works very >> hard to keep about half of the memory in the CMA region free. There have >> been attempts to patch the CMA implementation to alter this behavior >> (for example the set I referenced Mel's response to in [1]), but there >> are users that desire the current behavior. 
> > Optimizing that would be great, eventually making it configurable or > selecting the behavior based on the actual CMA area sizes. > >> >>> >>> Why do we have to start using ZONE_MOVABLE for them? >> One of the "other opportunities" for Designated Movable Blocks is to >> allow CMA to allocate from a DMB as an alternative. This would allow >> current users to continue using CMA as they want, but would allow users >> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to >> let the kernel page allocator make more complete use (i.e. waste less) >> of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE >> so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a >> DMB. > > The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end > up on it. The biggest difference to CMA is that the CMA *owner* is able > to place unmovable allocations on it. I'm not sure that is a wholly fair characterization (or maybe I just hope that's the case :). I would agree that the Linux page allocator can't place any unmovable allocations on it. I expect that people locate memory in ZONE_MOVABLE for different purposes. For example, the memory hotplug users ostensibly place memory their so that any data on the hot plugged memory can be moved off of the memory prior to it being hot unplugged. Unplugging the memory removes the memory from the ZONE_MOVABLE zone, but it is not materially different from allocating the memory for a different purpose (perhaps in a different machine). Conceptually, allowing a CMA allocator to operate on a Designated Movable Block of memory that it *owns* is also removing that memory from the ZONE_MOVABLE zone. Issues of ownership should be addressed which is why these "other opportunities" are being deferred for now, but I do not believe such use is unreasonable. Again, Designated Movable Blocks are only allowed in boot memory so there shouldn't be a conflict with memory hotplug. I believe the same would apply for hugetlb_cma. > > Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not > acceptable as is. > > Using ZONE_MOVABLE in different context and calling it DMB is very > confusing TBH. Perhaps it is more helpful to think of a Designated Movable Block as a block of memory whose migratetype is not allowed to be changed from MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that ZONE_MOVABLE is being used to achieve that is an implementation detail for this commit set. In the same way that memory hotplug is the concept of adding System RAM during run time, but placing it in ZONE_MOVABLE is an implementation detail to make it easier to unplug. > > Just a note that I described the idea of a "PREFER_MOVABLE" zone in the > past. In contrast to ZONE_MOVABLE, we cannot run into weird OOM > situations in a ZONE misconfiguration, and we'd end up placing only > movable allocations on it as long as we can. However, especially > gigantic pages could be allocated from it. It sounds kind-of more like > what you want -- and maybe in combination of daxctl to let the user > decide how to online memory ranges. Best not let Mel hear you suggesting another zone;). > > > And just to make it clear again: depending on ZONE_MOVABLE == only user > space allocations is not future proof. Understood. > >> >>> >> Thanks for your consideration, >> Dough Baker ... I mean Doug Berger :). > > > :) Thanks Doug! > Thank you! -Doug
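For reference, the search-space argument Doug makes above: every free_area in the buddy allocator already carries one free list per migratetype (include/linux/mmzone.h), so a new "sticky" migratetype would add another list to every per-order search for all callers, whereas reusing ZONE_MOVABLE separates movable from non-movable memory at zone selection time.

    struct free_area {
            struct list_head        free_list[MIGRATE_TYPES];
            unsigned long           nr_free;
    };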
>> So whatever we do, this should in general not be the kernel sole >> decision to make this memory any special and let ZONE_MOVABLE manage it. > I believe you are stating that Designated Movable Blocks should only be > created as a result of special configuration (e.g. kernel parameters, > devicetree, ...). I would agree with that. Is that what you intended by > this statement, or am I missing something? Essentially, that it should mostly be the decision of an educated admin. ... >> >>> only be located at the end of addressable memory then it will always be >>> located on MEMC1 of a 7278 system. This will create a tendency for user >>> space accesses to consume more bandwidth on the MEMC1 memory controller >>> and kernel space accesses to consume more bandwidth on MEMC0. A more >>> even distribution of ZONE_MOVABLE memory between the available memory >>> controllers in theory makes more memory bandwidth available to user >>> space intensive loads. >>> >> >> Sorry to be dense, is this also about different memory access latency or >> just memory bandwidth? > Broadcom memory controllers do support configurable real-time scheduling > with bandwidth guarantees for different memory clients so I suppose this > is a fair question. However, the expectation here is that the CPUs would > have equivalent access latencies, so it is really just about memory > bandwidth for the CPUs. Okay, thanks for clarifying. ... >>>> >>>>> Unfortunately, the historical monotonic layout of zones would >>>>> mean that if the lowest addressed memory controller contains >>>>> ZONE_MOVABLE memory then all of the memory available from >>>>> memory controllers at higher addresses must also be in the >>>>> ZONE_MOVABLE zone. This would force all kernel memory accesses >>>>> onto the lowest addressed memory controller and significantly >>>>> reduce the amount of memory available for non-movable >>>>> allocations. >>>> >>>> We do have code that relies on zones during boot to not overlap within a >>>> single node. >>> I believe my changes address all such reliance, but if you are aware of >>> something I missed please let me know. >>> >> >> One example I'm aware of is drivers/base/memory.c:memory_block_add_nid() >> / early_node_zone_for_memory_block(). >> >> If we get it wrong, or actually have memory blocks that span multiple >> zones, we can no longer offline these memory blocks. We really wanted to >> avoid scanning the memmap for now and it seems to get the job done in >> environments we care about. > To the extent that this implementation only supports creating Designated > Movable Blocks in boot memory and boot memory does not generally support > offlining, I wouldn't expect this to be an issue. However, if for some Sad truth is, that boot memory sometimes is supposed to support offlining -- or people expect it to work to some degree. For example, with special memblock hacks you can get them into ZONE_MOVABLE to be able to hotunplug some NUMA nodes even after a reboot (movable_node kernel parameter). There are use cases where you want to offline boot memory to save energy by disabling complete memory banks -- best effort when not using ZONE_MOVABLE. Having that said, I agree that it's a corner case use case. > reason offlining boot memory becomes desirable then we should use > dmb_intersects() along with zone_intersects() to take the appropriate > action. Based on the current usage of zone_intersects() I'm not entirely > sure what the correct action should be. 
> >> >>>> >>>>> >>>>> The main objective of this patch set is therefore to allow a >>>>> block of memory to be designated as part of the ZONE_MOVABLE >>>>> zone where it will always only be used by the kernel page >>>>> allocator to satisfy requests for movable pages. The term >>>>> Designated Movable Block is introduced here to represent such a >>>>> block. The favored implementation allows modification of the >>>> >>>> Sorry to say, but that term is rather suboptimal to describe what you >>>> are doing here. You simply have some system RAM you'd want to have >>>> managed by ZONE_MOVABLE, no? >>> That may be true, but I found it superior to the 'sticky' movable >>> terminology put forth by Mel Gorman ;). I'm happy to entertain >>> alternatives, but they may not be as easy to find as you think. >> >> Especially the "blocks" part is confusing. Movable pageblocks? Movable >> Linux memory blocks? >> >> Note that the sticky movable *pageblocks* were a completely different >> concept than simply reusing ZONE_MOVABLE for some memory ranges. > I would say that is open for debate. The implementations would be > "completely different" but the objectives could be quite similar. > There appear to be a number of people that are interested in the concept > of memory that can only contain data that tolerates relocation for > various potentially non-competing reasons. > > Fundamentally, the concept of MIGRATE_MOVABLE memory is useful to allow > competing user space processes to share limited physical memory supplied > by the kernel. The data in that memory can be relocated elsewhere by the > kernel when the process that owns it is not executing. This movement is > typically not observable to the owning process which has its own address > space. > > The kernel uses MIGRATE_UNMOVABLE memory to protect the integrity of its > address space, but of course what the kernel considers unmovable could > in fact be moved by a hypervisor in a way that is analogous to what the > kernel does for user space. > > For maximum flexibility the Linux memory management allows for > converting the migratetype of free memory to help satisfy requests to > allocate pages of memory through a mechanism I will call "fallback". The > concepts of sticky movable pageblocks and ZONE_MOVABLE have the common > objective of preventing the migratetype of pageblocks from getting > converted to anything other than MIGRATE_MOVABLE, and this is what makes > the memory special. Yes, good summary. > > I agree with Mel Gorman that zones are meant to be about address induced > limitations, so using a zone for the purpose of breaking the fallback > mechanism of the page allocator is a misuse of the concept. A new > migratetype would be more appropriate for representing this change in > how fallback should apply to the pageblock because the desired behavior > has nothing to do with the address at which the memory is located. It is > entirely reasonable to desire "sticky" movable behavior for memory in > any zone. Such a solution would be directly applicable to our multiple > memory controller use case, and is really how Designated Movable Blocks > should be imagined. I usually agree with Mel, but not necessarily on that point that it's a misuse of a concept. It's an extension of an existing concept, that doesn't imply it's a misuse. Traditionally, it was about address limitations, yes. Now it's also about allocation types. Sure, there might be other ways to get it done as well. 
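(An aside for anyone following the thread who has not stared at the page allocator lately: the "fallback" Doug summarizes boils down to a small per-migratetype preference order that the allocator walks when the free list for the requested type is empty, stealing and converting pageblocks of another type as needed. A deliberately simplified, self-contained illustration of the idea -- not a quote of the actual mm/page_alloc.c table:)

/* Simplified illustration only -- not the real mm/page_alloc.c code. */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_TYPES,
};

/*
 * When the free list for the requested type is empty, the allocator walks
 * an order like this and may "steal" a pageblock of the fallback type,
 * changing that pageblock's migratetype.  Keeping pageblocks permanently
 * MIGRATE_MOVABLE means exempting them from this conversion.
 */
static const enum migratetype fallback_order[MIGRATE_TYPES][2] = {
	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE   },
	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE   },
};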
I'd compare it to the current use of NUMA nodes: traditionally, it really used to be actual NUMA nodes. Nowadays, it's a mechanism, for example, to expose performance-differentiated memory, let applications use it via mbind() or have the page allocator dynamically migrate hot/cold pages back and forth according to memory tiering strategies. > > However, I also recognize the efficiency benefits of using a > ZONE_MOVABLE zone to manage the pages that have this "sticky" movable > behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a > new free_list to every free_area which increases the search space and > associated work when trying to allocate a page for all callers. > Introducing ZONE_MOVABLE reduces the search space by providing an early > separation between searches for movable and non-movable allocations. The > classic zone restrictions weren't a good fit for multiple memory > controllers, but those restrictions were lifted to overcome similar > issues with memory_hotplug. It is not that Designated Movable Blocks > want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a > convenience for managing the page allocator's use of "sticky" movable > memory just like it does for memory hotplug. Dumping the memory in > Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing > mechanism to be reused, reducing the risk of negatively impacting the > page allocator behavior. > > There are some subtle distinctions between Designated Movable Blocks and > the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are > reserved when created they are protected against any early boot time > kernel reservations that might place unmovable allocations in them. The > implementation continues to track the zone_movable_pfn as the start of > the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block > can overlap any other zone including the "classic" ZONE_MOVABLE zone. What exactly do you mean with "overlay" -- I assume you mean that the zone span will overlay but it really "belongs" to ZONE_MOVABLE, as indicated by its struct page metadata. >> >> Doing it the DAX/CXL way would be to expose these memory ranges as >> daxdev instead, and letting the admin decide how to online these memory >> ranges when adding them to the buddy via the dax/kmem kernel module. >> >> That could mean that you're booting with memory on MC0 only, and expose >> memory of MC1 via a daxdev, giving the admin the possibility to decide >> to which zone the memory should be onlined. >> >> That would avoid most kernel code changes. > I wasn't familiar with these kernel mechanisms and did enjoy reading > about the somewhat oxymoronic "volatile-use of persistent memory" that > is dax/kmem, but this isn't performance differentiated RAM. It really is > just System RAM so this degree of complexity seems unwarranted. It's an existing mechanism that will get heavily used by CXL -- for all kinds of memory. I feel like it could solve your use case eventually. Excluded memory cannot be allocated by the early allocator and you can online it to ZONE_MOVABLE. It at least seems to roughly do something you want to achieve. I'd be curious what you can't achieve or what we might need to make >>> >>>> >>>> Why do we have to start using ZONE_MOVABLE for them? >>> One of the "other opportunities" for Designated Movable Blocks is to >>> allow CMA to allocate from a DMB as an alternative. This would allow >>> current users to continue using CMA as they want, but would allow users >>> (e.g. 
hugetlb_cma) that are not sensitive to the allocation latency to >>> let the kernel page allocator make more complete use (i.e. waste less) >>> of the shared memory. ZONE_MOVABLE pageblocks are always MIGRATE_MOVABLE >>> so the restrictions placed on MIGRATE_CMA pageblocks are lifted within a >>> DMB. >> >> The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end >> up on it. The biggest difference to CMA is that the CMA *owner* is able >> to place unmovable allocations on it. > I'm not sure that is a wholly fair characterization (or maybe I just > hope that's the case :). I would agree that the Linux page allocator > can't place any unmovable allocations on it. I expect that people locate > memory in ZONE_MOVABLE for different purposes. For example, the memory > hotplug users ostensibly place memory there so that any data on the hot > plugged memory can be moved off of the memory prior to it being hot > unplugged. Unplugging the memory removes the memory from the > ZONE_MOVABLE zone, but it is not materially different from allocating > the memory for a different purpose (perhaps in a different machine). Well, memory offlining is the one operation that evacuates memory and makes sure it cannot be allocated anymore (possibly with the intention of removing that memory from the system). Sure, you can call it a fake allocation, but there is a more fundamental difference compared to random subsystems placing unmovable allocations there. > > Conceptually, allowing a CMA allocator to operate on a Designated > Movable Block of memory that it *owns* is also removing that memory from > the ZONE_MOVABLE zone. Issues of ownership should be addressed which is > why these "other opportunities" are being deferred for now, but I do not > believe such use is unreasonable. Again, Designated Movable Blocks are > only allowed in boot memory so there shouldn't be a conflict with memory > hotplug. I believe the same would apply for hugetlb_cma. >> >> Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not >> acceptable as is. >> >> Using ZONE_MOVABLE in different context and calling it DMB is very >> confusing TBH. > Perhaps it is more helpful to think of a Designated Movable Block as a > block of memory whose migratetype is not allowed to be changed from > MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that > > I think that such a description might make the feature easier to grasp, although I am not sure yet if DMB as proposed is rather a hack to avoid introducing real sticky movable blocks (sorry, I'm just trying to connect the dots and there is a lot of complexity involved) or actually a clean design. 
However, especially >> gigantic pages could be allocated from it. It sounds kind-of more like >> what you want -- and maybe in combination with daxctl to let the user >> decide how to online memory ranges. > Best not let Mel hear you suggesting another zone ;). He most probably read it already. ;) I can understand all theoretical complaints about ZONE_MOVABLE, but in the end it has been getting the job done for years. > >> >> >> And just to make it clear again: depending on ZONE_MOVABLE == only user >> space allocations is not future proof. > Understood. May I ask what the main purpose/use case of DMB is? Would it be sufficient to specify that hugetlb pages are allocated from a specific memory area, possibly managed by CMA? And then simply provide the application that cares with these hugetlb pages? Would you need something that is *not* hugetlb? But even then, how would an application be able to specify that exactly its allocation will get served from that part of ZONE_MOVABLE? Sure, if you don't reserve any other hugetlb pages, it's easy. I'd like to note that if you'd go with (fake) NUMA nodes like PMEM or CXL you could easily let your application mbind() to that memory and have it configured.
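A rough sketch of what I mean -- assuming, purely for illustration, that the memory in question had been exposed as (fake) NUMA node 1; the node number and size here are made up:

#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 64UL << 20;		/* 64 MiB, arbitrary */
	unsigned long nodemask = 1UL << 1;	/* node 1 (hypothetical) */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Ask that faults on this range be satisfied from node 1 only. */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
		perror("mbind");

	/* ... user space intensive processing on p ... */
	return 0;
}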
Reordered to (hopefully) improve readability. On 10/5/2022 11:39 AM, David Hildenbrand wrote: > May I ask what the main purpose/use case of DMB is? The concept of Designated Movable Blocks was conceived to provide a common mechanism for different use cases, so identifying the "main" one is not so easy. Broadly speaking I would say there are two different but compatible objectives that could be used to categorize use cases. The narrower objective is the ability to locate some "user space friendly" memory on each memory controller to make more of the total memory bandwidth available to user space processes. The ZONE_MOVABLE zone is considered to be "user space friendly" so locating some of it on each memory controller would meet this objective. The existing 'movablecore' kernel parameter allows the balance of kernel/movable memory to be adjusted, but movable memory will always be located on the highest addressed memory controller. The v2 patch set attempts to focus explicitly on the use case of adding a base address to the 'movablecore' kernel parameter to support this objective. The other general objective is to facilitate better reuse/sharing of memory. Broadcom Set-Top Box SoCs include video processing devices that can require large amounts of memory to perform their functions. Historically, memory carve-outs have been used to ensure guaranteed availability of memory to meet the requirements of cable television customers. The rise of Android TV and Google TV have made the inefficiency of memory carve-outs unacceptable. We have tried to meet the reusability objective with a CMA based implementation, but Broadcom customers were unhappy with the performance. Efforts to improve the CMA performance led me to Joonsoo's efforts to do the same and to the "sticky" MIGRATE_MOVABLE proposal from Mel Gorman that I cited. I began working on an implementation of Designated Movable Blocks based on that proposal which could be characterized as reserving a block of memory, assigning it a new "sticky" movable migrate type, and modifying the fast and slow path page allocators to handle the new migrate type such that requests for movable memory could be satisfied by pages from the blocks and that the migrate type of pages in the blocks could not be changed by "fallback" mechanisms. Both of these objectives require the ability to specify the location of a block of memory that can only be used by the Linux kernel page allocator to satisfy requests for movable memory. The location is relevant because it may need to be on a specific memory controller or it may have to satisfy the DMA address range of a specific device. The movability is relevant because it improves the availability to user space allocations or it allows the data occupying the memory to be moved away when the memory is required by the device. The Designated Movable Block mechanism was designed to satisfy these requirements and was seen as a common mechanism for both objectives. While learning more about the page allocator implementation, I realized that hotplug memory also has these same requirements. The location of hotplug memory is determined by the system hardware independent of Linux's zone concepts and the data stored on the memory must be movable to support the ability to offline the memory before it is unplugged. This led me to study the hotplug memory implementation to understand how they satisfied these requirements. 
I became aware that the "narrower objective" could conceivably be satisfied by the hotplug memory capability, albeit with a few challenges. First, the size of hotplug memory sections is a bit coarse. The current 128MB sections on arm64 are not too bad and are far better than the 1GB sections that were in place when I first looked at it. For systems that do not support ACPI there is no clear way to specify hotplug memory regions at boot time. When Linux boots an arm64 kernel with devicetree the OS attempts to initialize all available memory described by the devicetree. Typically this boot memory cannot be unplugged to allow it to be plugged into a different zone. A devicetree specification of the hardware could intentionally leave holes in its memory description to allow for runtime plugging of memory into the holes, but this goes against the spirit of a devicetree description of the system hardware as it is not representative of what hardware is actually present. The 'mem=' kernel parameter can be used to prevent Linux from initializing all of the available memory so that memory could be hotplugged after boot, but this breaks devicetree mechanisms for reserving memory from addresses that might only be populated by hotplug after boot. It also becomes difficult to manage the selection of zones where memory is hotplugged. Referring again to the example system with 1GB on MEMC0 and 1GB on MEMC1, we could boot with 'mem=768M' to leave 256MB unpopulated on MEMC0 and all of the memory (1GB) on MEMC1 unpopulated. If we set the memory_hotplug module parameter online_policy to "auto-movable" then adding 256MB at 0x70000000 will put the memory in ZONE_MOVABLE as desired. However, we might want to hotplug 768MB at 0x300000000 into ZONE_NORMAL and 256MB at 0x330000000 into ZONE_MOVABLE. The fact that the memory_hotplug parameters are not easily modifiable from the kernel modules that are necessary to access the memory_hotplug API makes this a difficult dance. I have experimented with a simple module exposing hotplug capability to user space and have confirmed as a proof of concept that user space can adjust the memory_hotplug parameters and use the module to achieve the desired zone population with hotplug. The /sys/devices/system/memory/probe control simplifies this, but is not enabled on arm64 architectures. In addition, keeping this memory unplugged until after boot means that the memory cannot be used during boot. Kernel boot time reservations are a mixed bag. On the one hand they won't land in ZONE_MOVABLE which is nice, but in this example they land in ZONE_DMA which can be considered a more valuable resource than ZONE_NORMAL. Neither of these issues is likely to be of significant consequence, but neither is really desirable. Finally, just like there are those that may not want to execute a NUMA kernel (e.g. Android GKI arm64), there may also be those that don't want to include memory hotplug support in their kernel. These things can change, but are not always under our control. If you are aware of solutions to these issues that would make memory hotplug a more viable solution for us than DMB I would be happy to know them. These observations led me to design DMB more as an extension of 'movablecore' than an extension of memory hotplug. 
However, the efficiency of using the ZONE_MOVABLE zone to collect and manage "sticky" movable pages in an address independent way without "fallback" (as is done by memory hotplug) won me over and I abandoned the idea of modifying the fast and slow page allocator paths to support a "sticky" movable migrate type. The implementation of DMB was re-conceived to preserve the existing 'movablecore' mechanism of creating a dynamic ZONE_MOVABLE zone that spans from zone_movable_pfn for each node to the end of memory on the node, and adding the ability to designate blocks of memory whose pages would be removed from their default zone and placed in the ZONE_MOVABLE zone. The span of each ZONE_MOVABLE zone was increased to start at the lowest pfn in the zone on the node and continue to the end of memory on the node. I also neglected to destroy zones that became empty after their pages were moved to ZONE_MOVABLE. These last two decisions were a matter of convenience, but I can see that they may have created some confusion (based on your questions) so I am happy to reconsider them. > > Would it be sufficient, to specify that hugetlb are allocated from a > specific memory area, possible managed by CMA? And then simply providing > the application that cares these hugetlb pages? Would you need something > that is *not* hugetlb? > > But even then, how would an application be able to specify that exactly > it's allocation will get served from that part of ZONE_MOVABLE? Sure, if > you don't reserve any other hugetlb pages, it's easy. As noted before I actually have very limited visibility into how the "narrower objective" is being used by Broadcom customers and how much benefit it provides. I believe its current use is probably simply opportunistic, but these kinds of improvements to hugetlb allocation might be welcomed. I'd say the hugetlb_cma is similar to what you are describing except that it is consolidated rather than being distributed across multiple memory areas. Such changes to add benefit to the "narrower objective" need not be considered with respect to this patch set. On the other hand, the reuse objective of Designated Movable Blocks could be very relevant to hugetlb_cma. >> >> I agree with Mel Gorman that zones are meant to be about address induced >> limitations, so using a zone for the purpose of breaking the fallback >> mechanism of the page allocator is a misuse of the concept. A new >> migratetype would be more appropriate for representing this change in >> how fallback should apply to the pageblock because the desired behavior >> has nothing to do with the address at which the memory is located. It is >> entirely reasonable to desire "sticky" movable behavior for memory in >> any zone. Such a solution would be directly applicable to our multiple >> memory controller use case, and is really how Designated Movable Blocks >> should be imagined. > > I usually agree with Mel, but not necessarily on that point that it's a > misuse of a concept. It's an extension of an existing concept, that > doesn't imply it's a misuse. Traditionally, it was about address > limitations, yes. Now it's also about allocation types. Sure, there > might be other ways to get it done as well. Yes, I would also agree that when introduced that was the concept, but that the extensions made for memory hotplug have enough value to be a justified extension of the initial concept. That is exactly why I changed my approach. 
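For reference, the user-space half of the hotplug "dance" described a little earlier amounts to something like the following; the memory block number is a made-up placeholder and in practice has to be derived from the physical address and /sys/devices/system/memory/block_size_bytes:

#include <stdio.h>

int main(void)
{
	/* memory96 is a placeholder; the real block number depends on the
	 * memory block size and the physical address that was added. */
	FILE *f = fopen("/sys/devices/system/memory/memory96/state", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* "online_movable" requests ZONE_MOVABLE; "online_kernel" would
	 * place the block in a kernel zone instead. */
	if (fputs("online_movable", f) == EOF)
		perror("fputs");
	fclose(f);
	return 0;
}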
> > I'd compare it to the current use of NUMA nodes: traditionally, it > really used to be actual NUMA nodes. Nowadays, it's a mechanism, for > example, to expose performance-differented memory, let applications use > it via mbind() or have the page allocator dynamically migrate hot/cold > pages back and forth according to memory tiering strategies. You are helping me gain an appreciation for the current extensions of the node concept beyond the initial use for NUMA. It does sound useful for applications that do want to have that finer control over the resources they use. However, I still believe there is value in the Designated Movable Block concept that should be realizable when nodes are not available in the kernel config. The implementation I am proposing should not incur a cost for those that don't wish to use it. > >> >> However, I also recognize the efficiency benefits of using a >> ZONE_MOVABLE zone to manage the pages that have this "sticky" movable >> behavior. Introducing a new sticky MIGRATE_MOVABLE migratetype adds a >> new free_list to every free_area which increases the search space and >> associated work when trying to allocate a page for all callers. >> Introducing ZONE_MOVABLE reduces the search space by providing an early >> separation between searches for movable and non-movable allocations. The >> classic zone restrictions weren't a good fit for multiple memory >> controllers, but those restrictions were lifted to overcome similar >> issues with memory_hotplug. It is not that Designated Movable Blocks >> want to be in ZONE_MOVABLE, but rather that ZONE_MOVABLE provides a >> convenience for managing the page allocators use of "sticky" movable >> memory just like it does for memory hotplug. Dumping the memory in >> Designated Movable Blocks into the ZONE_MOVABLE zone allows an existing >> mechanism to be reused, reducing the risk of negatively impacting the >> page allocator behavior. >> >> There are some subtle distinctions between Designated Movable Blocks and >> the existing ZONE_MOVABLE zone. Because Designated Movable Blocks are >> reserved when created they are protected against any early boot time >> kernel reservations that might place unmovable allocations in them. The >> implementation continues to track the zone_movable_pfn as the start of >> the "classic" ZONE_MOVABLE zone on each node. A Designated Movable Block >> can overlap any other zone including the "classic" ZONE_MOVABLE zone. > > What exactly to you mean with "overlay" -- I assume you mean that zone > span will overlay but it really "belongs" to ZONE_MOVABLE, as indicated > by it's struct page metadata. Yes. If the pages of a DMB are within the span of a zone I am saying it overlaps that zone. The pages will only be "present" in the ZONE_MOVABLE zone. >>>> >>>>> >>>>> Why do we have to start using ZONE_MOVABLE for them? >>>> One of the "other opportunities" for Designated Movable Blocks is to >>>> allow CMA to allocate from a DMB as an alternative. This would allow >>>> current users to continue using CMA as they want, but would allow users >>>> (e.g. hugetlb_cma) that are not sensitive to the allocation latency to >>>> let the kernel page allocator make more complete use (i.e. waste less) >>>> of the shared memory. ZONE_MOVABLE pageblocks are always >>>> MIGRATE_MOVABLE >>>> so the restrictions placed on MIGRATE_CMA pageblocks are lifted >>>> within a >>>> DMB. >>> >>> The whole purpose of ZONE_MOVABLE is that *no* unmovable allocations end >>> up on it. 
The biggest difference to CMA is that the CMA *owner* is able >>> to place unmovable allocations on it. >> I'm not sure that is a wholly fair characterization (or maybe I just >> hope that's the case :). I would agree that the Linux page allocator >> can't place any unmovable allocations on it. I expect that people locate >> memory in ZONE_MOVABLE for different purposes. For example, the memory >> hotplug users ostensibly place memory there so that any data on the hot >> plugged memory can be moved off of the memory prior to it being hot >> unplugged. Unplugging the memory removes the memory from the >> ZONE_MOVABLE zone, but it is not materially different from allocating >> the memory for a different purpose (perhaps in a different machine). > > Well, memory offlining is the one operation that evacuates memory) and > makes sure it cannot be allocated anymore (possibly with the intention > of removing that memory from the system). Sure, you can call it a fake > allocation, but there is a more fundamental difference compared to > random subsystems placing unmovable allocations there. For the record, I am not offended by your use of the word "random" in that statement. I was once informed I unintentionally offended someone by using the term "arbitrary" in a similar way ;). Any such unmovable allocation should be made with intent and with authority to do so. The memory hotunplug is an example (perhaps a singular one) of a subsystem that can do so with intent and authority. Randomness plays no role. "Ownership" of a DMB would imply authority and such an owner should be presumed to be acting with intent. So the mechanics of ownership and methods should be formalized before the general objective of reuse of DMBs for non-movable purposes (e.g. hugetlb_cma, device driver, ...) is allowed. This is why that objective has been deferred with the hope that users that may have an interest in this objective can propose their favored mechanism. The "narrower objective" expressed in my v2 submission (i.e. movablecore with base address) does not make any non-movable allocations so explicit ownership is not necessary. Maybe whoever provided the 'movablecore' parameter is the implied owner, but it doesn't much matter in this case. Conceptually such a DMB could be hotunplugged, but that would be unexpected. > >> >> Conceptually, allowing a CMA allocator to operate on a Designated >> Movable Block of memory that it *owns* is also removing that memory from >> the ZONE_MOVABLE zone. Issues of ownership should be addressed which is >> why these "other opportunities" are being deferred for now, but I do not >> believe such use is unreasonable. Again, Designated Movable Blocks are >> only allowed in boot memory so there shouldn't be a conflict with memory >> hotplug. I believe the same would apply for hugetlb_cma. >>> >>> Using ZONE_MOVABLE for unmovable allocations (hugetlb_cma) is not >>> acceptable as is. >>> >>> Using ZONE_MOVABLE in different context and calling it DMB is very >>> confusing TBH. >> Perhaps it is more helpful to think of a Designated Movable Block as a >> block of memory whose migratetype is not allowed to be changed from >> MIGRATE_MOVABLE (i.e. "sticky" migrate movable). The fact that > > I think that such a description might make the feature easier to grasp. > Although I am not sure yet if DMB as proposed is rather a hack to avoid > introducing real sticky movable blocks (sorry, I'm just trying to > connect the dots and there is a lot of complexity involved) or actually > a clean design. 
Messing with zones and memblock always implies > complexity :) I very much appreciate your efforts to make sense of this. I am not certain whether that OR is INCLUSIVE or EXCLUSIVE. I would say that the implementation attempts to reuse the clean design of ZONE_MOVABLE (as extended by memory hotplug) to provide the management of "sticky" movable blocks that may overlap/overlay other zones. Doing so makes it unnecessary to provide an otherwise redundant implementation of "sticky" movable blocks that would likely degrade the performance of page allocations from zones other than ZONE_MOVABLE, even when no "sticky" movable blocks exist in the system. > >> ZONE_MOVABLE is being used to achieve that is an implementation detail >> for this commit set. In the same way that memory hotplug is the concept >> of adding System RAM during run time, but placing it in ZONE_MOVABLE is >> an implementation detail to make it easier to unplug. > > Right, but there we don't play any tricks: it's just ZONE_MOVABLE > without any other metadata pointing out ownership. Maybe that's what you > are trying to describe here: A DMB inside ZONE_MOVABLE implies that > there is another owner and that even memory offlining should fail. Now why didn't I just say that in the first place :). The general objective of reuse is inspired by CMA which has implied/explicit ownership and as noted above DMB needs ownership to meet this objective as well. Thanks for your patience and helping me attempt to communicate this more clearly. -Doug