Message ID | 1306499498-14263-2-git-send-email-ankita@in.ibm.com (mailing list archive)
---|---
State | New, archived
On Fri, 2011-05-27 at 18:01 +0530, Ankita Garg wrote:
> +typedef struct mem_region_list_data {
> +	struct zone zones[MAX_NR_ZONES];
> +	int nr_zones;
> +
> +	int node;
> +	int region;
> +
> +	unsigned long start_pfn;
> +	unsigned long spanned_pages;
> +} mem_region_t;
> +
> +#define MAX_NR_REGIONS 16

Don't do the foo_t thing.  It's out of style, and pg_data_t is a
dinosaur.

I'm a bit surprised how little discussion of this there is in the patch
descriptions.  Why did you choose this structure?  What are the
downsides of doing it this way?  This effectively breaks up the zone's
LRU into MAX_NR_REGIONS LRUs.  What effects does that have?

How big _is_ a 'struct zone' these days?  This patch will increase
their effective size by 16x.

Since one distro kernel basically gets run on *EVERYTHING*, what will
MAX_NR_REGIONS be in practice?  How many regions are there on the
largest systems that will need this?  We're going to be doing many
linear searches and iterations over it, so it's pretty darn important
to know.

What does this do to lmbench numbers sensitive to page allocations?

-- Dave
* Dave Hansen <dave@linux.vnet.ibm.com> [2011-05-27 08:30:03]:

> On Fri, 2011-05-27 at 18:01 +0530, Ankita Garg wrote:
> > +typedef struct mem_region_list_data {
> > +	struct zone zones[MAX_NR_ZONES];
> > +	int nr_zones;
> > +
> > +	int node;
> > +	int region;
> > +
> > +	unsigned long start_pfn;
> > +	unsigned long spanned_pages;
> > +} mem_region_t;
> > +
> > +#define MAX_NR_REGIONS 16
>
> Don't do the foo_t thing.  It's out of style and the pg_data_t is a
> dinosaur.
>
> I'm a bit surprised how little discussion of this there is in the patch
> descriptions.  Why did you choose this structure?  What are the
> downsides of doing it this way?  This effectively breaks up the zone's
> LRU into MAX_NR_REGIONS LRUs.  What effects does that have?

This data structure is one of the options, but it definitely has
overheads.  One alternative was to use fake-NUMA nodes, which have more
overhead and user-visible quirks.  The overhead depends on the number
of regions actually defined by the platform; it may be 2-4 on smaller
systems.

This split is what makes allocation and reclaim work within these
boundaries, using the zone's active and inactive lists on a
per-memory-region basis.  An external structure that just captures the
boundaries would have less overhead, but it would not provide enough
hooks to influence the zone-level allocator and reclaim operations.

> How big _is_ a 'struct zone' these days?  This patch will increase
> their effective size by 16x.

Yes, this is not good; we should do a runtime allocation for the exact
number of regions that we need.  This can be optimized later, once we
design the data structure hierarchy with the least overhead for the
purpose.

> Since one distro kernel basically gets run on *EVERYTHING*, what will
> MAX_NR_REGIONS be in practice?  How many regions are there on the
> largest systems that will need this?  We're going to be doing many
> linear searches and iterations over it, so it's pretty darn important
> to know.
>
> What does this do to lmbench numbers sensitive to page allocations?

Yep, agreed, we are generally looking at 2-4 regions per node for most
purposes.  Also, regions need not be of equal size; they can be large
or small based on platform characteristics, so that we need not
fragment the zones below the level required.

The overall idea is to have a VM data structure that can capture
various boundaries of memory, and enable the allocation and reclaim
logic to target certain areas based on the boundaries and properties
required.  The NUMA node and pgdat are an example of capturing memory
distances.  The proposed memory regions should capture other,
orthogonal properties and boundaries of memory addresses, similar to
zone types.

Thanks for the quick feedback.

--Vaidy
On Fri, 2011-05-27 at 23:50 +0530, Vaidyanathan Srinivasan wrote:
> The overall idea is to have a VM data structure that can capture
> various boundaries of memory, and enable the allocations and reclaim
> logic to target certain areas based on the boundaries and properties
> required.

It's worth noting that we already do targeted reclaim on boundaries
other than zones.  Lumpy reclaim and memory compaction logically do the
same thing.  So, it's at least possible to do this without having the
global LRU designed around the way you want to reclaim.

Also, if you get _too_ dependent on the global LRU, what are you going
to do if our cgroup buddies manage to get cgroup'd pages off the global
LRU?

-- Dave
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e56f835..997a474 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -60,6 +60,7 @@ struct free_area {
 };
 
 struct pglist_data;
+struct mem_region_list_data;
 
 /*
  * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
@@ -311,6 +312,7 @@ struct zone {
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
 #endif
+	int region;
 	struct per_cpu_pageset __percpu *pageset;
 	/*
 	 * free areas of different sizes
@@ -399,6 +401,8 @@ struct zone {
 	 * Discontig memory support fields.
 	 */
 	struct pglist_data	*zone_pgdat;
+	struct mem_region_list_data *zone_mem_region;
+
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -597,6 +601,19 @@ struct node_active_region {
 extern struct page *mem_map;
 #endif
 
+typedef struct mem_region_list_data {
+	struct zone zones[MAX_NR_ZONES];
+	int nr_zones;
+
+	int node;
+	int region;
+
+	unsigned long start_pfn;
+	unsigned long spanned_pages;
+} mem_region_t;
+
+#define MAX_NR_REGIONS 16
+
 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
  * (mostly NUMA machines?) to denote a higher-level memory zone than the
@@ -610,7 +627,10 @@ extern struct page *mem_map;
  */
 struct bootmem_data;
 typedef struct pglist_data {
-	struct zone node_zones[MAX_NR_ZONES];
+/* The linkage to node_zones is now removed. The new hierarchy introduced
+ * is pg_data_t -> mem_region -> zones
+ *	struct zone node_zones[MAX_NR_ZONES];
+ */
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
@@ -632,6 +652,9 @@ typedef struct pglist_data {
 	 */
 	spinlock_t node_size_lock;
 #endif
+	mem_region_t mem_regions[MAX_NR_REGIONS];
+	int nr_mem_regions;
+
 	unsigned long node_start_pfn;
 	unsigned long node_present_pages; /* total number of physical pages */
 	unsigned long node_spanned_pages; /* total size of physical page
Memory region data structure is created under a NUMA node.  Each NUMA
node can have multiple memory regions, depending upon the platform
configuration for power management.  Each memory region contains zones,
which are the entities from which memory is allocated by the buddy
allocator.

                  -------------
                  | pg_data_t |
                  -------------
                    |       |
             -------         -------
             v                     v
     ----------------     ----------------
     | mem_region_t |     | mem_region_t |
     ----------------     ----------------
            |                -------------
            |...........     | zone0 | ....
            v                -------------
     -----------------------------
     | zone0 | zone1 | zone3 | ..|
     -----------------------------

Each memory region contains a zone array for the zones belonging to
that region, in addition to other fields like the node id, the index of
the region in the node, the start pfn of the pages in that region, and
the number of pages spanned by the region.  The zone array inside the
regions is statically allocated at this point.

ToDo: Since the number of regions actually present on the system might
be much smaller than the maximum allowed, dynamic bootmem allocation
could be used to save memory.

Signed-off-by: Ankita Garg <ankita@in.ibm.com>
---
 include/linux/mmzone.h |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)