Message ID | 20180216160121.519788537@linux.com (mailing list archive) |
---|---|
State | RFC |
> First performance tests in a virtual environment show
> a hackbench improvement by 6% just by increasing
> the page size used by the page allocator to order 3.

So why is hackbench improving? Is that just for kernel stacks?

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Control over this feature is by writing to /proc/zoneinfo.
>
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
>
> 	echo "2=2000" >/proc/zoneinfo
>
> or through the order=<page spec> on the kernel command line.
> F.e.
>
> 	order=2=2000,4N2=500

Please document the kernel command line option in
Documentation/admin-guide/kernel-parameters.txt.

I suppose that /proc/zoneinfo should be added somewhere in Documentation/vm/
but I'm not sure where that would be.

thanks,
On Fri, 16 Feb 2018, Andi Kleen wrote:

> > First performance tests in a virtual environment show
> > a hackbench improvement by 6% just by increasing
> > the page size used by the page allocator to order 3.
>
> So why is hackbench improving? Is that just for kernel stacks?

Less stack overhead. The larger the page size, the less metadata needs to be
handled. The freelists get larger and the chance of hitting the per-cpu
freelist increases.
On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
>
> This is useful for example for the use of jumbo pages and can
> satisfy various needs of subsystems and device drivers that require
> large contiguous allocation to operate properly.
>
> The idea is to reserve a pool of pages of the required order
> so that the kernel is not allowed to use the pages for allocations
> of a different order. This is a pool that is fully integrated
> into the page allocator and therefore transparently usable.
>
> Control over this feature is by writing to /proc/zoneinfo.
>
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
>
> 	echo "2=2000" >/proc/zoneinfo
>
> or through the order=<page spec> on the kernel command line.
> F.e.
>
> 	order=2=2000,4N2=500
>
> These pages will be subject to reclaim etc as usual but will not
> be broken up.
>
> One can then also f.e. operate the slub allocator with
> 64k pages. Specify "slub_max_order=4 slub_min_order=4" on
> the kernel command line and all slab allocator allocations
> will occur in 64K page sizes.
>
> Note that this will reduce the memory available to the application
> in some cases. Reclaim may occur more often. If more than
> the reserved number of higher order pages are being used then
> allocations will still fail as normal.
>
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes. I like the idea that this only comes into play as the result of explicit
user/sysadmin action. It does remind me of hugetlbfs reservations. So, we hope
that only people who really know their workload and know what they are doing
would use this feature.

> Well that f.e. brings up huge pages. You can of course
> also use this to reserve those and can then be sure that
> you can dynamically resize your huge page pools even after
> a long time of system up time.

Yes, and no. Doesn't that assume nobody else is doing allocations
of that size? For example, I could imagine THP using huge page sized
reservations. Then when it comes time to resize your hugetlbfs pool
there may not be enough. Although, we may quickly split THP pages
in this case. I am not sure.

IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
This would not directly address that. A huge contiguous area (2GB) is
the 'sweet spot' for best performance in his case. However, I think he
could still benefit from using a set of larger (such as 2MB) size
allocations which this scheme could help with.
On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes, but it's a reservation scheme that doesn't show up in MemFree, for
instance. Even hugetlbfs-reserved memory subtracts from that.

This has the potential to be really confusing to apps. If this memory
is now not available to normal apps, they might plow into the invisible
memory limits and get into nasty reclaim scenarios.

Shouldn't this subtract the memory for MemFree and friends?
On Fri, 16 Feb 2018, Mike Kravetz wrote:

> > Well that f.e. brings up huge pages. You can of course
> > also use this to reserve those and can then be sure that
> > you can dynamically resize your huge page pools even after
> > a long time of system up time.
>
> Yes, and no. Doesn't that assume nobody else is doing allocations
> of that size? For example, I could imagine THP using huge page sized
> reservations. Then when it comes time to resize your hugetlbfs pool
> there may not be enough. Although, we may quickly split THP pages
> in this case. I am not sure.

Yup it has a pool for everyone. Question is how to divide the loot ;-)

> IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> This would not directly address that. A huge contiguous area (2GB) is
> the 'sweet spot' for best performance in his case. However, I think he
> could still benefit from using a set of larger (such as 2MB) size
> allocations which this scheme could help with.

MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
a much larger MAX_ORDER size. So does powerpc. And then the reservation
scheme will work.
On Fri, 16 Feb 2018, Dave Hansen wrote:

> On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> > In order to make this work just right one needs to be able to
> > know the workload well enough to reserve the right amount
> > of pages. This is comparable to other reservation schemes.
>
> Yes, but it's a reservation scheme that doesn't show up in MemFree, for
> instance. Even hugetlbfs-reserved memory subtracts from that.

Ok. There is the question of whether we can get all these reservation
schemes under one hood instead of having page order specific ones in
subsystems like hugetlb.

> This has the potential to be really confusing to apps. If this memory
> is now not available to normal apps, they might plow into the invisible
> memory limits and get into nasty reclaim scenarios.

> Shouldn't this subtract the memory for MemFree and friends?

Ok, certainly we could do that. But on the other hand the memory is
available if those subsystems ask for the right order. It's not clear to me
what the right way of handling this is. Right now it adds the reserved
pages to the watermarks. But then under some circumstances the memory is
available. What is the best solution here?
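To put a number on "adds the reserved pages to the watermarks": the patch's change to __setup_per_zone_wmarks() adds `min << order` base pages for every order. The following is a minimal userspace sketch of that sum only, not the kernel code; `SKETCH_MAX_ORDER` and `reserve_pages()` are hypothetical names, and 11 is assumed as the usual MAX_ORDER value.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 11	/* assumption: the common default MAX_ORDER */

/*
 * Base pages that per-order reserves add to a zone's min watermark.
 * order_min[o] is the number of order-o blocks to preserve, so each
 * entry contributes order_min[o] * 2^o base pages.
 */
static unsigned long reserve_pages(const unsigned long *order_min)
{
	unsigned long pages = 0;
	unsigned int order;

	assert(order_min != 0);
	for (order = 0; order < SKETCH_MAX_ORDER; order++)
		pages += order_min[order] << order;
	return pages;
}
```

With only `order_min[2] = 2000` set (the "2=2000" example), the sum is 8000 base pages, roughly 31 MiB with 4K pages. That is memory MemFree would still report as free while order-0 users are kept away from it by the raised watermark.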
On 02/16/2018 12:15 PM, Christopher Lameter wrote:
>> This has the potential to be really confusing to apps. If this memory
>> is now not available to normal apps, they might plow into the invisible
>> memory limits and get into nasty reclaim scenarios.
>> Shouldn't this subtract the memory for MemFree and friends?
> Ok, certainly we could do that. But on the other hand the memory is
> available if those subsystems ask for the right order. It's not clear to me
> what the right way of handling this is. Right now it adds the reserved
> pages to the watermarks. But then under some circumstances the memory is
> available. What is the best solution here?

There's definitely no perfect solution.

But, in general, I think we should cater to the dumbest users. Folks
doing higher-order allocations are not that. I say we make the picture
the most clear for the traditional 4k users.
On Fri, Feb 16, 2018 at 01:08:11PM -0800, Dave Hansen wrote:
> On 02/16/2018 12:15 PM, Christopher Lameter wrote:
> >> This has the potential to be really confusing to apps. If this memory
> >> is now not available to normal apps, they might plow into the invisible
> >> memory limits and get into nasty reclaim scenarios.
> >> Shouldn't this subtract the memory for MemFree and friends?
> > Ok, certainly we could do that. But on the other hand the memory is
> > available if those subsystems ask for the right order. It's not clear to me
> > what the right way of handling this is. Right now it adds the reserved
> > pages to the watermarks. But then under some circumstances the memory is
> > available. What is the best solution here?
>
> There's definitely no perfect solution.
>
> But, in general, I think we should cater to the dumbest users. Folks
> doing higher-order allocations are not that. I say we make the picture
> the most clear for the traditional 4k users.

Your way might be confusing -- if there's a system which is under varying
amounts of jumboframe load and all the 16k pages get gobbled up by the
ethernet driver, MemFree won't change at all, for example.
On 02/16/2018 01:43 PM, Matthew Wilcox wrote:
>> There's definitely no perfect solution.
>>
>> But, in general, I think we should cater to the dumbest users. Folks
>> doing higher-order allocations are not that. I say we make the picture
>> the most clear for the traditional 4k users.
> Your way might be confusing -- if there's a system which is under varying
> amounts of jumboframe load and all the 16k pages get gobbled up by the
> ethernet driver, MemFree won't change at all, for example.

IOW, you agree that "there's definitely no perfect solution." :)
On February 16, 2018 7:02:53 PM GMT+01:00, Randy Dunlap <rdunlap@infradead.org> wrote:
>On 02/16/2018 08:01 AM, Christoph Lameter wrote:
>> Control over this feature is by writing to /proc/zoneinfo.
>>
>> F.e. to ensure that 2000 16K pages stay available for jumbo
>> frames do
>>
>> 	echo "2=2000" >/proc/zoneinfo
>>
>> or through the order=<page spec> on the kernel command line.
>> F.e.
>>
>> 	order=2=2000,4N2=500
>
>
>Please document the kernel command line option in
>Documentation/admin-guide/kernel-parameters.txt.
>
>I suppose that /proc/zoneinfo should be added somewhere in
>Documentation/vm/
>but I'm not sure where that would be.

It's in Documentation/sysctl/vm.txt and in 'man proc' [1]

[1] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man5/proc.5

>thanks,
> > Yup it has a pool for everyone. Question is how to divide the loot ;-)
> >
> > > IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> > > This would not directly address that. A huge contiguous area (2GB) is
> > > the 'sweet spot' for best performance in his case. However, I think he
> > > could still benefit from using a set of larger (such as 2MB) size
> > > allocations which this scheme could help with.
> >
> > MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
> > a much larger MAX_ORDER size. So does powerpc. And then the reservation
> > scheme will work.
>

MAX_ORDER can be increased only if the kernel is recompiled.
It won't work for code running for the general case / typical user.
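For scale, here is a small arithmetic sketch of what the MAX_ORDER limit means. It assumes MAX_ORDER is an exclusive bound, so the largest buddy block has order MAX_ORDER - 1 (the patch's own `order >= MAX_ORDER` check is consistent with that), and it assumes 4K base pages in the examples. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest buddy allocation in bytes: the highest usable order is max_order - 1. */
static uint64_t max_buddy_bytes(unsigned int page_shift, unsigned int max_order)
{
	assert(max_order >= 1 && page_shift + max_order - 1 < 64);
	return (uint64_t)1 << (page_shift + max_order - 1);
}

/* Smallest MAX_ORDER value whose largest block can hold `bytes`. */
static unsigned int max_order_needed(unsigned int page_shift, uint64_t bytes)
{
	unsigned int max_order = 1;

	while (max_buddy_bytes(page_shift, max_order) < bytes)
		max_order++;
	return max_order;
}
```

Under these assumptions a MAX_ORDER of 11 with 4K pages caps a single allocation at 4 MiB, a 2MB block fits comfortably (order 9), and a contiguous 2GB area would need order 19, that is a kernel built with MAX_ORDER of at least 20.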
My skynet.ie/csn.ul.ie address has been defunct for quite some time.
Mail sent to it is not guaranteed to get to me.

On Fri, Feb 16, 2018 at 10:01:11AM -0600, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
>
> <SNIP>
> Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>
>
> First performance tests in a virtual environment show
> a hackbench improvement by 6% just by increasing
> the page size used by the page allocator to order 3.
>

The phrasing here is confusing. hackbench is not very intensive in terms of
memory, it's more fork intensive where I find it extremely unlikely that
it would hit problems with fragmentation unless memory was deliberately
fragmented first. Furthermore, the phrasing implies that the minimum order
used by the page allocator is order 3 which is not what the patch appears
to do.

> Signed-off-by: Christopher Lameter <cl@linux.com>
>
> Index: linux/include/linux/mmzone.h
> ===================================================================
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
>  struct free_area {
>  	struct list_head	free_list[MIGRATE_TYPES];
>  	unsigned long		nr_free;
> +	/* We stop breaking up pages of this order if less than
> +	 * min are available. At that point the pages can only
> +	 * be used for allocations of that particular order.
> +	 */
> +	unsigned long		min;
>  };
>
>  struct pglist_data;
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
>  		area = &(zone->free_area[current_order]);
>  		page = list_first_entry_or_null(&area->free_list[migratetype],
>  							struct page, lru);
> -		if (!page)
> +		/*
> +		 * Continue if no page is found or if our freelist contains
> +		 * less than the minimum pages of that order. In that case
> +		 * we better look for a different order.
> +		 */
> +		if (!page || area->nr_free < area->min)
>  			continue;
>  		list_del(&page->lru);
>  		rmv_page_order(page);

This is surprising to say the least. Assuming reservations are at order-3,
this would refuse to split order-3 even if there were sufficient reserved
pages at higher orders for a reserve. This will cause splits of higher
orders unnecessarily which could cause other fragmentation-related issues
in the future.

This is similar to a memory pool except it's not. There is no concept of a
user of high-order reserves accounting for it. Hence, a user of high-order
pages could allocate the reserve multiple times for long-term purposes
while starving other allocation requests. This could easily happen for slub
with min_order set to the same order as the reserve causing potential OOM
issues. If a pool is to be created, it should be a real pool even if it's
transparently accessed through the page allocator. It should allocate the
requested number of pages and either decide to refill if possible or pass
requests through to the page allocator when the pool is depleted. Also,
as it stands, an OOM due to the reserve would be confusing as there is no
hint the failure may have been due to the reserve.

Access to the pool is unprotected so you might create a reserve for jumbo
frames only to have them consumed by something else entirely. It's not
clear if that is even fixable as GFP flags are too coarse.

It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
sufficient for jumbo frames which are generally expected to be allocated
from atomic context. If there is a problem there then maybe
MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
this. It'll be very difficult, if not impossible, for this to be tuned
properly.

Finally, while I accept that fragmentation over time is a problem for
unmovable allocations (fragmentation protection was originally designed
for THP/hugetlbfs), this is papering over the problem. If greater
protections are needed then the right approach is to be more strict about
fallbacks. Specifically, unmovable allocations should migrate all movable
pages out of migrate_unmovable pageblocks before falling back and that
can be controlled by policy due to the overhead of migration. For atomic
allocations, allow fallback but use kcompactd or a workqueue to migrate
movable pages out of migrate_unmovable pageblocks to limit fallbacks in
the future.

I'm not a fan of this patch.
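The effect of the patched test in __rmqueue_smallest() can be seen in a toy userspace model. This is only a sketch under simplifying assumptions: free lists are reduced to per-order counters, migrate types are ignored, and `sim_area`, `SIM_MAX_ORDER` and `sim_rmqueue_smallest()` are hypothetical names, not kernel symbols.

```c
#include <assert.h>

#define SIM_MAX_ORDER 11	/* hypothetical, mirrors the usual MAX_ORDER */

struct sim_area {
	unsigned long nr_free;	/* free blocks of this order */
	unsigned long min;	/* reserve threshold added by the patch */
};

/*
 * Take one block of `order`, scanning upward as __rmqueue_smallest() does.
 * Returns the order the block was taken from, or -1 if nothing was usable.
 */
static int sim_rmqueue_smallest(struct sim_area *area, unsigned int order)
{
	unsigned int cur, o;

	assert(order < SIM_MAX_ORDER);
	for (cur = order; cur < SIM_MAX_ORDER; cur++) {
		/* The patched test: empty list OR below the reserve. */
		if (area[cur].nr_free == 0 || area[cur].nr_free < area[cur].min)
			continue;
		area[cur].nr_free--;
		/* Splitting leaves one free buddy at each lower order. */
		for (o = cur; o > order; o--)
			area[o - 1].nr_free++;
		return (int)cur;
	}
	return -1;
}
```

With 5 free order-3 blocks, a reserve of 10 at order 3 and a single free order-4 block, an order-0 request skips order 3 and splits the order-4 block instead, which is the extra higher-order splitting described above. In this model a later order-3 request then fails as well, because the test as written in the hunk does not distinguish the requested order from the order being scanned.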
On Mon 19-02-18 10:19:35, Mel Gorman wrote:
[...]
> Access to the pool is unprotected so you might create a reserve for jumbo
> frames only to have them consumed by something else entirely. It's not
> clear if that is even fixable as GFP flags are too coarse.
>
> It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
> sufficient for jumbo frames which are generally expected to be allocated
> from atomic context. If there is a problem there then maybe
> MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
> this. It'll be very difficult, if not impossible, for this to be tuned
> properly.
>
> Finally, while I accept that fragmentation over time is a problem for
> unmovable allocations (fragmentation protection was originally designed
> for THP/hugetlbfs), this is papering over the problem. If greater
> protections are needed then the right approach is to be more strict about
> fallbacks. Specifically, unmovable allocations should migrate all movable
> pages out of migrate_unmovable pageblocks before falling back and that
> can be controlled by policy due to the overhead of migration. For atomic
> allocations, allow fallback but use kcompactd or a workqueue to migrate
> movable pages out of migrate_unmovable pageblocks to limit fallbacks in
> the future.

Completely agreed!

> I'm not a fan of this patch.

Yes, I think the approach is just wrong. It will just hit all sorts of
weird corner cases and won't work reliably for those who care.
On Mon, 19 Feb 2018, Mel Gorman wrote:

> The phrasing here is confusing. hackbench is not very intensive in terms of
> memory, it's more fork intensive where I find it extremely unlikely that
> it would hit problems with fragmentation unless memory was deliberately
> fragmented first. Furthermore, the phrasing implies that the minimum order
> used by the page allocator is order 3 which is not what the patch appears
> to do.

It was used to illustrate the performance gain.

> > -		if (!page)
> > +		/*
> > +		 * Continue if no page is found or if our freelist contains
> > +		 * less than the minimum pages of that order. In that case
> > +		 * we better look for a different order.
> > +		 */
> > +		if (!page || area->nr_free < area->min)
> >  			continue;
> >  		list_del(&page->lru);
> >  		rmv_page_order(page);
>
> This is surprising to say the least. Assuming reservations are at order-3,
> this would refuse to split order-3 even if there were sufficient reserved
> pages at higher orders for a reserve. This will cause splits of higher
> orders unnecessarily which could cause other fragmentation-related issues
> in the future.

Well that is intended. We want to preserve a number of pages at a certain
order. If there are higher order pages available then those can be split
and the allocation will succeed while preserving the minimum number of
pages at the reserved order.

> This is similar to a memory pool except it's not. There is no concept of a
> user of high-order reserves accounting for it. Hence, a user of high-order
> pages could allocate the reserve multiple times for long-term purposes
> while starving other allocation requests. This could easily happen for slub
> with min_order set to the same order as the reserve causing potential OOM
> issues. If a pool is to be created, it should be a real pool even if it's
> transparently accessed through the page allocator. It should allocate the
> requested number of pages and either decide to refill if possible or pass
> requests through to the page allocator when the pool is depleted. Also,
> as it stands, an OOM due to the reserve would be confusing as there is no
> hint the failure may have been due to the reserve.

Ok, we can add the ->min values to the OOM report.

This is a crude approach, I agree, and it does require knowledge of the load
and user patterns. However, what other approach is there to allow the
system to sustain higher order allocations if those are needed? This is an
issue for which no satisfactory solution is present. So a measure like
this would allow a limited use in some situations.

> Access to the pool is unprotected so you might create a reserve for jumbo
> frames only to have them consumed by something else entirely. It's not
> clear if that is even fixable as GFP flags are too coarse.

If it's consumed by something else then the parameters or the jumbo frame
setting may be adjusted. This feature is off by default so it's only used
for tuning purposes.

> It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
> sufficient for jumbo frames which are generally expected to be allocated
> from atomic context. If there is a problem there then maybe
> MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
> this. It'll be very difficult, if not impossible, for this to be tuned
> properly.

This approach has been in use for a decade or so as mentioned in the patch
description. So please be careful with impossibility claims. This enables
handling of larger contiguous blocks of memory that are required in some
circumstances and it has been doing that successfully (although with some
tuning effort).

> Finally, while I accept that fragmentation over time is a problem for
> unmovable allocations (fragmentation protection was originally designed
> for THP/hugetlbfs), this is papering over the problem. If greater
> protections are needed then the right approach is to be more strict about
> fallbacks. Specifically, unmovable allocations should migrate all movable
> pages out of migrate_unmovable pageblocks before falling back and that
> can be controlled by policy due to the overhead of migration. For atomic
> allocations, allow fallback but use kcompactd or a workqueue to migrate
> movable pages out of migrate_unmovable pageblocks to limit fallbacks in
> the future.

This is also papering over more issues. While these measures may delay
fragmentation a bit more they will not result in a pool of large
pages being available for the system throughout the lifetime of it.

> I'm not a fan of this patch.

I am also not a fan of this patch but this is enabling something that we
wanted for a long time: a consistent ability, in a limited way, to allocate
large page orders.

Since we have failed to address this in another way this may be the best ad
hoc method to get there. What we have done to address fragmentation so
far are all these preventative measures that get more ineffective as time
progresses while memory sizes increase. Either we do this or we need to
actually do one of the other known measures to address fragmentation like
making inode/dentries movable.
Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	/* We stop breaking up pages of this order if less than
+	 * min are available. At that point the pages can only
+	 * be used for allocations of that particular order.
+	 */
+	unsigned long		min;
 };
 
 struct pglist_data;
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
-		if (!page)
+		/*
+		 * Continue if no page is found or if our freelist contains
+		 * less than the minimum pages of that order. In that case
+		 * we better look for a different order.
+		 */
+		if (!page || area->nr_free < area->min)
 			continue;
 		list_del(&page->lru);
 		rmv_page_order(page);
@@ -5190,6 +5195,57 @@ static void build_zonelists(pg_data_t *p
 
 #endif	/* CONFIG_NUMA */
 
+int set_page_order_min(int node, int order, unsigned min)
+{
+	int i, o;
+	long min_pages = 0;	/* Pages already reserved */
+	long managed_pages = 0;	/* Pages managed on the node */
+	struct zone *last;
+	unsigned remaining;
+
+	/*
+	 * Determine already reserved memory for orders
+	 * plus the total of the pages on the node
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			for (o = 0; o < MAX_ORDER; o++) {
+				if (o != order)
+					min_pages += z->free_area[o].min << o;
+
+			}
+			managed_pages += z->managed_pages;
+		}
+	}
+
+	if (min_pages + (min << order) > managed_pages / 2)
+		return -ENOMEM;
+
+	/* Set the min values for all zones on the node */
+	remaining = min;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			u64 tmp;
+
+			tmp = (u64)z->managed_pages * (min << order);
+			do_div(tmp, managed_pages);
+			tmp >>= order;
+			z->free_area[order].min = tmp;
+
+			last = z;
+			remaining -= tmp;
+		}
+	}
+
+	/* Deal with rounding errors */
+	if (remaining)
+		last->free_area[order].min += remaining;
+
+	return 0;
+}
+
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -5424,6 +5480,7 @@ static void __meminit zone_init_free_lis
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].min = 0;
 	}
 }
 
@@ -6998,6 +7055,7 @@ static void __setup_per_zone_wmarks(void
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
+	int order;
 	unsigned long flags;
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
@@ -7012,6 +7070,10 @@ static void __setup_per_zone_wmarks(void
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
+
+		for (order = 0; order < MAX_ORDER; order++)
+			tmp += zone->free_area[order].min << order;
+
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -27,6 +27,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/ctype.h>
 
 #include "internal.h"
 
@@ -1614,6 +1615,11 @@ static void zoneinfo_show_print(struct s
 				     zone_numa_state_snapshot(zone, i));
 #endif
 
+	for (i = 0; i < MAX_ORDER; i++)
+		if (zone->free_area[i].min)
+			seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.",
+				zone->free_area[i].min, i);
+
 	seq_printf(m, "\n pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1641,6 +1647,122 @@ static void zoneinfo_show_print(struct s
 	seq_putc(m, '\n');
 }
 
+static int __order_protect(char *p)
+{
+	char c;
+
+	do {
+		int order = 0;
+		int pages = 0;
+		int node = 0;
+		int rc;
+
+		/* Syntax <order>[N<node>]=number */
+		if (!isdigit(*p))
+			return -EFAULT;
+
+		while (true) {
+			c = *p++;
+
+			if (!isdigit(c))
+				break;
+
+			order = order * 10 + c - '0';
+		}
+
+		/* Check for optional node specification */
+		if (c == 'N') {
+			if (!isdigit(*p))
+				return -EFAULT;
+
+			while (true) {
+				c = *p++;
+				if (!isdigit(c))
+					break;
+				node = node * 10 + c - '0';
+			}
+		}
+
+		if (c != '=')
+			return -EINVAL;
+
+		if (!isdigit(*p))
+			return -EINVAL;
+
+		while (true) {
+			c = *p++;
+			if (!isdigit(c))
+				break;
+			pages = pages * 10 + c - '0';
+		}
+
+		if (order == 0 || order >= MAX_ORDER)
+			return -EINVAL;
+
+		if (!node_online(node))
+			return -ENOSYS;
+
+		rc = set_page_order_min(node, order, pages);
+		if (rc)
+			return rc;
+
+	} while (c == ',');
+
+	if (c)
+		return -EINVAL;
+
+	setup_per_zone_wmarks();
+
+	return 0;
+}
+
+/*
+ * Writing to /proc/zoneinfo allows to setup the large page breakup
+ * protection.
+ *
+ * Syntax:
+ *	<order>[N<node>]=<number>{,<order>[N<node>]=<number>}
+ *
+ * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of
+ * order 4 (64K) on node 1
+ *
+ *	echo "2=500,4N1=300" >/proc/zoneinfo
+ *
+ */
+static ssize_t zoneinfo_write(struct file *file, const char __user *buffer,
+			size_t count, loff_t *ppos)
+{
+	char zinfo[200];
+	int rc;
+
+	if (count > sizeof(zinfo))
+		return -EINVAL;
+
+	if (copy_from_user(zinfo, buffer, count))
+		return -EFAULT;
+
+	zinfo[count - 1] = 0;
+
+	rc = __order_protect(zinfo);
+
+	if (rc)
+		return rc;
+
+	return count;
+}
+
+static int order_protect(char *s)
+{
+	int rc;
+
+	rc = __order_protect(s);
+	if (rc)
+		printk("Invalid order=%s rc=%d\n",s, rc);
+
+	return 1;
+}
+__setup("order=", order_protect);
+
 /*
  * Output information about zones in @pgdat. All zones are printed regardless
  * of whether they are populated or not: lowmem_reserve_ratio operates on the
@@ -1672,6 +1794,7 @@ static const struct file_operations zone
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.write		= zoneinfo_write,
 };
 
 enum writeback_stat_item {
@@ -2016,7 +2139,7 @@ void __init init_mm_internals(void)
 	proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations);
 	proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations);
 	proc_create("vmstat", 0444, NULL, &vmstat_file_operations);
-	proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations);
+	proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations);
 #endif
 }
 
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -543,6 +543,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+int set_page_order_min(int node, int order, unsigned min);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
Over time as the kernel is churning through memory it will break
up larger pages and as time progresses larger contiguous allocations
will no longer be possible. This is an approach to preserve these
large pages and prevent them from being broken up.

This is useful for example for the use of jumbo pages and can
satisfy various needs of subsystems and device drivers that require
large contiguous allocation to operate properly.

The idea is to reserve a pool of pages of the required order
so that the kernel is not allowed to use the pages for allocations
of a different order. This is a pool that is fully integrated
into the page allocator and therefore transparently usable.

Control over this feature is by writing to /proc/zoneinfo.

F.e. to ensure that 2000 16K pages stay available for jumbo
frames do

	echo "2=2000" >/proc/zoneinfo

or through the order=<page spec> on the kernel command line.
F.e.

	order=2=2000,4N2=500

These pages will be subject to reclaim etc as usual but will not
be broken up.

One can then also f.e. operate the slub allocator with
64k pages. Specify "slub_max_order=4 slub_min_order=4" on
the kernel command line and all slab allocator allocations
will occur in 64K page sizes.

Note that this will reduce the memory available to the application
in some cases. Reclaim may occur more often. If more than
the reserved number of higher order pages are being used then
allocations will still fail as normal.

In order to make this work just right one needs to be able to
know the workload well enough to reserve the right amount
of pages. This is comparable to other reservation schemes.

Well that f.e. brings up huge pages. You can of course
also use this to reserve those and can then be sure that
you can dynamically resize your huge page pools even after
a long time of system up time.

The idea for this patch came from Thomas Schoebel-Theuer whom I met
at the LCA and who described the approach to me promising
a patch that would do this. Sadly he has vanished somehow.
However, he has been using this approach to support a
production environment for numerous years.

So I redid his patch and this is the first draft of it.

Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>

First performance tests in a virtual environment show
a hackbench improvement by 6% just by increasing
the page size used by the page allocator to order 3.

Signed-off-by: Christopher Lameter <cl@linux.com>
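The `<order>[N<node>]=<number>` syntax used for both /proc/zoneinfo and the order= parameter can be tried out in userspace with a small parser. This is a sketch only, written with strtoul() rather than the hand-rolled digit loops of the patch, and it does none of the range or node checks that __order_protect() performs. `struct order_spec` and `parse_order_spec()` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

struct order_spec {
	unsigned long order;	/* page order to protect */
	unsigned long node;	/* NUMA node, 0 when no N<node> is given */
	unsigned long count;	/* number of blocks to preserve */
};

/*
 * Parse one "<order>[N<node>]=<number>" entry starting at *p.
 * On success fills *spec, advances *p past the entry and returns 0.
 * Returns -1 on a syntax error and leaves *p untouched.
 */
static int parse_order_spec(const char **p, struct order_spec *spec)
{
	char *end;

	assert(p && *p && spec);
	if (!isdigit((unsigned char)**p))
		return -1;
	spec->order = strtoul(*p, &end, 10);
	spec->node = 0;
	if (*end == 'N') {
		if (!isdigit((unsigned char)end[1]))
			return -1;
		spec->node = strtoul(end + 1, &end, 10);
	}
	if (*end != '=' || !isdigit((unsigned char)end[1]))
		return -1;
	spec->count = strtoul(end + 1, &end, 10);
	*p = end;
	return 0;
}
```

Applied to "2=2000,4N2=500", the first call yields order 2 on node 0 with 2000 blocks and stops at the comma; after skipping the comma, the second call yields order 4 on node 2 with 500 blocks. With 4K base pages those reserves amount to 2000 x 16K and 500 x 64K respectively.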