From patchwork Fri Feb 23 13:26:01 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Christoph Lameter (Ampere)"
X-Patchwork-Id: 10237677
Message-Id: <20180223132614.632687479@linux.com>
User-Agent: quilt/0.63-1
Date: Fri, 23 Feb 2018
 07:26:01 -0600
From: cl@linux.com
From: Christoph Lameter
To: Mel Gorman
Cc: Matthew Wilcox
Cc: linux-mm@kvack.org
Cc: linux-rdma@vger.kernel.org
CC: akpm@linux-foundation.org
Cc: Thomas Schoebel-Theuer
Cc: andi@firstfloor.org
Cc: Michal Hocko
Cc: Guy Shattah
Cc: Mike Kravetz
Cc: Zi Yan
Subject: [PATCH 1/2] Protect larger order pages from breaking up
References: <20180223132600.174455480@linux.com>
Content-Disposition: inline; filename=limit_order
Sender: linux-rdma-owner@vger.kernel.org
X-Mailing-List: linux-rdma@vger.kernel.org

rfc->v1
 - Use Thomas' suggestion to change the test in __rmqueue_smallest

Over time, as the kernel churns through memory, it breaks up larger
pages, and eventually larger contiguous allocations are no longer
possible. This is an approach to preserve these large pages and prevent
them from being broken up.

This is useful, for example, for the use of jumbo pages and can satisfy
various needs of subsystems and device drivers that require large
contiguous allocations to operate properly.

The idea is to reserve a pool of pages of the required order so that
the kernel is not allowed to use the pages for allocations of a
different order. This is a pool that is fully integrated into the page
allocator and therefore transparently usable.

Control over this feature is by writing to /proc/zoneinfo. F.e.
to ensure that 2000 32K pages (order 3 with 4K base pages) stay
available for jumbo frames do

	echo "3=2000" >/proc/zoneinfo

or through the order= parameter on the kernel command line. F.e.

	order=3=2000,4N2=500

These pages will be subject to reclaim etc. as usual but will not be
broken up.

One can then also, f.e., operate the slub allocator with 32K pages.
Specify "slub_max_order=3 slub_min_order=3" on the kernel command line
and all slab allocator allocations will occur in 32K page sizes.

Note that this will reduce the memory available to the application in
some cases. Reclaim may occur more often. If more than the reserved
number of higher order pages are being used then allocations will still
fail as normal.

To make this work just right one needs to know the workload well enough
to reserve the right number of pages. This is comparable to other
reservation schemes.

That, f.e., brings up huge pages. You can of course also use this to
reserve those and can then be sure that you can dynamically resize your
huge page pools even after a long system uptime.

The idea for this patch came from Thomas Schoebel-Theuer, whom I met at
the LCA and who described the approach to me, promising a patch that
would do this. Sadly he has vanished somehow. However, he has been
using this approach to support a production environment for numerous
years.

So I redid his patch and this is the first draft of it.

Idea-by: Thomas Schoebel-Theuer

First performance tests in a virtual environment show a hackbench
improvement of 6% just by increasing the page size used by the page
allocator.
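
As an illustration only (not part of the patch), the allocation rule
described above can be modeled in userspace C. This is a minimal sketch
under stated assumptions: model_free_area, model_pick_order and
MODEL_MAX_ORDER are hypothetical names, the model keeps only per-order
counters instead of the kernel's per-migratetype free lists, and it
merely reports which order a request would be served from rather than
splitting any block.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's MAX_ORDER (assumption). */
#define MODEL_MAX_ORDER 11

/* Simplified stand-in for struct free_area: counters only, no lists. */
struct model_free_area {
	unsigned long nr_free;	/* free blocks of this order */
	unsigned long min;	/* blocks protected from being broken up */
};

/*
 * Return the order that a request of 'order' would be served from,
 * or -1 if no order qualifies. Follows the test the patch adds to
 * __rmqueue_smallest: a larger order is skipped while it holds fewer
 * than 'min' free blocks, but a request of exactly that order may
 * still take from it.
 */
static int model_pick_order(const struct model_free_area *area, int order)
{
	int current_order;

	assert(order >= 0 && order < MODEL_MAX_ORDER);

	for (current_order = order; current_order < MODEL_MAX_ORDER;
	     current_order++) {
		const struct model_free_area *a = &area[current_order];

		/* Nothing free at this order (the kernel checks the list). */
		if (a->nr_free == 0)
			continue;

		/* Protected: do not break up a larger, reserved order. */
		if (current_order > order && a->nr_free < a->min)
			continue;

		return current_order;
	}

	return -1;
}
```

In this model, with 100 free order-3 blocks, a min of 2000 for order 3
and some free order-4 blocks, an order-0 request skips order 3 and is
served from order 4, while an order-3 request is still served from
order 3.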

Signed-off-by: Christoph Lameter

---
Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	/* We stop breaking up pages of this order if less than
+	 * min are available. At that point the pages can only
+	 * be used for allocations of that particular order.
+	 */
+	unsigned long		min;
 };
 
 struct pglist_data;
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1848,8 +1848,15 @@ struct page *__rmqueue_smallest(struct z
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
-		if (!page)
+		/*
+		 * Continue if no page is found or if our freelist contains
+		 * less than the minimum pages of that order. In that case
+		 * we better look for a different order.
+		 */
+		if (!page || (area->nr_free < area->min
+					&& current_order > order))
 			continue;
+
 		list_del(&page->lru);
 		rmv_page_order(page);
 		area->nr_free--;
@@ -5194,6 +5201,57 @@ static void build_zonelists(pg_data_t *p
 
 #endif	/* CONFIG_NUMA */
 
+int set_page_order_min(int node, int order, unsigned min)
+{
+	int i, o;
+	long min_pages = 0;	/* Pages already reserved */
+	long managed_pages = 0; /* Pages managed on the node */
+	struct zone *last = NULL;
+	unsigned remaining;
+
+	/*
+	 * Determine already reserved memory for orders
+	 * plus the total of the pages on the node
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			for (o = 0; o < MAX_ORDER; o++) {
+				if (o != order)
+					min_pages += z->free_area[o].min << o;
+
+			}
+			managed_pages += z->managed_pages;
+		}
+	}
+
+	if (min_pages + (min << order) > managed_pages / 2)
+		return -ENOMEM;
+
+	/* Set the min values for all zones on the node */
+	remaining = min;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			u64 tmp;
+
+			tmp = (u64)z->managed_pages * (min << order);
+			do_div(tmp, managed_pages);
+			tmp >>= order;
+			z->free_area[order].min = tmp;
+
+			last = z;
+			remaining -= tmp;
+		}
+	}
+
+	/* Deal with rounding errors */
+	if (remaining && last)
+		last->free_area[order].min += remaining;
+
+	return 0;
+}
+
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -5428,6 +5486,7 @@ static void __meminit zone_init_free_lis
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].min = 0;
 	}
 }
 
@@ -7002,6 +7061,7 @@ static void __setup_per_zone_wmarks(void
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
+	int order;
 	unsigned long flags;
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
@@ -7016,6 +7076,10 @@ static void __setup_per_zone_wmarks(void
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
+
+		for (order = 0; order < MAX_ORDER; order++)
+			tmp += zone->free_area[order].min << order;
+
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 
 #include "internal.h"
 
@@ -1614,6 +1615,11 @@ static void zoneinfo_show_print(struct s
 				zone_numa_state_snapshot(zone, i));
 #endif
 
+	for (i = 0; i < MAX_ORDER; i++)
+		if (zone->free_area[i].min)
+			seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.",
+				zone->free_area[i].min, i);
+
 	seq_printf(m, "\n pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1641,6 +1647,122 @@ static void zoneinfo_show_print(struct s
 	seq_putc(m, '\n');
 }
 
+static int __order_protect(char *p)
+{
+	char c;
+
+	do {
+		int order = 0;
+		int pages = 0;
+		int node = 0;
+		int rc;
+
+		/* Syntax [N]=number */
+		if (!isdigit(*p))
+			return -EFAULT;
+
+		while (true) {
+			c = *p++;
+
+			if (!isdigit(c))
+				break;
+
+			order = order * 10 + c - '0';
+		}
+
+		/* Check for optional node specification */
+		if (c == 'N') {
+			if (!isdigit(*p))
+				return -EFAULT;
+
+			while (true) {
+				c = *p++;
+				if (!isdigit(c))
+					break;
+				node = node * 10 + c - '0';
+			}
+		}
+
+		if (c != '=')
+			return -EINVAL;
+
+		if (!isdigit(*p))
+			return -EINVAL;
+
+		while (true) {
+			c = *p++;
+			if (!isdigit(c))
+				break;
+			pages = pages * 10 + c - '0';
+		}
+
+		if (order == 0 || order >= MAX_ORDER)
+			return -EINVAL;
+
+		if (!node_online(node))
+			return -ENOSYS;
+
+		rc = set_page_order_min(node, order, pages);
+		if (rc)
+			return rc;
+
+	} while (c == ',');
+
+	if (c)
+		return -EINVAL;
+
+	setup_per_zone_wmarks();
+
+	return 0;
+}
+
+/*
+ * Writing to /proc/zoneinfo allows to setup the large page breakup
+ * protection.
+ *
+ * Syntax:
+ *	[N]={,[N]=}
+ *
+ * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of
+ * order 4 (64K) on node 1
+ *
+ *	echo "2=500,4N1=300" >/proc/zoneinfo
+ *
+ */
+static ssize_t zoneinfo_write(struct file *file, const char __user *buffer,
+			size_t count, loff_t *ppos)
+{
+	char zinfo[200];
+	int rc;
+
+	if (count > sizeof(zinfo))
+		return -EINVAL;
+
+	if (copy_from_user(zinfo, buffer, count))
+		return -EFAULT;
+
+	zinfo[count - 1] = 0;
+
+	rc = __order_protect(zinfo);
+
+	if (rc)
+		return rc;
+
+	return count;
+}
+
+static int order_protect(char *s)
+{
+	int rc;
+
+	rc = __order_protect(s);
+	if (rc)
+		printk("Invalid order=%s rc=%d\n",s, rc);
+
+	return 1;
+}
+__setup("order=", order_protect);
+
 /*
  * Output information about zones in @pgdat. All zones are printed regardless
  * of whether they are populated or not: lowmem_reserve_ratio operates on the
@@ -1672,6 +1794,7 @@ static const struct file_operations zone
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.write		= zoneinfo_write,
 };
 
 enum writeback_stat_item {
@@ -2016,7 +2139,7 @@ void __init init_mm_internals(void)
 	proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations);
 	proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations);
 	proc_create("vmstat", 0444, NULL, &vmstat_file_operations);
-	proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations);
+	proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations);
 #endif
 }
 
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -543,6 +543,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+int set_page_order_min(int node, int order, unsigned min);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what