From patchwork Wed Sep 20 06:18:53 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392114
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
 Mel Gorman, Michal Hocko, Andrew Morton, Vlastimil Babka,
 David Hildenbrand, Johannes Weiner, Dave Hansen, Pavel Tatashin,
 Matthew Wilcox, Christoph Lameter
Subject: [PATCH 07/10] mm: tune PCP high automatically
Date: Wed, 20 Sep 2023 14:18:53 +0800
Message-Id: <20230920061856.257597-8-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

The goals of tuning PCP high automatically are as follows:

- Minimize allocation/freeing from/to the shared zone

- Minimize idle pages in PCP

- Minimize pages in PCP if the system has too few free pages

To reach these goals, the following tuning algorithm is designed:

- When we refill PCP via allocating from the zone, increase PCP high,
  because with a larger PCP we could have avoided allocating from the
  zone.

- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
  decrease PCP high to try to free possibly idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
  high in the allocating path, and decrease PCP high and free some
  pages in the freeing path.

So, the PCP high can eventually be tuned to the page
allocating/freeing depth of workloads.

One issue of the algorithm is that if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become
the maximal value even if the allocating/freeing depth is small.  But
this isn't a severe issue, because there are no idle pages in this
case.

One alternative choice is to increase PCP high when we drain PCP via
trying to free pages to the zone, but not to increase PCP high during
PCP refilling.  This can avoid the issue above.
But if the number of pages allocated is much less than that of pages
freed on a CPU, there will be many idle pages in PCP and it may be
hard to free these idle pages.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the build time
decreases by 10.1%.  The cycles% of the spinlock contention (mostly
for the zone lock) decreases from 37.9% to 9.8% (with PCP size == 361).
The number of PCP drainings for high-order page freeing (free_high)
decreases by 53.4%.  The number of pages allocated from the zone
(instead of from PCP) decreases by 77.3%.

Signed-off-by: "Huang, Ying"
Suggested-by: Mel Gorman
Suggested-by: Michal Hocko
Cc: Andrew Morton
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 118 ++++++++++++++++++++++++++++++++++----------
 mm/vmstat.c         |   8 +--
 3 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665edc11fb9f..5b917e5b9350 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,7 @@ extern void page_frag_free(void *addr);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38bfab562b44..225abe56752c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2160,6 +2160,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	return i;
 }
 
+/*
+ * Called from the vmstat counter updater to decay the PCP high.
+ * Return whether there is additional work to do.
+ */
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	int high_min, to_drain, batch;
+	int todo = 0;
+
+	high_min = READ_ONCE(pcp->high_min);
+	batch = READ_ONCE(pcp->batch);
+	/*
+	 * Decrease pcp->high periodically to try to free possibly
+	 * idle PCP pages.  Avoid freeing too many pages at once, to
+	 * control latency.
+	 */
+	if (pcp->high > high_min) {
+		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
+				 pcp->high * 4 / 5, high_min);
+		if (pcp->high > high_min)
+			todo++;
+	}
+
+	to_drain = pcp->count - pcp->high;
+	if (to_drain > 0) {
+		spin_lock(&pcp->lock);
+		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		spin_unlock(&pcp->lock);
+		todo++;
+	}
+
+	return todo;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the vmstat counter updater to drain pagesets of this
@@ -2321,14 +2355,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
 	return true;
 }
 
-static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
+static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
 {
 	int min_nr_free, max_nr_free;
-	int batch = READ_ONCE(pcp->batch);
 
-	/* Free everything if batch freeing high-order pages. */
+	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return pcp->count;
+		return min(pcp->count, batch << PCP_BATCH_SCALE_MAX);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2343,7 +2376,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
+	if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
 
@@ -2351,28 +2384,47 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 }
 
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
-		       bool free_high)
+		       int batch, bool free_high)
 {
-	int high = READ_ONCE(pcp->high_min);
+	int high, high_min, high_max;
 
-	if (unlikely(!high || free_high))
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
+
+	if (unlikely(!high))
 		return 0;
 
-	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
-		return high;
+	if (unlikely(free_high)) {
+		pcp->high = max(high - (batch << PCP_BATCH_SCALE_MAX), high_min);
+		return 0;
+	}
 
 	/*
 	 * If reclaim is active, limit the number of pages that can be
 	 * stored on pcp lists
 	 */
-	return min(READ_ONCE(pcp->batch) << 2, high);
+	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
+		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		return min(batch << 2, pcp->high);
+	}
+
+	if (pcp->count >= high && high_min != high_max) {
+		int need_high = (batch << pcp->free_factor) + batch;
+
+		/* pcp->high should be large enough to hold batch freed pages */
+		if (pcp->high < need_high)
+			pcp->high = clamp(need_high, high_min, high_max);
+	}
+
+	return high;
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
 				   unsigned int order)
 {
-	int high;
+	int high, batch;
 	int pindex;
 	bool free_high = false;
 
@@ -2387,6 +2439,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
+	batch = READ_ONCE(pcp->batch);
 	/*
 	 * As high-order pages other than THP's stored on PCP can contribute
 	 * to fragmentation, limit the number stored when PCP is heavily
@@ -2397,14 +2450,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 		free_high = (pcp->free_factor &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
-			      pcp->count >= READ_ONCE(pcp->batch)));
+			      pcp->count >= READ_ONCE(batch)));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	high = nr_pcp_high(pcp, zone, free_high);
+	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
+		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+				   pcp, pindex);
 	}
 }
 
@@ -2688,24 +2742,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
-static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 {
-	int high, batch, max_nr_alloc;
+	int high, base_batch, batch, max_nr_alloc;
+	int high_max, high_min;
 
-	high = READ_ONCE(pcp->high_min);
-	batch = READ_ONCE(pcp->batch);
+	base_batch = READ_ONCE(pcp->batch);
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
 
 	/* Check for PCP disabled or boot pageset */
-	if (unlikely(high < batch))
+	if (unlikely(high < base_batch))
 		return 1;
 
+	if (order)
+		batch = base_batch;
+	else
+		batch = (base_batch << pcp->alloc_factor);
+
 	/*
-	 * Double the number of pages allocated each time there is subsequent
-	 * refiling of order-0 pages without drain.
+	 * If we had a larger pcp->high, we could avoid allocating
+	 * from the zone.
 	 */
+	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+		high = pcp->high = min(high + batch, high_max);
+
 	if (!order) {
-		max_nr_alloc = max(high - pcp->count - batch, batch);
-		batch <<= pcp->alloc_factor;
+		max_nr_alloc = max(high - pcp->count - base_batch, base_batch);
+		/*
+		 * Double the number of pages allocated each time there is
+		 * subsequent refilling of order-0 pages without drain.
+		 */
 		if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
 			pcp->alloc_factor++;
 		batch = min(batch, max_nr_alloc);
@@ -2735,7 +2803,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = nr_pcp_alloc(pcp, order);
+			int batch = nr_pcp_alloc(pcp, zone, order);
 			int alloced;
 
 			alloced = rmqueue_bulk(zone, order,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..2f716ad14168 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
 		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 #endif
 			}
 		}
-#ifdef CONFIG_NUMA
 
 		if (do_pagesets) {
 			cond_resched();
+
+			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+#ifdef CONFIG_NUMA
 			/*
 			 * Deal with draining the remote pageset of this
 			 * processor
@@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
-		}
 #endif
+		}
 	}
 
 	for_each_online_pgdat(pgdat) {