From patchwork Thu Oct 13 19:31:13 2022
From: Johannes Weiner
To: Andrew Morton
Cc: Rik van Riel, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: vmscan: make rotations a secondary factor in balancing anon vs file
Date: Thu, 13 Oct 2022 15:31:13 -0400
Message-Id: <20221013193113.726425-1-hannes@cmpxchg.org>

We noticed a 2% webserver throughput regression after upgrading from
5.6. This could be tracked down to a shift in the anon/file reclaim
balance (confirmed with swappiness) that resulted in worse reclaim
efficiency and thus more kswapd activity for the same outcome.

The change that exposed the problem is aae466b0052e ("mm/swap:
implement workingset detection for anonymous LRU"). By qualifying
swapins based on their refault distance, it lowered the cost of anon
reclaim in this workload, in turn causing (much) more anon scanning
than before. Scanning the anon list is more expensive due to the
higher ratio of mmapped pages that may rotate during reclaim, and so
the result was an increase in %sys time.

Right now, rotations aren't considered a cost when balancing scan
pressure between LRUs. We can end up with very few file refaults
putting all the scan pressure on hot anon pages that are rotated en
masse, don't get reclaimed, and never push back on the file LRU
again. We still only reclaim file cache in that case, but we burn a
lot of CPU rotating anon pages. It's "fair" from an LRU age POV, but
doesn't reflect the real cost it imposes on the system.

Consider rotations as a secondary factor in balancing the LRUs. This
doesn't attempt to make a precise comparison between IO cost and CPU
cost; it just says: if reloads are about comparable between the
lists, or rotations are overwhelmingly different, adjust for CPU
work.

This fixed the regression on our webservers. It has since been
deployed to the entire Meta fleet and hasn't caused any problems.
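
For illustration only, the weighting described above can be sketched in
userspace C. The helper name lru_cost() is hypothetical and not part of
the patch; SWAP_CLUSTER_MAX is redefined locally on the assumption that
it matches the kernel's value of 32. The sketch covers only the cost
that one event adds, not the decay or the walk up the parent lruvecs.

```c
/* Assumption: mirrors the kernel's SWAP_CLUSTER_MAX; redefined for userspace. */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Hypothetical helper (not in the patch): the cost a single
 * lru_note_cost() call would add to file_cost or anon_cost.
 * Each IO event is weighted SWAP_CLUSTER_MAX times heavier
 * than a single rotation.
 */
static unsigned long lru_cost(unsigned int nr_io, unsigned int nr_rotated)
{
	return nr_io * SWAP_CLUSTER_MAX + nr_rotated;
}

/*
 * Hypothetical helper: the pre-patch behaviour, where only the
 * page count passed by the caller (IO or refaults) was recorded
 * and rotations were ignored.
 */
static unsigned long lru_cost_old(unsigned int nr_pages)
{
	return nr_pages;
}
```

With these made-up numbers, a file list that sees 2 refaults and no
rotations gets lru_cost(2, 0) = 64, while an anon list with no IO but
640 rotations gets lru_cost(0, 640) = 640. Before the change the anon
list would have been charged lru_cost_old(0) = 0, so the rotations
never pushed scan pressure back toward the file list.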

Signed-off-by: Johannes Weiner
---
 include/linux/swap.h |  5 +++--
 mm/swap.c            | 22 +++++++++++++++++-----
 mm/vmscan.c          |  4 +++-
 mm/workingset.c      |  2 +-
 4 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a18cf4b7c724..369d7799205d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -384,8 +384,9 @@ extern unsigned long totalreserve_pages;
 
 
 /* linux/mm/swap.c */
-void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages);
-void lru_note_cost_folio(struct folio *);
+void lru_note_cost(struct lruvec *lruvec, bool file,
+		   unsigned int nr_io, unsigned int nr_rotated);
+void lru_note_cost_refault(struct folio *);
 void folio_add_lru(struct folio *);
 void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
 void lru_cache_add(struct page *);
diff --git a/mm/swap.c b/mm/swap.c
index 955930f41d20..2f12a2ee1d3a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -295,8 +295,20 @@ void folio_rotate_reclaimable(struct folio *folio)
 	}
 }
 
-void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
+void lru_note_cost(struct lruvec *lruvec, bool file,
+		   unsigned int nr_io, unsigned int nr_rotated)
 {
+	unsigned long cost;
+
+	/*
+	 * Reflect the relative cost of incurring IO and spending CPU
+	 * time on rotations. This doesn't attempt to make a precise
+	 * comparison, it just says: if reloads are about comparable
+	 * between the LRU lists, or rotations are overwhelmingly
+	 * different between them, adjust scan balance for CPU work.
+	 */
+	cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;
+
 	do {
 		unsigned long lrusize;
 
@@ -310,9 +322,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
-			lruvec->file_cost += nr_pages;
+			lruvec->file_cost += cost;
 		else
-			lruvec->anon_cost += nr_pages;
+			lruvec->anon_cost += cost;
 
 		/*
 		 * Decay previous events
@@ -335,10 +347,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
-void lru_note_cost_folio(struct folio *folio)
+void lru_note_cost_refault(struct folio *folio)
 {
 	lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio),
-			folio_nr_pages(folio));
+		      folio_nr_pages(folio), 0);
 }
 
 static void folio_activate_fn(struct lruvec *lruvec, struct folio *folio)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..ffe402e095d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2499,7 +2499,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
 	spin_unlock_irq(&lruvec->lru_lock);
 
-	lru_note_cost(lruvec, file, stat.nr_pageout);
+	lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
 	mem_cgroup_uncharge_list(&folio_list);
 	free_unref_page_list(&folio_list);
 
@@ -2639,6 +2639,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&lruvec->lru_lock);
 
+	if (nr_rotated)
+		lru_note_cost(lruvec, file, 0, nr_rotated);
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
 	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
diff --git a/mm/workingset.c b/mm/workingset.c
index ae7e984b23c6..d2d02978588c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -493,7 +493,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	if (workingset) {
 		folio_set_workingset(folio);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
-		lru_note_cost_folio(folio);
+		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
 out: