From patchwork Thu Feb  4 10:10:50 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 12066723
From: Huang Ying <ying.huang@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Huang Ying <ying.huang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Ingo Molnar <mingo@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: [RFC -V5 0/6] autonuma: Optimize memory placement for memory tiering
 system
Date: Thu,  4 Feb 2021 18:10:50 +0800
Message-Id: <20210204101056.89336-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.29.2
List-ID: <linux-mm.kvack.org>

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
memory subsystem of these machines can be called a memory tiering
system, because the performance of the different types of memory is
usually different.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), PMEM can be used as cost-effective
volatile memory in separate NUMA nodes.  In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.

To optimize overall system performance, the hot pages should be
placed in the DRAM node.  To do that, we need to identify the hot
pages in the PMEM node and migrate them to the DRAM node via NUMA
migration.

The original AutoNUMA already has a set of mechanisms to identify the
pages recently accessed by the CPUs in a node and to migrate those
pages to that node.  So we can reuse these mechanisms to optimize the
page placement in the memory tiering system.  This has been
implemented in this patchset.
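
To make the idea concrete, here is a toy user-space model of hint
fault based promotion.  It is only a sketch, not the kernel
implementation: the node ids, the Page class and the scan()/access()
helpers are all hypothetical names invented for illustration, and
marking only the slow node during the scan is a simplification.

```python
from dataclasses import dataclass

DRAM_NODE = 0  # hypothetical node id: CPUs + DRAM
PMEM_NODE = 1  # hypothetical node id: CPU-less PMEM


@dataclass
class Page:
    node: int
    hinted: bool = False  # stands in for the "made inaccessible by the scanner" marker


def scan(pages):
    """Mark every page on the slow node so that its next access is noticed."""
    for page in pages:
        if page.node == PMEM_NODE:
            page.hinted = True


def access(page, cpu_node=DRAM_NODE):
    """Simulate a CPU access; return True if the page was promoted."""
    if page.hinted:
        page.hinted = False  # the hint fault clears the marker
        if page.node != cpu_node:
            page.node = cpu_node  # models NUMA migration to the accessing node
            return True
    return False


pages = [Page(PMEM_NODE) for _ in range(4)]
scan(pages)
# Only pages 0 and 2 are touched after the scan, so only they are promoted.
promoted = [access(pages[i]) for i in (0, 2)]
```

Pages that are never accessed after the scan stay in the PMEM node,
which is the property the promotion path relies on.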

On the other hand, the cold pages should be placed in the PMEM node.
So, we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

In the following patchset,

[RFC][PATCH 00/13] [v5] Migrate Pages in lieu of discard
https://lore.kernel.org/lkml/20210126003411.2AC51464@viggo.jf.intel.com/

a mechanism to demote the cold DRAM pages to the PMEM node under
memory pressure is implemented.  Based on that, the cold DRAM pages
can be demoted to the PMEM node proactively, which frees space on the
DRAM node for the hot PMEM pages to be promoted to.  This has been
implemented in this patchset too.
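
As a rough illustration of proactive demotion, the sketch below keeps
DRAM pages in LRU order and moves the coldest ones to PMEM until a
target number of DRAM pages is free.  This is an assumption-laden toy
model: demote_cold_pages(), capacity and free_target are hypothetical
names, and the real cold page selection is done by the page demotion
patchset referenced above, not by a simple LRU dictionary.

```python
from collections import OrderedDict


def demote_cold_pages(dram_lru, pmem, capacity, free_target):
    """Move least recently used DRAM pages to PMEM until free_target pages are free."""
    demoted = []
    while dram_lru and capacity - len(dram_lru) < free_target:
        page, data = dram_lru.popitem(last=False)  # oldest entry = coldest page
        pmem[page] = data
        demoted.append(page)
    return demoted


# DRAM is full: 8 pages in an 8-page node, oldest first.
dram = OrderedDict((f"page{i}", None) for i in range(8))
dram.move_to_end("page0")  # touching page0 makes it the most recently used

pmem = {}
demoted = demote_cold_pages(dram, pmem, capacity=8, free_target=2)
```

After the call, two DRAM pages are free for promotion, and the
recently touched page0 is still in DRAM.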

The patchset is based on the following not-yet-merged patchset,

[RFC][PATCH 00/13] [v5] Migrate Pages in lieu of discard
https://lore.kernel.org/lkml/20210126003411.2AC51464@viggo.jf.intel.com/

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree from below,

https://github.com/hying-caritas/linux/commits/autonuma-r5

We have tested the solution with the pmbench memory access benchmark,
using an 80:20 read/write ratio and a normal access address
distribution, on a 2-socket Intel server with Optane DC Persistent
Memory Modules.  The test results of the base kernel and the
step-by-step optimizations are as follows,

                Throughput      Promotion      DRAM bandwidth
                  access/s           MB/s                MB/s
               -----------     ----------      --------------
Base            74238178.0                             4291.7
Patch 1        146050652.3          359.4             11248.6
Patch 2        146300787.1          355.2             11237.2
Patch 3        162536383.0          211.7             11890.4
Patch 4        157187775.0          105.9             10412.3
Patch 5        164028415.2           73.3             10810.6
Patch 6        162666229.4           74.6             10715.1

The whole patchset improves the benchmark score by up to 119.1%.  The
basic AutoNUMA based optimization solution (patch 1), the hot page
selection algorithm (patch 3), and the automatic threshold adjustment
algorithm (patch 5) contribute most of the performance improvement or
overhead (promotion MB/s) reduction.
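
For reference, the percentages can be re-derived from the table.  The
short script below only redoes the arithmetic on the reported
numbers; the helper name improvement_pct() is ours and is not part of
the patchset or of pmbench.

```python
# Throughput (access/s) and promotion traffic (MB/s) copied from the table above.
throughput = {
    "Base": 74238178.0,
    "Patch 1": 146050652.3,
    "Patch 2": 146300787.1,
    "Patch 3": 162536383.0,
    "Patch 4": 157187775.0,
    "Patch 5": 164028415.2,
    "Patch 6": 162666229.4,
}
promotion = {
    "Patch 1": 359.4,
    "Patch 2": 355.2,
    "Patch 3": 211.7,
    "Patch 4": 105.9,
    "Patch 5": 73.3,
    "Patch 6": 74.6,
}


def improvement_pct(new, base):
    """Relative change of new over base, in percent."""
    return (new / base - 1.0) * 100.0


base = throughput["Base"]
improvements = {
    name: improvement_pct(value, base)
    for name, value in throughput.items()
    if name != "Base"
}
# Promotion traffic of the full series relative to patch 1 alone (negative = less traffic).
promotion_change = improvement_pct(promotion["Patch 6"], promotion["Patch 1"])

for name, pct in improvements.items():
    print(f"{name}: {pct:+.1f}% throughput vs. base")
print(f"Promotion MB/s, patch 6 vs. patch 1: {promotion_change:+.1f}%")
```

The 119.1% figure corresponds to the "Patch 6" row, i.e. the whole
series applied, compared with the base kernel.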

Changelog:

v5:

- Rebased on the latest page demotion patchset (which is based on v5.10).

v4:

- Rebased on the latest page demotion patchset (which is based on v5.9-rc6).

- Add page promotion counter.

v3:

- Move the rate limit control as late as possible per Mel Gorman's
  comments.

- Revise the hot page selection implementation to store page scan time
  in struct page.

- Code cleanup.

- Rebased on the latest page demotion patchset.

v2:

- Addressed comments for V1.

- Rebased on v5.5.

Huang Ying (6):
  NUMA balancing: optimize page placement for memory tiering system
  memory tiering: skip to scan fast memory
  memory tiering: hot page selection with hint page fault latency
  memory tiering: rate limit NUMA migration throughput
  memory tiering: adjust hot threshold automatically
  memory tiering: add page promotion counter

 include/linux/mm.h           |  29 ++++++++
 include/linux/mmzone.h       |  11 ++++
 include/linux/node.h         |   5 ++
 include/linux/sched/sysctl.h |  12 ++++
 kernel/sched/core.c          |   9 +--
 kernel/sched/fair.c          | 124 +++++++++++++++++++++++++++++++++++
 kernel/sysctl.c              |  22 ++++++-
 mm/huge_memory.c             |  41 ++++++++----
 mm/memory.c                  |  11 +++-
 mm/migrate.c                 |  52 +++++++++++++--
 mm/mmzone.c                  |  17 +++++
 mm/mprotect.c                |  19 +++++-
 mm/vmscan.c                  |  15 +++++
 mm/vmstat.c                  |   4 ++
 14 files changed, 345 insertions(+), 26 deletions(-)

Best Regards,
Huang, Ying