From patchwork Wed Jun 22 08:35:16 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 12890308
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Huang Ying <ying.huang@intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>,
	Rik van Riel <riel@surriel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	Wei Xu <weixugc@google.com>,
	osalvador <osalvador@suse.de>,
	Shakeel Butt <shakeelb@google.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Subject: [PATCH -V4 0/3] memory tiering: hot page selection
Date: Wed, 22 Jun 2022 16:35:16 +0800
Message-Id: <20220622083519.708236-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.30.2
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory nodes need to be
identified.  Essentially, the original NUMA balancing implementation
selects the most recently accessed (MRU) pages to promote.  But this
isn't a perfect algorithm for identifying hot pages, because even
pages with quite low access frequency will eventually be accessed,
given that the NUMA balancing page table scanning period can be quite
long (e.g. 60 seconds).  So in this patchset, we implement a new hot
page identification algorithm based on the latency between NUMA
balancing page table scanning and the hint page fault, which is a kind
of most frequently accessed (MFU) algorithm.
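
As a rough illustration of the idea above (not the code from the
patches), the sketch below treats a page as hot when the time between
the NUMA balancing scan that marked it and the subsequent hint page
fault is below a threshold.  The function names, the millisecond
units, and the 1000 ms threshold are assumptions made for the example.

```python
# Illustrative sketch only; names, units, and the threshold are
# assumptions, not taken from the patches.

def hint_fault_latency_ms(scan_time_ms: int, fault_time_ms: int) -> int:
    """Time between the page table scan and the hint page fault."""
    return fault_time_ms - scan_time_ms


def is_hot(scan_time_ms: int, fault_time_ms: int,
           threshold_ms: int = 1000) -> bool:
    """A short latency means the page was accessed soon after being
    scanned, which suggests a high access frequency."""
    return hint_fault_latency_ms(scan_time_ms, fault_time_ms) < threshold_ms


# Page A faults 200 ms after the scan, page B faults 45 s after it.
pages = {"A": (10_000, 10_200), "B": (10_000, 55_000)}
hot_pages = [name for name, (scan, fault) in pages.items()
             if is_hot(scan, fault)]
print(hot_pages)
```

Both pages in the example are accessed within a 60 second scanning
period, so an MRU-style selection would consider both for promotion,
while the latency-based check selects only page A.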

In NUMA balancing memory tiering mode, if there are hot pages in the
slow memory node and cold pages in the fast memory node, we need to
promote the hot pages to the fast memory node and demote the cold
pages to the slow memory node.

One choice is to promote/demote as fast as possible.  But the CPU
cycles and memory bandwidth consumed by a high promoting/demoting
throughput will hurt the latency of some workloads because of access
inflation and slow memory bandwidth contention.

One way to resolve this issue is to restrict the maximum
promoting/demoting throughput.  The promoting/demoting will take
longer to finish, but the workload latency will be better.  This is
implemented in this patchset as the page promotion rate limit
mechanism.
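
To make the rate limit idea concrete, here is a minimal sketch of a
fixed-window limiter that caps how many pages may be promoted per
second.  It is not the mechanism implemented in the patches; the class
name, its fields, and the one-second window are assumptions chosen for
the example.

```python
# Illustrative sketch only; the class, its fields, and the one-second
# window are assumptions, not the mechanism implemented in the patches.

class PromotionRateLimiter:
    def __init__(self, limit_pages_per_sec: int):
        self.limit = limit_pages_per_sec
        self.window_start = 0   # second in which the current window began
        self.promoted = 0       # pages promoted in the current window

    def try_promote(self, now_sec: int, nr_pages: int = 1) -> bool:
        """Return True if promoting nr_pages now stays within the limit."""
        if now_sec != self.window_start:
            # A new one-second window starts: reset the counter.
            self.window_start = now_sec
            self.promoted = 0
        if self.promoted + nr_pages > self.limit:
            return False        # over budget: leave the page in slow memory
        self.promoted += nr_pages
        return True


limiter = PromotionRateLimiter(limit_pages_per_sec=3)
results = [limiter.try_promote(now_sec=1) for _ in range(5)]
print(results)                          # [True, True, True, False, False]
print(limiter.try_promote(now_sec=2))   # True
```

Pages that are refused stay in slow memory and can be promoted in a
later window, which is why promotion takes longer overall while the
instantaneous overhead stays bounded.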

The promotion hot threshold depends on the workload and the system
configuration.  So in this patchset, a method to adjust the hot
threshold automatically is implemented.  The basic idea is to control
the number of candidate promotion pages to match the promotion rate
limit.
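
The feedback loop described above can be sketched as follows.  This is
only an illustration of the basic idea, not the adjustment logic from
the patches; the step size, the bounds, and the 10% tolerance band are
assumptions.

```python
# Illustrative sketch only; the step size, bounds, and the 10% tolerance
# are assumptions, not values from the patches.

def adjust_threshold(threshold_ms, nr_candidates, rate_limit_pages,
                     step_ms=100, min_ms=100, max_ms=5000):
    """Return the hot threshold to use for the next adjustment period."""
    if nr_candidates > rate_limit_pages * 1.1:
        # Too many candidates: be stricter about what counts as hot.
        threshold_ms -= step_ms
    elif nr_candidates < rate_limit_pages * 0.9:
        # Too few candidates: relax the threshold.
        threshold_ms += step_ms
    return max(min_ms, min(max_ms, threshold_ms))


threshold = 1000
for candidates in (5000, 4000, 1000, 300):
    threshold = adjust_threshold(threshold, candidates,
                                 rate_limit_pages=1000)
    print(candidates, threshold)
```

When the number of candidates is close to the rate limit the threshold
is left alone, so it settles at a value suited to the current workload
instead of being fixed in advance.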

We used the pmbench memory accessing benchmark to test the patchset on
a 2-socket server system with DRAM and PMEM installed.  The test
results are as follows:

                   pmbench score    promote rate
                    (accesses/s)          (MB/s)
                   -------------    ------------
base                 146887704.1           725.6
hot selection        165695601.2           544.0
rate limit           162814569.8           165.2
auto adjustment      170495294.0           136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases
about 4.7%, and promote rate decreases about 17.1%, compared with rate
limit patch.
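
The percentages above can be reproduced from the table with a short
calculation.  The snippet below uses only the figures reported in the
table; the helper name pct_change is just for illustration.

```python
# Figures copied from the results table above:
# (pmbench score in accesses/s, promote rate in MB/s).
results = {
    "base":            (146887704.1, 725.6),
    "hot selection":   (165695601.2, 544.0),
    "rate limit":      (162814569.8, 165.2),
    "auto adjustment": (170495294.0, 136.9),
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


names = list(results)
changes = {}
# Each configuration is compared with the one listed just before it.
for prev, cur in zip(names, names[1:]):
    score = pct_change(results[prev][0], results[cur][0])
    rate = pct_change(results[prev][1], results[cur][1])
    changes[cur] = (score, rate)
    print(f"{cur} vs {prev}: score {score:+.1f}%, promote rate {rate:+.1f}%")
```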

Baolin helped to test the patchset with MySQL on a machine which
contains 1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The TPS can be improved by about 5%.

Changelogs:

v4:

- Rebased on v5.19-rc3

- Collected reviewed-by and tested-by.

v3:

- Rebased on v5.19-rc1

- Renamed newly-added fields in struct pglist_data.

v2:

- Added ABI document for promote rate limit per Andrew's comments.  Thanks!

- Added function comments when necessary per Andrew's comments.

- Addressed other comments from Andrew Morton.

Best Regards,
Huang, Ying