
[0/3] memory tiering: hot page selection

Message ID 20220408071222.219689-1-ying.huang@intel.com (mailing list archive)

Message

Huang, Ying April 8, 2022, 7:12 a.m. UTC
To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory node need to be
identified.  Essentially, the original NUMA balancing implementation
selects and promotes the most recently accessed (MRU) pages.  But a
recently accessed page may still be cold.  So in this patchset, we
implement a new hot page identification algorithm based on the latency
between NUMA balancing page table scanning and hint page fault.
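The selection criterion can be sketched roughly as follows (illustrative C
only, not the kernel implementation; all names are hypothetical).  The scan
time is recorded when NUMA balancing unmaps a page to arm the hint fault,
and a short scan-to-fault latency marks the page as hot:

```c
#include <stdbool.h>

/*
 * Illustrative sketch only.  A page that faults shortly after being
 * scanned is being accessed frequently enough to be worth promoting;
 * a long scan-to-fault latency means the page is cold.
 */
static bool numa_page_is_hot(unsigned long scan_time_ms,
			     unsigned long fault_time_ms,
			     unsigned long hot_threshold_ms)
{
	return fault_time_ms - scan_time_ms < hot_threshold_ms;
}
```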

Hot page promotion can also incur some overhead in the system.  To
control this overhead, a simple promotion rate limit mechanism is
implemented.
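The rate limit idea can be sketched as a simple per-window byte budget
(illustrative C only; the names and the windowing scheme are assumptions,
not the kernel implementation):

```c
#include <stdbool.h>

/* Hypothetical sketch: allow at most limit_bytes of promotion per
 * fixed-length time window. */
struct promote_rate_limit {
	unsigned long window_start;	/* when the current window began */
	unsigned long promoted_bytes;	/* bytes promoted in this window */
};

static bool promotion_rate_limited(struct promote_rate_limit *rl,
				   unsigned long now,
				   unsigned long window_len,
				   unsigned long limit_bytes,
				   unsigned long nr_bytes)
{
	if (now - rl->window_start >= window_len) {
		rl->window_start = now;	/* start a new window */
		rl->promoted_bytes = 0;
	}
	if (rl->promoted_bytes + nr_bytes > limit_bytes)
		return true;		/* over budget: skip this promotion */
	rl->promoted_bytes += nr_bytes;
	return false;
}
```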

The hot threshold used to identify hot pages is usually workload
dependent.  So we also implemented an automatic hot threshold
adjustment algorithm.  The basic idea is to increase/decrease the hot
threshold so that the number of pages passing the hot threshold
(promotion candidates) stays near the rate limit.
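The adjustment loop can be sketched as follows (illustrative C; the step
size and names are hypothetical, not the kernel implementation).  Too many
candidates tighten (lower) the threshold; too few relax (raise) it:

```c
/*
 * Illustrative sketch: nudge the hot threshold so that the number of
 * promotion candidates tracks the target derived from the rate limit.
 */
static unsigned long adjust_hot_threshold(unsigned long threshold,
					  unsigned long nr_candidates,
					  unsigned long target,
					  unsigned long max_threshold)
{
	unsigned long step = threshold / 16 ? threshold / 16 : 1;

	if (nr_candidates > target)
		threshold = threshold > step ? threshold - step : 1;
	else if (nr_candidates < target && threshold + step <= max_threshold)
		threshold += step;
	return threshold;
}
```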

We tested the patchset with the pmbench memory accessing benchmark on
a 2-socket server system with DRAM and PMEM installed.  The test
results are as follows:

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With the hot page selection patch [1/3], the pmbench score increases by
about 12.8%, and the promote rate (overhead) decreases by about 25.0%,
compared with the base kernel.

With the rate limit patch [2/3], the pmbench score decreases by about
1.7%, and the promote rate decreases by about 69.6%, compared with the
hot page selection patch.

With the threshold auto adjustment patch [3/3], the pmbench score
increases by about 4.7%, and the promote rate decreases by about 17.1%,
compared with the rate limit patch.

Best Regards,
Huang, Ying

Comments

Andrew Morton April 9, 2022, 4:07 a.m. UTC | #1
On Fri,  8 Apr 2022 15:12:19 +0800 Huang Ying <ying.huang@intel.com> wrote:

> To optimize page placement in a memory tiering system with NUMA
> balancing, the hot pages in the slow memory node need to be
> identified.  Essentially, the original NUMA balancing implementation
> selects and promotes the most recently accessed (MRU) pages.  But the
> recently accessed pages may be cold.

Wait.  What.  How on earth could the most recently accessed page be
considered "cold"?

Changelog needs work, please.

> So in this patchset, we
> implement a new hot page identification algorithm based on the latency
> between NUMA balancing page table scanning and hint page fault.

That sounds reasonable.

> And the hot page promotion can incur some overhead in the system.  To
> control the overhead a simple promotion rate limit mechanism is
> implemented.

That sounds like a hack to fix a problem you just added?  Maybe a
reasonable one..

> The hot threshold used to identify the hot pages is workload dependent
> usually.  So we also implemented a hot threshold automatic adjustment
> algorithm.  The basic idea is to increase/decrease the hot threshold
> to make the number of pages that pass the hot threshold (promote
> candidate) near the rate limit.

hot threshold is tweakable via debugfs.  So it's an unstable interface,
undocumented, may be withdrawn at any time, etc.

Please justify this.  Is it intended that this interface be removed? 
If yes, please describe the plan for this in the changelog.  If no then
this interface should be in sysfs, it should be fully documented in the
kernel tree and it should be fully changelogged (preferably with usage
examples) so that reviewers can best understand what is a permanent
extension to the kernel API which we must maintain for all time.



Does this code build correctly if !LAST_CPUPID_NOT_IN_PAGE_FLAGS?


> TODO: Add ABI document for new sysctl knob.

um, yes.


check_cpupid() is a poor function name.  Check for what?  Liver damage?
A better identifier would be cpupid_valid(), perhaps.  And perhaps it
should be implemented in mm.h.  And even documented.


Oddly, the proposed changes do a decent job of documenting
intra-function changes but a poor job of documenting newly added
functions.  Please review every new function and decide whether it is
so simple and obvious that it can be added to Linux without any inline
description at all.  

pgdat_free_space_enough() and  numa_hint_fault_latency().

numa_migration_check_rate_limit() particularly.  There's that "check"
word again.  Check for what?



The numa_balancing_rate_limit_mbps sysctl.  I assume "b" means "bytes".
That's better than "pages", but still.  Machines vary a lot in how
many bytes they have and in the speed at which they process those
bytes.  Is there some metric which we can use here which at least
partially insulates us from this variability?  So that sysadmins don't
need to set all their machines differently, based upon their size and
speed?


numa_threshold_ts is apparently in units of jiffies.  This was not
documented at the definition site, and it should be.  But jiffies can
vary between machines and configs.  Why add this inter-machine and
inter-config uncertainty/variability when we have include/linux/ktime.h?
Huang, Ying April 11, 2022, 8:16 a.m. UTC | #2
Hi, Andrew,

Thanks a lot for your comments!

On Fri, 2022-04-08 at 21:07 -0700, Andrew Morton wrote:
> On Fri,  8 Apr 2022 15:12:19 +0800 Huang Ying <ying.huang@intel.com>
> wrote:
> 
> > To optimize page placement in a memory tiering system with NUMA
> > balancing, the hot pages in the slow memory node need to be
> > identified.  Essentially, the original NUMA balancing
> > implementation
> > selects and promotes the most recently accessed (MRU) pages.  But
> > the
> > recently accessed pages may be cold.
> 
> Wait.  What.  How on earth could the most recently accessed page be
> considered "cold"?
> 
> Changelog needs work, please.

Sorry for the confusion.  What I really want to say is:

The pages with low access frequency should be kept in the slow memory
node.  But these pages will still be accessed occasionally, so they may
be selected and migrated to the fast memory node by the original NUMA
balancing implementation upon access.

Is the new wording better?

> > So in this patchset, we
> > implement a new hot page identification algorithm based on the
> > latency
> > between NUMA balancing page table scanning and hint page fault.
> 
> That sounds reasonable.
> 
> > And the hot page promotion can incur some overhead in the system. 
> > To
> > control the overhead a simple promotion rate limit mechanism is
> > implemented.
> 
> That sounds like a hack to fix a problem you just added?  Maybe a
> reasonable one..

Sorry again for the confusion.  I'll try to reword it as follows:

If there are hot pages in the slow memory node and cold pages in the
fast memory node, we need to promote/demote hot/cold pages between the
fast and slow memory nodes.

One choice is to promote/demote as fast as possible.  But the CPU
cycles and memory bandwidth consumed by the fast promoting/demoting may
hurt the performance of the workload, e.g., the workload latency.

One way to resolve this issue is to restrict the max promoting/demoting
bandwidth.  It will take longer to finish the promoting/demoting.  But
the workload latency may be better.  This is implemented in this
patchset as the page promotion rate limit mechanism.

> > The hot threshold used to identify the hot pages is workload
> > dependent
> > usually.  So we also implemented a hot threshold automatic
> > adjustment
> > algorithm.  The basic idea is to increase/decrease the hot
> > threshold
> > to make the number of pages that pass the hot threshold (promote
> > candidate) near the rate limit.
> 
> hot threshold is tweakable via debugfs.  So it's an unstable
> interface,
> undocumented, may be withdrawn at any time, etc.
> 
> Please justify this.  Is it intended that this interface be removed? 
> If yes, please describe the plan for this in the changelog.  If no
> then
> this interface should be in sysfs, it should be fully documented in
> the
> kernel tree and it should be fully changelogged (preferably with
> usage
> examples) so that reviewers can best understand what is a permanent
> extension to the kernel API which we must maintain for all time.

Previously, I had added a hot threshold sysfs interface.  But later, I
thought that it's hard for users to determine the right hot threshold
value.  On the other hand, in patch [3/3], we implemented a hot
threshold automatic adjustment algorithm, which makes it less
important to set the hot threshold from user space.  So I moved the
hot threshold setting interface to debugfs, just for users to
experiment with.  If it turns out that a formal interface is
eventually needed, I will add one.

> Does this code build correctly if !LAST_CPUPID_NOT_IN_PAGE_FLAGS?
> 

I have relied on the 0-Day kernel build service to catch build bugs on
more configurations and architectures.  I just tried the i386
architecture, but found that NUMA balancing isn't supported there, so I
cannot build test this for i386 either.  Should I try more
configurations?

> > TODO: Add ABI document for new sysctl knob.
> 
> um, yes.
> 

Sure.  Will do that in the next version.

> check_cpupid() is a poor function name.  Check for what?  Liver
> damage?
> A better identifier would be cpupid_valid(), perhaps.  And perhaps it
> should be implemented in mm.h.  And even documented.

Yes.  cpupid_valid() is much better.  I will add documentation for it
in the next version.

> Oddly, the proposed changes do a decent job of documenting
> intra-function changes but a poor job of documenting newly added
> functions.  Please review every new function and decide whether it is
> so simple and obvious that it can be added to Linux without any
> inline
> description at all.  
> 
> pgdat_free_space_enough() and  numa_hint_fault_latency().

Sure.  Will check all new functions and add documentation where
necessary.

> numa_migration_check_rate_limit() particularly.  There's that "check"
> word again.  Check for what?
> 

The original idea is to check whether the migration is rate limited.
Will change this to numa_migration_rate_limit().  It will return true
if the migration rate exceeds the limit.

> 
> The numa_balancing_rate_limit_mbps sysctl.  I assume "b" means
> "bytes".
> That's better than "pages", but still.  Machines vary a lot in how
> many bytes they have and in the speed at which they process those
> bytes.  Is there some metric which we can use here which at least
> partially insulates us from this variability?  So that sysadmins
> don't
> need to set all their machines differently, based upon their size and
> speed?

Will change the sysctl to numa_balancing_promote_rate_limit_MBps to
avoid possible confusion.

The default rate limit is 64 GB/s, that is, effectively no limitation,
as Mel Gorman suggested.

To reduce possible disturbance to the workload, a rule of thumb is to
set it to 10% of the max slow memory bandwidth.  But we don't know how
to detect that automatically yet.

> numa_threshold_ts is apparently in units of jiffies.  This was not
> documented at the definition site, and it should be.  But jiffies can
> vary between machines and configs.  Why add this inter-machine and
> inter-config uncertainty/variability when we have
> include/linux/ktime.h?
> 

We need to call cmpxchg() on it, so ktime_t may not be the best choice
here.  But I get your point.  I will use a more general time unit,
e.g., milliseconds.
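The constraint can be illustrated with C11 atomics (a user-space sketch,
not the kernel code; names are hypothetical): compare-exchange works
naturally on a machine-word millisecond timestamp, so only one CPU wins
the right to adjust the threshold for a given interval.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: the last threshold-adjustment time, in ms. */
static _Atomic unsigned long numa_threshold_ts;

/* Succeeds for exactly one caller that saw the old timestamp; other
 * racing callers see the exchange fail and skip the adjustment. */
static bool try_claim_threshold_update(unsigned long seen_ms,
				       unsigned long now_ms)
{
	return atomic_compare_exchange_strong(&numa_threshold_ts,
					      &seen_ms, now_ms);
}
```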

Best Regards,
Huang, Ying