
[RFC,-V2,2/8] autonuma, memory tiering: Rate limit NUMA migration throughput

Message ID 20200218082634.1596727-3-ying.huang@intel.com (mailing list archive)
State New, archived
Series autonuma: Optimize memory placement in memory tiering system

Commit Message

Huang, Ying Feb. 18, 2020, 8:26 a.m. UTC
From: Huang Ying <ying.huang@intel.com>

In autonuma memory tiering mode, hot PMEM (persistent memory) pages
can be migrated to DRAM via autonuma.  But this migration incurs some
overhead of its own, so it may sometimes hurt workload performance.
To avoid disturbing the workload too much, the migration throughput
should be rate-limited.

On the other hand, in some situations, for example when some
workloads exit, many DRAM pages become free, so that pages of the
remaining workloads can be migrated to DRAM.  To respond quickly to
such workload changes, it's better to migrate pages faster.

To address the above two requirements, the following rate limit
algorithm is used:

- If there is enough free memory in the DRAM node (that is, more than
  high watermark + 2 * rate limit pages), NUMA migration throughput is
  not rate-limited, so that the system can respond quickly to workload
  changes.

- Otherwise, count the number of pages that autonuma tries to migrate
  to the DRAM node; if the count exceeds the limit specified by the
  user, stop NUMA migration until the next second.

A new sysctl knob, kernel.numa_balancing_rate_limit_mbps, is added
for users to specify the limit.  If its value is 0, the default value
(the high watermark) is used.
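
For illustration, a small standalone C sketch of the unit conversion
the patch performs with "rate_limit << (20 - PAGE_SHIFT)", assuming
4KB pages; the 100 MB/s value is only an example:

#include <stdio.h>

#define PAGE_SHIFT 12	/* assume 4KB pages for this illustration */

int main(void)
{
	unsigned int mbps = 100;	/* kernel.numa_balancing_rate_limit_mbps */
	/* 1MB = 1 << 20 bytes, 1 page = 1 << PAGE_SHIFT bytes */
	unsigned long pages_per_sec = (unsigned long)mbps << (20 - PAGE_SHIFT);

	/* 100 MB/s -> 25600 pages/s; the free space check in the patch
	 * requires more than high watermark + 2 * this budget to skip
	 * rate limiting entirely.
	 */
	printf("%u MB/s = %lu pages/s\n", mbps, pages_per_sec);
	return 0;
}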

TODO: Add ABI document for new sysctl knob.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mmzone.h       |  7 ++++
 include/linux/sched/sysctl.h |  6 ++++
 kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c              |  8 +++++
 mm/vmstat.c                  |  3 ++
 5 files changed, 86 insertions(+)

Comments

Mel Gorman Feb. 18, 2020, 8:57 a.m. UTC | #1
On Tue, Feb 18, 2020 at 04:26:28PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In autonuma memory tiering mode, hot PMEM (persistent memory) pages
> can be migrated to DRAM via autonuma.  But this migration incurs some
> overhead of its own, so it may sometimes hurt workload performance.
> To avoid disturbing the workload too much, the migration throughput
> should be rate-limited.
> 
> On the other hand, in some situations, for example when some
> workloads exit, many DRAM pages become free, so that pages of the
> remaining workloads can be migrated to DRAM.  To respond quickly to
> such workload changes, it's better to migrate pages faster.
> 
> To address the above two requirements, the following rate limit
> algorithm is used:
> 
> - If there is enough free memory in the DRAM node (that is, more than
>   high watermark + 2 * rate limit pages), NUMA migration throughput is
>   not rate-limited, so that the system can respond quickly to workload
>   changes.
> 
> - Otherwise, count the number of pages that autonuma tries to migrate
>   to the DRAM node; if the count exceeds the limit specified by the
>   user, stop NUMA migration until the next second.
> 
> A new sysctl knob, kernel.numa_balancing_rate_limit_mbps, is added
> for users to specify the limit.  If its value is 0, the default value
> (the high watermark) is used.
> 
> TODO: Add ABI document for new sysctl knob.
> 

I very strongly suggest that this only be done as a last resort and with
supporting data as to why it is necessary. NUMA balancing did have rate
limiting at one point and it was removed when balancing was smart enough
to mostly do the right thing without rate limiting. I posted a series
that reconciled NUMA balancing with the CPU load balancer recently which
further reduced spurious and unnecessary migrations. I would not like
to see rate limiting reintroduced unless there is no other way of fixing
saturation of memory bandwidth due to NUMA balancing. Even if it's
needed as a stopgap while the feature is finalised, it should be
introduced late in the series explaining why it's temporarily necessary.
Huang, Ying Feb. 19, 2020, 6:01 a.m. UTC | #2
Hi, Mel,

Thanks a lot for your review!

Mel Gorman <mgorman@suse.de> writes:

> On Tue, Feb 18, 2020 at 04:26:28PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In autonuma memory tiering mode, hot PMEM (persistent memory) pages
>> can be migrated to DRAM via autonuma.  But this migration incurs some
>> overhead of its own, so it may sometimes hurt workload performance.
>> To avoid disturbing the workload too much, the migration throughput
>> should be rate-limited.
>> 
>> On the other hand, in some situations, for example when some
>> workloads exit, many DRAM pages become free, so that pages of the
>> remaining workloads can be migrated to DRAM.  To respond quickly to
>> such workload changes, it's better to migrate pages faster.
>> 
>> To address the above two requirements, the following rate limit
>> algorithm is used:
>> 
>> - If there is enough free memory in the DRAM node (that is, more than
>>   high watermark + 2 * rate limit pages), NUMA migration throughput is
>>   not rate-limited, so that the system can respond quickly to workload
>>   changes.
>> 
>> - Otherwise, count the number of pages that autonuma tries to migrate
>>   to the DRAM node; if the count exceeds the limit specified by the
>>   user, stop NUMA migration until the next second.
>> 
>> A new sysctl knob, kernel.numa_balancing_rate_limit_mbps, is added
>> for users to specify the limit.  If its value is 0, the default value
>> (the high watermark) is used.
>> 
>> TODO: Add ABI document for new sysctl knob.
>> 
>
> I very strongly suggest that this only be done as a last resort and with
> supporting data as to why it is necessary. NUMA balancing did have rate
> limiting at one point and it was removed when balancing was smart enough
> to mostly do the right thing without rate limiting. I posted a series
> that reconciled NUMA balancing with the CPU load balancer recently which
> further reduced spurious and unnecessary migrations. I would not like
> to see rate limiting reintroduced unless there is no other way of fixing
> saturation of memory bandwidth due to NUMA balancing. Even if it's
> needed as a stopgap while the feature is finalised, it should be
> introduced late in the series explaining why it's temporarily necessary.

This adds a rate limit only to NUMA migration between different
types of memory nodes (e.g. from PMEM to DRAM), not between the same
type of memory nodes (e.g. from DRAM to DRAM).  Sorry for the
confusing patch subject.  I will change it in the next version.

And the rate limit is an inherent part of the algorithm used in the
patchset.  We use the LRU algorithm to find cold pages on the fast
memory node (e.g. DRAM), and NUMA hint page fault latency to find hot
pages on the slow memory node (e.g. PMEM).  But we don't compare the
temperature of the cold DRAM pages against that of the hot PMEM
pages.  Instead, we just try to exchange some cold DRAM pages and some
hot PMEM pages, even if the cold DRAM pages are hotter than the hot
PMEM pages.  The rate limit controls how many pages are exchanged
between DRAM and PMEM per second.  This isn't perfect, but it works
well in our testing.
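
To make the per-second accounting concrete, here is a rough
single-threaded userspace model of what numa_migration_check_rate_limit()
in the patch below does (the names are illustrative; the kernel version
uses jiffies, per-node fields, and cmpxchg):

#include <stdbool.h>
#include <time.h>

/* Illustrative model of the per-node accounting: a monotonically
 * increasing "pages tried" counter plus a once-per-second snapshot. */
struct node_budget {
	unsigned long tried;		/* like the NUMA_TRY_MIGRATE node stat */
	unsigned long window_start;	/* snapshot of "tried" at window start */
	time_t window_ts;		/* start of the current 1s window */
};

bool may_promote(struct node_budget *nb, unsigned long rate_limit_pages,
		 unsigned long nr_pages)
{
	time_t now = time(NULL);

	nb->tried += nr_pages;
	if (now > nb->window_ts) {	/* a new one-second window begins */
		nb->window_ts = now;
		nb->window_start = nb->tried;
	}
	/* Allow promotion only while this second's total stays within budget. */
	return nb->tried - nb->window_start <= rate_limit_pages;
}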

Best Regards,
Huang, Ying

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dfb09106ad70..6e7a28becdc2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -249,6 +249,9 @@  enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+#ifdef CONFIG_NUMA_BALANCING
+	NUMA_TRY_MIGRATE,	/* pages to try to migrate via NUMA balancing */
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -786,6 +789,10 @@  typedef struct pglist_data {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_ts;
+	unsigned long numa_try;
+#endif
 	/* Fields commonly accessed by the page reclaim scanner */
 
 	/*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 80dc5030c797..c4b27790b901 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -43,6 +43,12 @@  extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 
+#ifdef CONFIG_NUMA_BALANCING
+extern unsigned int sysctl_numa_balancing_rate_limit;
+#else
+#define sysctl_numa_balancing_rate_limit	0
+#endif
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..ef694816150b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,12 @@  unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Restrict the NUMA migration throughput in MB per second for each
+ * target node if there is not enough free space in the target node
+ */
+unsigned int sysctl_numa_balancing_rate_limit;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -1404,6 +1410,43 @@  static inline unsigned long group_weight(struct task_struct *p, int nid,
 	return 1000 * faults / total_faults;
 }
 
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long rate_limit;
+
+	rate_limit = sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      high_wmark_pages(zone) + rate_limit * 2,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
+					    unsigned long rate_limit, int nr)
+{
+	unsigned long try;
+	unsigned long now = jiffies, last_ts;
+
+	mod_node_page_state(pgdat, NUMA_TRY_MIGRATE, nr);
+	try = node_page_state(pgdat, NUMA_TRY_MIGRATE);
+	last_ts = pgdat->numa_ts;
+	if (now > last_ts + HZ &&
+	    cmpxchg(&pgdat->numa_ts, last_ts, now) == last_ts)
+		pgdat->numa_try = try;
+	if (try - pgdat->numa_try > rate_limit)
+		return false;
+	return true;
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu)
 {
@@ -1411,6 +1454,25 @@  bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
 
+	/*
+	 * If memory tiering mode is enabled, try to promote pages
+	 * in the slow memory node to the fast memory node.
+	 */
+	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+	    next_promotion_node(src_nid) != -1) {
+		struct pglist_data *pgdat;
+		unsigned long rate_limit;
+
+		pgdat = NODE_DATA(dst_nid);
+		if (pgdat_free_space_enough(pgdat))
+			return true;
+
+		rate_limit =
+			sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+		return numa_migration_check_rate_limit(pgdat, rate_limit,
+						       hpage_nr_pages(page));
+	}
+
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3756108bb658..2d19e821267a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -419,6 +419,14 @@  static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_rate_limit_mbps",
+		.data		= &sysctl_numa_balancing_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= &sysctl_numa_balancing_mode,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d76714d2fd7c..9326512c612c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1203,6 +1203,9 @@  const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+#ifdef CONFIG_NUMA_BALANCING
+	"numa_try_migrate",
+#endif
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",