[PATCH v3 0/3] memory tiering: hot page selection

Message ID: 20220614081635.194014-1-ying.huang@intel.com

Message

Huang, Ying June 14, 2022, 8:16 a.m. UTC
To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory nodes need to be
identified.  Essentially, the original NUMA balancing implementation
selects the most recently accessed (MRU) pages to promote.  But this
isn't a perfect algorithm for identifying hot pages, because even
pages with quite low access frequency will be accessed eventually,
given that the NUMA balancing page table scanning period can be quite
long (e.g. 60 seconds).  So in this patchset, we implement a new hot
page identification algorithm based on the latency between NUMA
balancing page table scanning and the hint page fault, which is a
kind of most frequently used (MFU) algorithm.
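
For illustration, the check could look roughly like the sketch below.
This is not the exact code in patch [1/3]; page_scan_time() and
hot_threshold_ms are hypothetical names.  The idea is that the NUMA
balancing scanner records a time stamp when it makes a page's PTE
inaccessible, and the hint page fault measures the scan-to-fault
latency:

static bool page_is_hot(struct page *page, unsigned int hot_threshold_ms)
{
	unsigned int latency;

	/* Elapsed time since the scanner made the PTE inaccessible. */
	latency = jiffies_to_msecs(jiffies) - page_scan_time(page);

	/*
	 * A short scan-to-fault latency means the page was accessed
	 * soon after scanning, i.e. it is accessed frequently enough
	 * to be a promotion candidate.
	 */
	return latency < hot_threshold_ms;
}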

In NUMA balancing memory tiering mode, if there are hot pages in the
slow memory nodes and cold pages in the fast memory nodes, we need to
promote/demote the hot/cold pages between the fast and slow memory
nodes.

One choice is to promote/demote as fast as possible.  But the CPU
cycles and memory bandwidth consumed by a high promotion/demotion
throughput will hurt the latency of some workloads, because of access
latency inflation and slow memory bandwidth contention.

A way to resolve this issue is to restrict the maximum
promotion/demotion throughput.  It will take longer to finish the
promotion/demotion, but the workload latency will be better.  This is
implemented in this patchset as the page promotion rate limit
mechanism.
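
As a sketch of the mechanism (the pglist_data field names below are
placeholders, not the exact ones in patch [2/3]), the promotion path
can count candidate pages in a one-second window and skip promotion
once the budget for the window is spent:

static bool promotion_rate_limited(struct pglist_data *pgdat,
				   unsigned long nr_pages,
				   unsigned long limit_pages_per_sec)
{
	unsigned long start = pgdat->promo_window_start;

	/* Reset the counter when a new one-second window begins. */
	if (time_after(jiffies, start + HZ) &&
	    cmpxchg(&pgdat->promo_window_start, start, jiffies) == start)
		pgdat->promo_window_nr = 0;

	if (pgdat->promo_window_nr + nr_pages > limit_pages_per_sec)
		return true;	/* over budget: skip this promotion */

	pgdat->promo_window_nr += nr_pages;
	return false;
}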

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number
of candidate promotion pages to match the promotion rate limit.
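
A sketch of that control loop (again with placeholder names): once
per adjustment period, rescale the threshold by the ratio of the rate
limit to the number of candidate pages actually seen, so that the
candidate rate converges toward the limit:

static void adjust_hot_threshold(struct pglist_data *pgdat,
				 unsigned long limit_pages,
				 unsigned int max_threshold_ms)
{
	unsigned long cand = pgdat->nr_candidates;

	if (!cand)
		return;

	/*
	 * Too many candidates: the threshold is too loose, shrink it.
	 * Too few: enlarge it so that enough pages qualify.
	 */
	pgdat->hot_threshold_ms = clamp_t(unsigned int,
			pgdat->hot_threshold_ms * limit_pages / cand,
			1, max_threshold_ms);
	pgdat->nr_candidates = 0;
}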

We tested the patchset with the pmbench memory accessing benchmark on
a 2-socket server system with DRAM and PMEM installed.  The test
results are as follows:

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With the hot page selection patch [1/3], the pmbench score increases
by about 12.8%, and the promote rate (overhead) decreases by about
25.0%, compared with the base kernel.

With the rate limit patch [2/3], the pmbench score decreases by about
1.7%, and the promote rate decreases by about 69.6%, compared with the
hot page selection patch.

With the threshold auto adjustment patch [3/3], the pmbench score
increases by about 4.7%, and the promote rate decreases by about
17.1%, compared with the rate limit patch.

Changelogs:

v3:

- Rebased on v5.19-rc1

- Renamed newly-added fields in struct pglist_data.

v2:

- Added ABI document for promote rate limit per Andrew's comments.  Thanks!

- Added function comments when necessary per Andrew's comments.

- Addressed other comments from Andrew Morton.

Best Regards,
Huang, Ying

Comments

Johannes Weiner June 14, 2022, 3:30 p.m. UTC | #1
Hi Huang,

Have you had a chance to look at our hot page detection patch that
Hasan sent out some time ago? [1]

It hooks into page reclaim to determine what is and isn't hot. Reclaim
is an existing, well-tested mechanism to do just that. It's just 13
lines of code: set active bit on the first hint fault; promote on the
second one if the active bit is still set. This promotes only pages
hot enough that they can compete with toptier access frequencies.

It's not just convenient, it's also essential to link tier promotion
rate to page aging. Tiered NUMA balancing is about establishing a
global LRU order across two (or more) nodes. LRU promotions *within* a
node require multiple LRU cycles with references. LRU promotions
*between* nodes must follow the same rules, and be subject to the same
aging pressure, or you can get much colder pages promoted into a very
hot workingset and wreak havoc.

We've hammered this patch quite extensively with several Meta
production workloads and it's been working reliably at keeping
reasonable promotion rates.

@@ -4202,6 +4202,19 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
+
+	/* Only migrate pages that are active on non-toptier node */
+	if (numa_promotion_tiered_enabled &&
+		!node_is_toptier(page_nid) &&
+		!PageActive(page)) {
+		count_vm_numa_event(NUMA_HINT_FAULTS);
+		if (page_nid == numa_node_id())
+			count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		mark_page_accessed(page);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto out;
+	}
+
 	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
 			&flags);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);

[1] https://lore.kernel.org/all/20211130003634.35468-1-hasanalmaruf@fb.com/t/#m85b95624622f175ca17a00cc8cc0fc9cc4eeb6d2
Huang, Ying June 15, 2022, 3:47 a.m. UTC | #2
On Tue, 2022-06-14 at 11:30 -0400, Johannes Weiner wrote:
> Hi Huang,

Hi, Johannes,

> Have you had a chance to look at our hot page detection patch that
> Hasan has sent out some time ago? [1]

Yes.  I have seen that patch before.

> It hooks into page reclaim to determine what is and isn't hot. Reclaim
> is an existing, well-tested mechanism to do just that. It's just 13
> lines of code: set active bit on the first hint fault; promote on the
> second one if the active bit is still set. This promotes only pages
> hot enough that they can compete with toptier access frequencies.

In general, I think that patch is good.  And it can work together with
the hot page selection patchset (this series).  That is, if
!PageActive(), then activate the page; otherwise, promote the page if
the hint page fault latency is short enough too.
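
For example, the combined check could look roughly like this sketch
(page_is_hot() being the hypothetical scan-to-fault latency check from
the cover letter, the active bit handling coming from [1]):

/* Should a page on a slow node be promoted on this hint fault? */
static bool should_promote(struct page *page, unsigned int threshold_ms)
{
	if (!PageActive(page)) {
		/* First hint fault: only age the page in the LRU. */
		mark_page_accessed(page);
		return false;
	}

	/*
	 * Active bit still set on a later fault: additionally
	 * require a short scan-to-fault latency before promoting.
	 */
	return page_is_hot(page, threshold_ms);
}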

In a system with a swap device configured, and with continuous memory
pressure on all memory types (including PMEM), the NUMA balancing hint
page faults can help page reclaiming, because page accesses can be
detected much earlier.  And page reclaiming can help page promotion by
keeping recently-not-accessed pages on the inactive list and
recently-accessed pages on the active list.

In a system without a swap device configured and without continuous
memory pressure on the slow tier memory (e.g., PMEM), page reclaiming
doesn't help much, because the active/inactive lists aren't scanned
regularly.  This is true for some users.  And the method in this
series still helps there.

> It's not just convenient, it's also essential to link tier promotion
> rate to page aging. Tiered NUMA balancing is about establishing a
> global LRU order across two (or more) nodes. LRU promotions *within* a
> node require multiple LRU cycles with references.

IMHO, the LRU algorithm is good for page reclaiming, but it isn't
sufficient for page promotion by itself.  It can identify cold pages
well, but its accuracy in identifying hot pages isn't enough.  That
is, it's hard to distinguish between warm pages and hot pages with
LRU/MRU alone.  The hint page fault latency introduced in this series
is meant to help with that.

> LRU promotions
> *between* nodes must follow the same rules, and be subject to the same
> aging pressure, or you can get much colder pages promoted into a very
> hot workingset and wreak havoc.
> 
> We've hammered this patch quite extensively with several Meta
> production workloads and it's been working reliably at keeping
> reasonable promotion rates.

Sounds good.  Do you have some data to share?

Best Regards,
Huang, Ying
Baolin Wang June 20, 2022, 3:19 a.m. UTC | #3
On 6/14/2022 4:16 PM, Huang Ying wrote:
> [snip]

I did a simple test with mysql on my machine, which contains 1 DRAM
node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved by about 5%, based on the data below, and I
think this is a good start for optimizing the promotion.  So for this
series, please feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Without this patchset:
  transactions:                        2080188 (3466.48 per sec.)

With this patch set:
  transactions:                        2174296 (3623.40 per sec.)
Huang, Ying June 20, 2022, 3:24 a.m. UTC | #4
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 6/14/2022 4:16 PM, Huang Ying wrote:
>> [snip]
>
> I did a simple test with mysql on my machine, which contains 1 DRAM
> node (30G) and 1 PMEM node (126G).
>
> sysbench /usr/share/sysbench/oltp_read_write.lua \
> ......
> --tables=200 \
> --table-size=1000000 \
> --report-interval=10 \
> --threads=16 \
> --time=120
>
> The tps can be improved by about 5%, based on the data below, and I
> think this is a good start for optimizing the promotion.  So for this
> series, please feel free to add:
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Without this patchset:
>  transactions:                        2080188 (3466.48 per sec.)
>
> With this patch set:
>  transactions:                        2174296 (3623.40 per sec.)

Thanks a lot!

Best Regards,
Huang, Ying