
percpu: reduce the number of searches calculating best upa

Message ID 20201102052647.8211-1-vvghjk1234@gmail.com (mailing list archive)
State New, archived
Series percpu: reduce the number of searches calculating best upa

Commit Message

Wonhyuk Yang Nov. 2, 2020, 5:26 a.m. UTC
From: Wonhyuk Yang <vvghjk1234@gmail.com>

The best upa is determined by iterating over every value from 1 to
max_upa. If alloc_size is a power of 2, the number of iterations
drops to a logarithmic level.

Prime factorization of alloc_size makes it easy to enumerate the
possible upas. When alloc_size is a power of 2, the cost of prime
factorization can be avoided and the possible upas are simply
1, 2, 4, ..., max_upa.

Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
---
 mm/percpu.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)
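
To illustrate the effect on the search, here is a minimal user-space sketch (not the kernel code itself; the sizes are hypothetical, and the real loop's divisibility and page-alignment filtering is reduced to simply counting loop trips). It only shows that the power-of-2 walk visits ilog2(max_upa) + 1 candidates instead of max_upa:

#include <stdio.h>

int main(void)
{
	/* hypothetical power-of-2 sizes, roughly what pcpu_build_alloc_info() might see */
	unsigned long alloc_size = 2UL << 20;		/* 2 MiB allocation atom */
	unsigned long min_unit_size = 64UL << 10;	/* 64 KiB unit */
	unsigned long max_upa = alloc_size / min_unit_size;	/* 32 */
	unsigned long upa;
	int old_iters = 0, new_iters = 0;

	/* old search: try every candidate from max_upa down to 1 */
	for (upa = max_upa; upa; upa--)
		old_iters++;

	/* new search: alloc_size is a power of 2, so only 1, 2, 4, ... can divide it */
	for (upa = max_upa; upa; upa >>= 1)
		new_iters++;

	printf("old walk: %d candidates, power-of-2 walk: %d candidates\n",
	       old_iters, new_iters);
	return 0;
}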

Comments

Dennis Zhou Nov. 6, 2020, 11:33 p.m. UTC | #1
Hi,

On Mon, Nov 02, 2020 at 02:26:47PM +0900, Wonhyuk Yang wrote:
> From: Wonhyuk Yang <vvghjk1234@gmail.com>
> 
> The best upa is determined by iterating over every value from 1 to
> max_upa. If alloc_size is a power of 2, the number of iterations
> drops to a logarithmic level.
> 
> Prime factorization of alloc_size makes it easy to enumerate the
> possible upas. When alloc_size is a power of 2, the cost of prime
> factorization can be avoided and the possible upas are simply
> 1, 2, 4, ..., max_upa.
> 
> Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
> ---
>  mm/percpu.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 66a93f096394..a24f3973744f 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2689,18 +2689,17 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
>  
>  	/*
>  	 * Determine min_unit_size, alloc_size and max_upa such that
> -	 * alloc_size is multiple of atom_size and is the smallest
> -	 * which can accommodate 4k aligned segments which are equal to
> -	 * or larger than min_unit_size.
> +	 * alloc_size is the maximum of min_unit_size and atom_size.
> +	 * Also, alloc_size is a power of 2 because both min_unit_size
> +	 * and atom_size are powers of 2.
>  	 */
>  	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
> +	min_unit_size = roundup_pow_of_two(min_unit_size);

While this may make sense for the vast majority of users, there remain
users, such as embedded devices, that page in the first chunk and have
fairly limited use of percpu memory. In those cases we wouldn't want to
round up min_unit_size: the waste would be small, but it is still not
worth changing the behavior here.
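
For a rough, made-up example of that waste: with atom_size of a single
4 KiB page and a size_sum of 40 KiB, the current code ends up with 40 KiB
units, while rounding min_unit_size up to a power of 2 would make each
unit 64 KiB, i.e. 24 KiB of extra percpu address space per unit.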

>  
>  	/* determine the maximum # of units that can fit in an allocation */
> -	alloc_size = roundup(min_unit_size, atom_size);
> -	upa = alloc_size / min_unit_size;
> -	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
> -		upa--;
> -	max_upa = upa;
> +	alloc_size = max_t(size_t, min_unit_size, atom_size);
> +	max_upa = alloc_size / min_unit_size;
> +
>  
>  	/* group cpus according to their proximity */
>  	for_each_possible_cpu(cpu) {
> @@ -2727,12 +2726,9 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
>  	 * Related to atom_size, which could be much larger than the unit_size.
>  	 */
>  	last_allocs = INT_MAX;
> -	for (upa = max_upa; upa; upa--) {
> +	for (upa = max_upa; upa; upa >>= 1) {
>  		int allocs = 0, wasted = 0;
>  
> -		if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
> -			continue;
> -
>  		for (group = 0; group < nr_groups; group++) {
>  			int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
>  			allocs += this_allocs;
> -- 
> 2.17.1
> 

Overall, I'm not inclined to take this because it is trying to optimize
boot-time code, which runs at most a few times, by introducing a new
assumption. I personally find this code a little complex to parse, so
I'd rather not make the change unless it aids maintainability or is
side-effect free.

Thanks,
Dennis

Patch

diff --git a/mm/percpu.c b/mm/percpu.c
index 66a93f096394..a24f3973744f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2689,18 +2689,17 @@  static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 
 	/*
 	 * Determine min_unit_size, alloc_size and max_upa such that
-	 * alloc_size is multiple of atom_size and is the smallest
-	 * which can accommodate 4k aligned segments which are equal to
-	 * or larger than min_unit_size.
+	 * alloc_size is the maximum of min_unit_size and atom_size.
+	 * Also, alloc_size is a power of 2 because both min_unit_size
+	 * and atom_size are powers of 2.
 	 */
 	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
+	min_unit_size = roundup_pow_of_two(min_unit_size);
 
 	/* determine the maximum # of units that can fit in an allocation */
-	alloc_size = roundup(min_unit_size, atom_size);
-	upa = alloc_size / min_unit_size;
-	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
-		upa--;
-	max_upa = upa;
+	alloc_size = max_t(size_t, min_unit_size, atom_size);
+	max_upa = alloc_size / min_unit_size;
+
 
 	/* group cpus according to their proximity */
 	for_each_possible_cpu(cpu) {
@@ -2727,12 +2726,9 @@  static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	 * Related to atom_size, which could be much larger than the unit_size.
 	 */
 	last_allocs = INT_MAX;
-	for (upa = max_upa; upa; upa--) {
+	for (upa = max_upa; upa; upa >>= 1) {
 		int allocs = 0, wasted = 0;
 
-		if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
-			continue;
-
 		for (group = 0; group < nr_groups; group++) {
 			int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
 			allocs += this_allocs;
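
As a rough sanity check of the patched computation above, the sketch below
(user-space only; the sizes are hypothetical, offset_in_page() is approximated
with a hard-coded 4 KiB PAGE_SIZE, and the helper names are local stand-ins
for the kernel's roundup() and roundup_pow_of_two()) derives max_upa both the
old way and the new way. With these particular inputs both arrive at the same
max_upa; the difference raised in review is that the new scheme may also round
min_unit_size itself up.

#include <stdio.h>

#define PAGE_SIZE	4096UL

/* round up to the next multiple of 'to' (stand-in for the kernel's roundup()) */
static unsigned long roundup_to(unsigned long x, unsigned long to)
{
	return ((x + to - 1) / to) * to;
}

/* round up to the next power of 2 (stand-in for roundup_pow_of_two()) */
static unsigned long roundup_pow2(unsigned long x)
{
	unsigned long p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

int main(void)
{
	/* hypothetical inputs */
	unsigned long size_sum = 44UL << 10;	/* 44 KiB static + reserved + dynamic */
	unsigned long atom_size = 2UL << 20;	/* 2 MiB allocation atom */
	unsigned long min_unit, alloc, upa, old_max, new_max;

	/* old scheme: keep min_unit_size as-is, walk upa down until it fits */
	min_unit = size_sum;			/* assume size_sum >= PCPU_MIN_UNIT_SIZE */
	alloc = roundup_to(min_unit, atom_size);
	upa = alloc / min_unit;
	while (alloc % upa || (alloc / upa) % PAGE_SIZE)
		upa--;
	old_max = upa;

	/* new scheme: round min_unit_size to a power of 2 first */
	min_unit = roundup_pow2(size_sum);
	alloc = atom_size > min_unit ? atom_size : min_unit;
	new_max = alloc / min_unit;

	printf("old max_upa = %lu, new max_upa = %lu\n", old_max, new_max);
	return 0;
}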