[next] mm/mincore: improve performance by adding an unlikely hint

Message ID 20250217170934.457266-1-colin.i.king@gmail.com (mailing list archive)
State New

Commit Message

Colin Ian King Feb. 17, 2025, 5:09 p.m. UTC
Adding an unlikely() hint on the masked start comparison error
return path improves run-time performance of the mincore system call.

Benchmarking on an i9-12900 shows an improvement of 7ns on mincore calls
on a 256KB mmap'd region where 50% of the pages were resident.

Results based on running 20 tests with turbo disabled (to reduce
clock freq turbo changes), with 10 second run per test and comparing
the number of mincore calls per second. The % standard deviation of
the 20 tests was ~0.10%, so results are reliable.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
---
 mm/mincore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
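
The benchmark source is not part of this posting. As a rough illustration
of the kind of measurement described in the changelog, here is a minimal
user-space sketch. It assumes an anonymous 256KB mapping in which every
other page is written once, so that roughly half the pages are resident
(huge-page settings could change that ratio). The function name, the batch
size and the timing loop are hypothetical choices, not the author's harness.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define REGION_SIZE ((size_t)256 * 1024)	/* 256KB, as in the changelog */
#define BATCH 1000				/* calls between clock reads (arbitrary) */

/*
 * Hypothetical helper: call mincore() on a half-resident region for about
 * `seconds` and return the mean time per call in ns, or -1.0 on error.
 */
static double mincore_ns_per_call(double seconds)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = (REGION_SIZE + page_size - 1) / page_size;
	unsigned char *vec = malloc(npages);
	unsigned char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
				     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	unsigned long calls = 0;
	double elapsed = 0.0, result = -1.0;

	assert(seconds > 0.0);
	if (!vec || region == MAP_FAILED)
		goto out;

	/* Touch every other page so that roughly half become resident. */
	for (size_t off = 0; off < REGION_SIZE; off += 2 * page_size)
		region[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		for (int i = 0; i < BATCH; i++)
			if (mincore(region, REGION_SIZE, vec) != 0)
				goto out;
		calls += BATCH;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		elapsed = (double)(t1.tv_sec - t0.tv_sec) +
			  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	} while (elapsed < seconds);

	result = elapsed * 1e9 / (double)calls;
out:
	if (region != MAP_FAILED)
		munmap(region, REGION_SIZE);
	free(vec);
	return result;
}
```

Calling mincore_ns_per_call(10.0) twenty times on each kernel and comparing
the means would approximate the procedure the changelog describes.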

Comments

Matthew Wilcox Feb. 17, 2025, 5:58 p.m. UTC | #1
On Mon, Feb 17, 2025 at 05:09:34PM +0000, Colin Ian King wrote:
> Adding an unlikely() hint on the masked start comparison error
> return path improves run-time performance of the mincore system call.
> 
> Benchmarking on an i9-12900 shows an improvement of 7ns on mincore calls
> on a 256KB mmap'd region where 50% of the pages were resident.
> 
> Results based on running 20 tests with turbo disabled (to reduce
> clock freq turbo changes), with 10 second run per test and comparing
> the number of mincore calls per second. The % standard deviation of
> the 20 tests was ~0.10%, so results are reliable.

I think you've elided _just_ enough information here that nobody can
judge whether your stats skills are any good ;-)  You've told us 7ns
(per call, presumably) and you've told us 0.10% standard deviation,
but you haven't told us how long the syscall takes, so nobody can tell
whether 7ns is within 0.10% or not ;-)

> Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
> ---
>  mm/mincore.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mincore.c b/mm/mincore.c
> index d6bd19e520fc..832f29f46767 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -239,7 +239,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
>  	start = untagged_addr(start);
>  
>  	/* Check the start address: needs to be page-aligned.. */
> -	if (start & ~PAGE_MASK)
> +	if (unlikely(start & ~PAGE_MASK))
>  		return -EINVAL;

We might get even more advantage by moving the EINVAL test before
untagged_addr() since we know that the tags are all in the high bits and
we don't need to have the test be dependent on the previous arithmetic.
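
A small user-space sketch of why the alignment test gives the same answer
before and after untagging, assuming (as stated above) that tag bits live
only in the high bits of the address. demo_untag(), the top-byte tag layout
and the 4KiB page size are illustrative assumptions; this is not the
kernel's untagged_addr() implementation, which is architecture-specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only: 4KiB pages, tag in the top byte. */
#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))
#define DEMO_TAG_MASK  (0xffULL << 56)

/* The argument relies on tag bits and page-offset bits not overlapping. */
static_assert((DEMO_TAG_MASK & ~DEMO_PAGE_MASK) == 0,
	      "tag bits must not overlap the page-offset bits");

/* Hypothetical stand-in for untagged_addr(): clear the tag bits. */
static uint64_t demo_untag(uint64_t addr)
{
	return addr & ~DEMO_TAG_MASK;
}

/* Same test as in the patch: any low (page-offset) bit set? */
static bool misaligned(uint64_t addr)
{
	return (addr & ~DEMO_PAGE_MASK) != 0;
}

/* True when untagging does not change the outcome of the alignment test. */
static bool same_before_and_after_untag(uint64_t addr)
{
	return misaligned(addr) == misaligned(demo_untag(addr));
}
```

Under these assumptions the check only reads the low bits, so it could be
issued without waiting for the result of the untagging arithmetic.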
Colin Ian King Feb. 17, 2025, 6 p.m. UTC | #2
On 17/02/2025 17:58, Matthew Wilcox wrote:
> On Mon, Feb 17, 2025 at 05:09:34PM +0000, Colin Ian King wrote:
>> Adding an unlikely() hint on the masked start comparison error
>> return path improves run-time performance of the mincore system call.
>>
>> Benchmarking on an i9-12900 shows an improvement of 7ns on mincore calls
>> on a 256KB mmap'd region where 50% of the pages were resident.
>>
>> Results based on running 20 tests with turbo disabled (to reduce
>> clock freq turbo changes), with 10 second run per test and comparing
>> the number of mincore calls per second. The % standard deviation of
>> the 20 tests was ~0.10%, so results are reliable.
> 
> I think you've elided _just_ enough information here that nobody can
> judge whether your stats skills are any good ;-)  You've told us 7ns
> (per call, presumably) and you've told us 0.10% standard deviation,
> but you haven't told us how long the syscall takes, so nobody can tell
> whether 7ns is within 0.10% or not ;-)

Ugh, my bad.

Improvement was from ~970 down to 963 ns, so small ~0.7% improvement.

Colin
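
The arithmetic behind those figures can be checked with a short sketch. It
assumes the ~0.10% standard deviation quoted for calls per second applies
equally to time per call, which holds to a good approximation for such a
small spread. The function names are hypothetical.

```c
#include <assert.h>

/* Relative improvement, in percent, going from before_ns to after_ns. */
static double percent_improvement(double before_ns, double after_ns)
{
	assert(before_ns > 0.0);
	return (before_ns - after_ns) / before_ns * 100.0;
}

/* How many per-run standard deviations the improvement spans. */
static double in_units_of_stddev(double improvement_pct, double stddev_pct)
{
	assert(stddev_pct > 0.0);
	return improvement_pct / stddev_pct;
}
```

With the numbers given, percent_improvement(970.0, 963.0) is about 0.72, so
the 7 ns change is roughly seven times the 0.10% per-run standard deviation.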

> 
>> Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
>> ---
>>   mm/mincore.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index d6bd19e520fc..832f29f46767 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -239,7 +239,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
>>   	start = untagged_addr(start);
>>   
>>   	/* Check the start address: needs to be page-aligned.. */
>> -	if (start & ~PAGE_MASK)
>> +	if (unlikely(start & ~PAGE_MASK))
>>   		return -EINVAL;
> 
> We might get even more advantage by moving the EINVAL test before
> untagged_addr() since we know that the tags are all in the high bits and
> we don't need to have the test be dependent on the previous arithmetic.
Andrew Morton Feb. 18, 2025, 3:13 a.m. UTC | #3
On Mon, 17 Feb 2025 18:00:22 +0000 "Colin King (gmail)" <colin.i.king@gmail.com> wrote:

> On 17/02/2025 17:58, Matthew Wilcox wrote:
> > On Mon, Feb 17, 2025 at 05:09:34PM +0000, Colin Ian King wrote:
> >> Adding an unlikely() hint on the masked start comparison error
> >> return path improves run-time performance of the mincore system call.
> >>
> >> Benchmarking on an i9-12900 shows an improvement of 7ns on mincore calls
> >> on a 256KB mmap'd region where 50% of the pages were resident.
> >>
> >> Results based on running 20 tests with turbo disabled (to reduce
> >> clock freq turbo changes), with 10 second run per test and comparing
> >> the number of mincore calls per second. The % standard deviation of
> >> the 20 tests was ~0.10%, so results are reliable.
> > 
> > I think you've elided _just_ enough information here that nobody can
> > judge whether your stats skills are any good ;-)  You've told us 7ns
> > (per call, presumably) and you've told us 0.10% standard deviation,
> > but you haven't told us how long the syscall takes, so nobody can tell
> > whether 7ns is within 0.10% or not ;-)
> 
> Ugh, my bad.
> 
> Improvement was from ~970 down to 963 ns, so small ~0.7% improvement.
> 

It actually doesn't change the generated code:

hp2:/usr/src/25> diff -u mm/mincore.lst.old mm/mincore.lst 
--- mm/mincore.lst.old	2025-02-17 19:11:34.093727411 -0800
+++ mm/mincore.lst	2025-02-17 19:12:59.797009056 -0800
@@ -1563,7 +1563,7 @@
 	start = untagged_addr(start);
 
 	/* Check the start address: needs to be page-aligned.. */
-	if (start & ~PAGE_MASK)
+	if (unlikely(start & ~PAGE_MASK))
      b27:	31 ff                	xor    %edi,%edi
 	asm (ALTERNATIVE("",
      b29:	90                   	nop
Colin Ian King Feb. 18, 2025, 2:16 p.m. UTC | #4
On 18/02/2025 03:13, Andrew Morton wrote:
> On Mon, 17 Feb 2025 18:00:22 +0000 "Colin King (gmail)" <colin.i.king@gmail.com> wrote:
> 
>> On 17/02/2025 17:58, Matthew Wilcox wrote:
>>> On Mon, Feb 17, 2025 at 05:09:34PM +0000, Colin Ian King wrote:
>>>> Adding an unlikely() hint on the masked start comparison error
>>>> return path improves run-time performance of the mincore system call.
>>>>
>>>> Benchmarking on an i9-12900 shows an improvement of 7ns on mincore calls
>>>> on a 256KB mmap'd region where 50% of the pages were resident.
>>>>
>>>> Results based on running 20 tests with turbo disabled (to reduce
>>>> clock freq turbo changes), with 10 second run per test and comparing
>>>> the number of mincore calls per second. The % standard deviation of
>>>> the 20 tests was ~0.10%, so results are reliable.
>>>
>>> I think you've elided _just_ enough information here that nobody can
>>> judge whether your stats skills are any good ;-)  You've told us 7ns
>>> (per call, presumably) and you've told us 0.10% standard deviation,
>>> but you haven't told us how long the syscall takes, so nobody can tell
>>> whether 7ns is within 0.10% or not ;-)
>>
>> Ugh, my bad.
>>
>> Improvement was from ~970 down to 963 ns, so small ~0.7% improvement.
>>
> 
> It actually doesn't change the generated code:

I've compared the generated x86 object code using gcc 14.2.1 20240912
(Fedora 41), 14.2.0 (Debian 14.2.0-17) and 14.2.1 20250211 (Clear Linux),
and I get differences in the generated object code comparing old and
new. The improvement on Clear Linux is more significant too, because
it uses -O3. So I'm confident the change is generating improved object code.


> 
> hp2:/usr/src/25> diff -u mm/mincore.lst.old mm/mincore.lst
> --- mm/mincore.lst.old	2025-02-17 19:11:34.093727411 -0800
> +++ mm/mincore.lst	2025-02-17 19:12:59.797009056 -0800
> @@ -1563,7 +1563,7 @@
>   	start = untagged_addr(start);
>   
>   	/* Check the start address: needs to be page-aligned.. */
> -	if (start & ~PAGE_MASK)
> +	if (unlikely(start & ~PAGE_MASK))
>        b27:	31 ff                	xor    %edi,%edi
>   	asm (ALTERNATIVE("",
>        b29:	90                   	nop
Andrew Morton Feb. 19, 2025, 12:08 a.m. UTC | #5
On Tue, 18 Feb 2025 14:16:20 +0000 "Colin King (gmail)" <colin.i.king@gmail.com> wrote:

> >> Improvement was from ~970 down to 963 ns, so small ~0.7% improvement.
> >>
> > 
> > It actually doesn't change the generated code:
> 
> I've compared the generated x86 object code using gcc 14.2.1 20240912
> (Fedora 41), 14.2.0 (Debian 14.2.0-17) and 14.2.1 20250211 (Clear Linux),
> and I get differences in the generated object code comparing old and
> new. The improvement on Clear Linux is more significant too, because
> it uses -O3. So I'm confident the change is generating improved object code.

I was using gcc-13.2.0.

Please resend, with a Matthew-friendly changelog?

Patch

diff --git a/mm/mincore.c b/mm/mincore.c
index d6bd19e520fc..832f29f46767 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -239,7 +239,7 @@  SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	start = untagged_addr(start);
 
 	/* Check the start address: needs to be page-aligned.. */
-	if (start & ~PAGE_MASK)
+	if (unlikely(start & ~PAGE_MASK))
 		return -EINVAL;
 
 	/* ..and we need to be passed a valid user-space range */
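
For readers unfamiliar with the hint, here is a user-space sketch of what
the changed line asks of the compiler. In the kernel, unlikely(x) normally
expands to __builtin_expect(!!(x), 0). The macros below are stand-ins for
that definition, and check_start_aligned() and the fixed 4KiB page size are
hypothetical, chosen only to mirror the check in the hunk above.

```c
#include <assert.h>
#include <errno.h>

/* User-space stand-ins for the kernel's likely()/unlikely() macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Illustrative page size; the kernel derives PAGE_MASK from PAGE_SHIFT. */
#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

static_assert((DEMO_PAGE_SIZE & (DEMO_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/* Hypothetical helper modelled on the start-address check in the patch. */
static int check_start_aligned(unsigned long start)
{
	/* Tell the compiler the error path is expected to be rare. */
	if (unlikely(start & ~DEMO_PAGE_MASK))
		return -EINVAL;
	return 0;
}
```

The hint changes only how the compiler may lay out and schedule the
branches, not the result of the test, which is consistent with some
compiler versions emitting identical code with and without it.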