
arm64/lib: copy_page: s/stnp/stp

Message ID 20240613001812.2141-1-jszhang@kernel.org (mailing list archive)
State New, archived
Series arm64/lib: copy_page: s/stnp/stp

Commit Message

Jisheng Zhang June 13, 2024, 12:18 a.m. UTC
stnp performs a non-temporal store, giving the memory system a hint
that caching is not useful for this data. But the scenarios where
copy_page() is used may not carry this implication, although I must
admit there are cases where stnp helps performance. In those cases we
can rely on the hardware write-streaming mechanism in some
implementations, such as Cortex-A55, to detect the pattern and act
accordingly.

Testing with https://github.com/apinski-cavium/copy_page_benchmark
shows this patch reduces the time by about 3% on Cortex-A55 platforms.

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
 arch/arm64/lib/copy_page.S | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Comments

Catalin Marinas June 24, 2024, 5:56 p.m. UTC | #1
On Thu, Jun 13, 2024 at 08:18:12AM +0800, Jisheng Zhang wrote:
> stnp performs a non-temporal store, giving the memory system a hint
> that caching is not useful for this data. But the scenarios where
> copy_page() is used may not carry this implication, although I must
> admit there are cases where stnp helps performance. In those cases we
> can rely on the hardware write-streaming mechanism in some
> implementations, such as Cortex-A55, to detect the pattern and act
> accordingly.
> 
> Testing with https://github.com/apinski-cavium/copy_page_benchmark
> shows this patch reduces the time by about 3% on Cortex-A55 platforms.

What about other CPUs? I'm also not convinced by such microbenchmarks.
It looks like it always copies to the same page, so the stp may even
benefit from some caching of the data which we wouldn't need in a real
scenario.

So, I'm not merging this unless it's backed by some solid data across
several CPU implementations.
Jisheng Zhang June 26, 2024, 11:50 a.m. UTC | #2
On Mon, Jun 24, 2024 at 06:56:33PM +0100, Catalin Marinas wrote:
> On Thu, Jun 13, 2024 at 08:18:12AM +0800, Jisheng Zhang wrote:
> > stnp performs a non-temporal store, giving the memory system a hint
> > that caching is not useful for this data. But the scenarios where
> > copy_page() is used may not carry this implication, although I must
> > admit there are cases where stnp helps performance. In those cases we
> > can rely on the hardware write-streaming mechanism in some
> > implementations, such as Cortex-A55, to detect the pattern and act
> > accordingly.
> > 
> > Testing with https://github.com/apinski-cavium/copy_page_benchmark
> > shows this patch reduces the time by about 3% on Cortex-A55 platforms.
> 
> What about other CPUs? I'm also not convinced by such microbenchmarks.

Per my tests on Cortex-A53 and Cortex-A73, the A73 shows a similar
improvement. On the A53 there is no difference, possibly due to the
following commit in the ATF:

54035fc4672aa ("Disable non-temporal hint on Cortex-A53/57")

> It looks like it always copies to the same page, so the stp may even
> benefit from some caching of the data which we wouldn't need in a real
> scenario.

Yep, that is also my understanding of where the improvement comes from.
And I must admit there are cases where stnp helps performance; we can
rely on the hardware write-streaming mechanism to detect them and act
accordingly.

However, sometimes the caching behaviour benefits the scenario, and in
that case the stnp here loses twice.

What do you think?

> 
> So, I'm not merging this unless it's backed by some solid data across
> several CPU implementations.
> 
> -- 
> Catalin
Catalin Marinas June 26, 2024, 6:08 p.m. UTC | #3
On Wed, Jun 26, 2024 at 07:50:57PM +0800, Jisheng Zhang wrote:
> On Mon, Jun 24, 2024 at 06:56:33PM +0100, Catalin Marinas wrote:
> > On Thu, Jun 13, 2024 at 08:18:12AM +0800, Jisheng Zhang wrote:
> > > stnp performs a non-temporal store, giving the memory system a hint
> > > that caching is not useful for this data. But the scenarios where
> > > copy_page() is used may not carry this implication, although I must
> > > admit there are cases where stnp helps performance. In those cases we
> > > can rely on the hardware write-streaming mechanism in some
> > > implementations, such as Cortex-A55, to detect the pattern and act
> > > accordingly.
> > > 
> > > Testing with https://github.com/apinski-cavium/copy_page_benchmark
> > > shows this patch reduces the time by about 3% on Cortex-A55 platforms.
[...]
> > It looks like it always copies to the same page, so the stp may even
> > benefit from some caching of the data which we wouldn't need in a real
> > scenario.
> 
> Yep, that is also my understanding of where the improvement comes from.
> And I must admit there are cases where stnp helps performance; we can
> rely on the hardware write-streaming mechanism to detect them and act
> accordingly.

Well, is that case realistic? Can you show an improvement with some
real-world uses? Most likely modern CPUs fall back to non-temporal
stores after a series of STPs, but it depends on how soon they do it
and how much cache gets polluted. OTOH, page copying could be the
result of a CoW, and we'd expect subsequent accesses from the user
where some caching may be beneficial.

So, it's hard to tell, but we shouldn't make a decision based on a
microbenchmark that writes over the same page multiple times. If you
have some real-world tests that exercise this path (e.g. CoW, Android
app startup) and they show an improvement, I'd be in favour of this.
Otherwise, no.

Thanks.
Patch

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 6a56d7cf309d..4c74fe2d8bd6 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -32,21 +32,21 @@  SYM_FUNC_START(__pi_copy_page)
 1:
 	tst	x0, #(PAGE_SIZE - 1)
 
-	stnp	x2, x3, [x0, #-256]
+	stp	x2, x3, [x0, #-256]
 	ldp	x2, x3, [x1]
-	stnp	x4, x5, [x0, #16 - 256]
+	stp	x4, x5, [x0, #16 - 256]
 	ldp	x4, x5, [x1, #16]
-	stnp	x6, x7, [x0, #32 - 256]
+	stp	x6, x7, [x0, #32 - 256]
 	ldp	x6, x7, [x1, #32]
-	stnp	x8, x9, [x0, #48 - 256]
+	stp	x8, x9, [x0, #48 - 256]
 	ldp	x8, x9, [x1, #48]
-	stnp	x10, x11, [x0, #64 - 256]
+	stp	x10, x11, [x0, #64 - 256]
 	ldp	x10, x11, [x1, #64]
-	stnp	x12, x13, [x0, #80 - 256]
+	stp	x12, x13, [x0, #80 - 256]
 	ldp	x12, x13, [x1, #80]
-	stnp	x14, x15, [x0, #96 - 256]
+	stp	x14, x15, [x0, #96 - 256]
 	ldp	x14, x15, [x1, #96]
-	stnp	x16, x17, [x0, #112 - 256]
+	stp	x16, x17, [x0, #112 - 256]
 	ldp	x16, x17, [x1, #112]
 
 	add	x0, x0, #128
@@ -54,14 +54,14 @@  SYM_FUNC_START(__pi_copy_page)
 
 	b.ne	1b
 
-	stnp	x2, x3, [x0, #-256]
-	stnp	x4, x5, [x0, #16 - 256]
-	stnp	x6, x7, [x0, #32 - 256]
-	stnp	x8, x9, [x0, #48 - 256]
-	stnp	x10, x11, [x0, #64 - 256]
-	stnp	x12, x13, [x0, #80 - 256]
-	stnp	x14, x15, [x0, #96 - 256]
-	stnp	x16, x17, [x0, #112 - 256]
+	stp	x2, x3, [x0, #-256]
+	stp	x4, x5, [x0, #16 - 256]
+	stp	x6, x7, [x0, #32 - 256]
+	stp	x8, x9, [x0, #48 - 256]
+	stp	x10, x11, [x0, #64 - 256]
+	stp	x12, x13, [x0, #80 - 256]
+	stp	x14, x15, [x0, #96 - 256]
+	stp	x16, x17, [x0, #112 - 256]
 
 	ret
 SYM_FUNC_END(__pi_copy_page)