Message ID | 20211108071149.823570-2-reijiw@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | arm64: DC {ZVA,GVA,GZVA} shouldn't be used when DCZID_EL0.DZP == 1 | expand |
On 2021-11-08 07:11, Reiji Watanabe wrote: > Currently, clear_page() uses DC ZVA instruction unconditionally. But it > should make sure that DCZID_EL0.DZP, which indicates whether or not use > of DC ZVA instruction is prohibited, is zero when using the instruction. > Use STP as memset does instead when DCZID_EL0.DZP == 1. Reviewed-by: Robin Murphy <robin.murphy@arm.com> FWIW I did eventually figure out the "pre-bias" trick from v1 thanks to Mark's nod toward the original context, but a quick survey of various optimisation guides implied that the explicit add should generally be preferred over post-index writeback anyway, so I think we're all good here. Cheers, Robin. > Fixes: f27bb139c387 ("arm64: Miscellaneous library functions") > Signed-off-by: Reiji Watanabe <reijiw@google.com> > --- > arch/arm64/lib/clear_page.S | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S > index b84b179edba3..ea3a6c927fe8 100644 > --- a/arch/arm64/lib/clear_page.S > +++ b/arch/arm64/lib/clear_page.S > @@ -16,6 +16,7 @@ > */ > SYM_FUNC_START_PI(clear_page) > mrs x1, dczid_el0 > + tbnz x1, #4, 2f /* Branch if DC ZVA is prohibited */ > and w1, w1, #0xf > mov x2, #4 > lsl x1, x2, x1 > @@ -25,5 +26,14 @@ SYM_FUNC_START_PI(clear_page) > tst x0, #(PAGE_SIZE - 1) > b.ne 1b > ret > + > +2: stp xzr, xzr, [x0] > + stp xzr, xzr, [x0, #16] > + stp xzr, xzr, [x0, #32] > + stp xzr, xzr, [x0, #48] > + add x0, x0, #64 > + tst x0, #(PAGE_SIZE - 1) > + b.ne 2b > + ret > SYM_FUNC_END_PI(clear_page) > EXPORT_SYMBOL(clear_page) >
On Tue, Nov 16, 2021 at 3:00 PM Robin Murphy <robin.murphy@arm.com> wrote: > > On 2021-11-08 07:11, Reiji Watanabe wrote: > > Currently, clear_page() uses DC ZVA instruction unconditionally. But it > > should make sure that DCZID_EL0.DZP, which indicates whether or not use > > of DC ZVA instruction is prohibited, is zero when using the instruction. > > Use STP as memset does instead when DCZID_EL0.DZP == 1. > > Reviewed-by: Robin Murphy <robin.murphy@arm.com> > > FWIW I did eventually figure out the "pre-bias" trick from v1 thanks to > Mark's nod toward the original context, but a quick survey of various > optimisation guides implied that the explicit add should generally be > preferred over post-index writeback anyway, so I think we're all good here. Thank you for the review! The original code, which used *pre*-index (not post-index) addressing, made no significant difference in page_clear performance on my test environment from the current code. Now, I am looking at creating v3 patches to use stnp instead of stp in page_clear (NOTE: DC ZVA shows much better performance on my test system than stp/stnp). Although using stnp didn't show significant difference in clear_page() performance on my test system from stp (no significant difference in cache-misses, cache_refill, cache_wb, or cache_allocate event counter either), using stnp should be more appropriate for page_clear than stp, and I understand it could show better performance on some CPUs. Thanks, Reiji > > Cheers, > Robin. > > > Fixes: f27bb139c387 ("arm64: Miscellaneous library functions") > > Signed-off-by: Reiji Watanabe <reijiw@google.com> > > --- > > arch/arm64/lib/clear_page.S | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S > > index b84b179edba3..ea3a6c927fe8 100644 > > --- a/arch/arm64/lib/clear_page.S > > +++ b/arch/arm64/lib/clear_page.S > > @@ -16,6 +16,7 @@ > > */ > > SYM_FUNC_START_PI(clear_page) > > mrs x1, dczid_el0 > > + tbnz x1, #4, 2f /* Branch if DC ZVA is prohibited */ > > and w1, w1, #0xf > > mov x2, #4 > > lsl x1, x2, x1 > > @@ -25,5 +26,14 @@ SYM_FUNC_START_PI(clear_page) > > tst x0, #(PAGE_SIZE - 1) > > b.ne 1b > > ret > > + > > +2: stp xzr, xzr, [x0] > > + stp xzr, xzr, [x0, #16] > > + stp xzr, xzr, [x0, #32] > > + stp xzr, xzr, [x0, #48] > > + add x0, x0, #64 > > + tst x0, #(PAGE_SIZE - 1) > > + b.ne 2b > > + ret > > SYM_FUNC_END_PI(clear_page) > > EXPORT_SYMBOL(clear_page) > >
On 2021-11-18 08:18, Reiji Watanabe wrote: > On Tue, Nov 16, 2021 at 3:00 PM Robin Murphy <robin.murphy@arm.com> wrote: >> >> On 2021-11-08 07:11, Reiji Watanabe wrote: >>> Currently, clear_page() uses DC ZVA instruction unconditionally. But it >>> should make sure that DCZID_EL0.DZP, which indicates whether or not use >>> of DC ZVA instruction is prohibited, is zero when using the instruction. >>> Use STP as memset does instead when DCZID_EL0.DZP == 1. >> >> Reviewed-by: Robin Murphy <robin.murphy@arm.com> >> >> FWIW I did eventually figure out the "pre-bias" trick from v1 thanks to >> Mark's nod toward the original context, but a quick survey of various >> optimisation guides implied that the explicit add should generally be >> preferred over post-index writeback anyway, so I think we're all good here. > > Thank you for the review! > The original code, which used *pre*-index (not post-index) addressing, Oops, in the context I think I meant writeback in general anyway. This is what happens when a sudden urge to review random patches at 11PM strikes :) > made no significant difference in page_clear performance on my test > environment from the current code. > > Now, I am looking at creating v3 patches to use stnp instead of stp > in page_clear (NOTE: DC ZVA shows much better performance on my test > system than stp/stnp). > > Although using stnp didn't show significant difference in clear_page() > performance on my test system from stp (no significant difference in > cache-misses, cache_refill, cache_wb, or cache_allocate event counter > either), using stnp should be more appropriate for page_clear than stp, > and I understand it could show better performance on some CPUs. Indeed - certainly most Arm Ltd. cores tend to be good at spotting the store pattern and switching into write-streaming mode automatically - but semantically, STNP probably is appropriate for the great majority of clear_page() usage. Feel free to keep my review tag with that change. Thanks, Robin.
diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S index b84b179edba3..ea3a6c927fe8 100644 --- a/arch/arm64/lib/clear_page.S +++ b/arch/arm64/lib/clear_page.S @@ -16,6 +16,7 @@ */ SYM_FUNC_START_PI(clear_page) mrs x1, dczid_el0 + tbnz x1, #4, 2f /* Branch if DC ZVA is prohibited */ and w1, w1, #0xf mov x2, #4 lsl x1, x2, x1 @@ -25,5 +26,14 @@ SYM_FUNC_START_PI(clear_page) tst x0, #(PAGE_SIZE - 1) b.ne 1b ret + +2: stp xzr, xzr, [x0] + stp xzr, xzr, [x0, #16] + stp xzr, xzr, [x0, #32] + stp xzr, xzr, [x0, #48] + add x0, x0, #64 + tst x0, #(PAGE_SIZE - 1) + b.ne 2b + ret SYM_FUNC_END_PI(clear_page) EXPORT_SYMBOL(clear_page)
Currently, clear_page() uses DC ZVA instruction unconditionally. But it should make sure that DCZID_EL0.DZP, which indicates whether or not use of DC ZVA instruction is prohibited, is zero when using the instruction. Use STP as memset does instead when DCZID_EL0.DZP == 1. Fixes: f27bb139c387 ("arm64: Miscellaneous library functions") Signed-off-by: Reiji Watanabe <reijiw@google.com> --- arch/arm64/lib/clear_page.S | 10 ++++++++++ 1 file changed, 10 insertions(+)