| Message ID | 20210323073432.3422227-3-yangyingliang@huawei.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | arm64: lib: improve copy performance |
On 2021-03-23 07:34, Yang Yingliang wrote:
> When copy over 128 bytes, src/dst is added after each ldp/stp instruction, it will cost more time. To improve this, we only add src/dst after load or store 64 bytes.

This breaks the required behaviour for copy_*_user(), since the fault handler expects the base address to be up-to-date at all times. Say you're copying 128 bytes and fault on the 4th store, it should return 80 bytes not copied; the code below would return 128 bytes not copied, even though 48 bytes have actually been written to the destination.

We've had a couple of tries at updating this code (because the whole template is frankly a bit terrible, and a long way from the well-optimised code it was derived from), but getting the fault-handling behaviour right without making the handler itself ludicrously complex has proven tricky. And then it got bumped down the priority list while the uaccess behaviour in general was in flux - now that the dust has largely settled on that I should probably try to find time to pick this up again...

Robin.

> Copy 4096 bytes cost on Kunpeng920 (ms):
>
> Without this patch:
> memcpy: 143.85 copy_from_user: 172.69 copy_to_user: 199.23
>
> With this patch:
> memcpy: 107.12 copy_from_user: 157.50 copy_to_user: 198.85
>
> It's about 25% improvement in memcpy().
>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>  arch/arm64/lib/copy_template.S | 36 +++++++++++++++++++---------------
>  1 file changed, 20 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S
> index 488df234c49a..c3cd6f84c9c0 100644
> --- a/arch/arm64/lib/copy_template.S
> +++ b/arch/arm64/lib/copy_template.S
> @@ -152,29 +152,33 @@ D_h .req x14
>   .p2align L1_CACHE_SHIFT
>  .Lcpy_body_large:
>   /* pre-get 64 bytes data. */
> - ldp1 A_l, A_h, src, #16
> - ldp1 B_l, B_h, src, #16
> - ldp1 C_l, C_h, src, #16
> - ldp1 D_l, D_h, src, #16
> + ldp2 A_l, A_h, src, #0, #8
> + ldp2 B_l, B_h, src, #16, #24
> + ldp2 C_l, C_h, src, #32, #40
> + ldp2 D_l, D_h, src, #48, #56
> + add src, src, #64
>  1:
>   /*
>    * interlace the load of next 64 bytes data block with store of the last
>    * loaded 64 bytes data.
>    */
> - stp1 A_l, A_h, dst, #16
> - ldp1 A_l, A_h, src, #16
> - stp1 B_l, B_h, dst, #16
> - ldp1 B_l, B_h, src, #16
> - stp1 C_l, C_h, dst, #16
> - ldp1 C_l, C_h, src, #16
> - stp1 D_l, D_h, dst, #16
> - ldp1 D_l, D_h, src, #16
> + stp2 A_l, A_h, dst, #0, #8
> + ldp2 A_l, A_h, src, #0, #8
> + stp2 B_l, B_h, dst, #16, #24
> + ldp2 B_l, B_h, src, #16, #24
> + stp2 C_l, C_h, dst, #32, #40
> + ldp2 C_l, C_h, src, #32, #40
> + stp2 D_l, D_h, dst, #48, #56
> + ldp2 D_l, D_h, src, #48, #56
> + add src, src, #64
> + add dst, dst, #64
>   subs count, count, #64
>   b.ge 1b
> - stp1 A_l, A_h, dst, #16
> - stp1 B_l, B_h, dst, #16
> - stp1 C_l, C_h, dst, #16
> - stp1 D_l, D_h, dst, #16
> + stp2 A_l, A_h, dst, #0, #8
> + stp2 B_l, B_h, dst, #16, #24
> + stp2 C_l, C_h, dst, #32, #40
> + stp2 D_l, D_h, dst, #48, #56
> + add dst, dst, #64
>
>   tst count, #0x3f
>   b.ne .Ltail63
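As an editorial illustration of the fixup accounting described above (not from the thread, and not the kernel's actual fixup code; the function and parameter names are invented), the following toy model reproduces the 80-vs-128 "bytes not copied" figures for a 128-byte copy that faults on the 4th 16-byte store, depending on how often the destination base register is updated:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of copy_*_user() fixup accounting: the fault handler can only
 * credit bytes that the destination base register has already advanced
 * past, so the reported "not copied" value is the total minus however far
 * the base had moved before the faulting store. Illustrative only.
 */
size_t not_copied(size_t total, unsigned int faulting_store,
		  size_t bytes_per_store, size_t base_update_step)
{
	size_t written_before_fault = (size_t)(faulting_store - 1) * bytes_per_store;
	size_t base_advanced = (written_before_fault / base_update_step) * base_update_step;

	return total - base_advanced;
}

int main(void)
{
	/* Base updated after every 16-byte stp: handler reports 80 bytes not copied. */
	assert(not_copied(128, 4, 16, 16) == 80);
	/* Base updated only every 64 bytes: handler reports 128, although 48 were written. */
	assert(not_copied(128, 4, 16, 64) == 128);
	return 0;
}
```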
On Tue, Mar 23, 2021 at 12:08:56PM +0000, Robin Murphy wrote:
> On 2021-03-23 07:34, Yang Yingliang wrote:
> > When copy over 128 bytes, src/dst is added after each ldp/stp instruction, it will cost more time. To improve this, we only add src/dst after load or store 64 bytes.
>
> This breaks the required behaviour for copy_*_user(), since the fault handler expects the base address to be up-to-date at all times. Say you're copying 128 bytes and fault on the 4th store, it should return 80 bytes not copied; the code below would return 128 bytes not copied, even though 48 bytes have actually been written to the destination.
>
> We've had a couple of tries at updating this code (because the whole template is frankly a bit terrible, and a long way from the well-optimised code it was derived from), but getting the fault-handling behaviour right without making the handler itself ludicrously complex has proven tricky. And then it got bumped down the priority list while the uaccess behaviour in general was in flux - now that the dust has largely settled on that I should probably try to find time to pick this up again...

I think the v5 from Oli was pretty close, but it didn't get any review:

https://lore.kernel.org/r/20200914151800.2270-1-oli.swede@arm.com

he also included tests:

https://lore.kernel.org/r/20200916104636.19172-1-oli.swede@arm.com

It would be great if you or somebody else has time to revive those!

Will
On 2021-03-23 13:32, Will Deacon wrote:
> On Tue, Mar 23, 2021 at 12:08:56PM +0000, Robin Murphy wrote:
>> On 2021-03-23 07:34, Yang Yingliang wrote:
>>> When copy over 128 bytes, src/dst is added after each ldp/stp instruction, it will cost more time. To improve this, we only add src/dst after load or store 64 bytes.
>>
>> This breaks the required behaviour for copy_*_user(), since the fault handler expects the base address to be up-to-date at all times. Say you're copying 128 bytes and fault on the 4th store, it should return 80 bytes not copied; the code below would return 128 bytes not copied, even though 48 bytes have actually been written to the destination.
>>
>> We've had a couple of tries at updating this code (because the whole template is frankly a bit terrible, and a long way from the well-optimised code it was derived from), but getting the fault-handling behaviour right without making the handler itself ludicrously complex has proven tricky. And then it got bumped down the priority list while the uaccess behaviour in general was in flux - now that the dust has largely settled on that I should probably try to find time to pick this up again...
>
> I think the v5 from Oli was pretty close, but it didn't get any review:
>
> https://lore.kernel.org/r/20200914151800.2270-1-oli.swede@arm.com
>
> he also included tests:
>
> https://lore.kernel.org/r/20200916104636.19172-1-oli.swede@arm.com
>
> It would be great if you or somebody else has time to revive those!

Yeah, we still have a ticket open for it. Since the uaccess overhaul has pretty much killed off any remaining value in the template idea, I might have a quick go at spinning a series to just update memcpy() and the other library routines to their shiny new versions, then come back and work on some dedicated usercopy routines built around LDTR/STTR (and the desired fault behaviour) as a follow-up.

(I was also holding out hope for copy_in_user() to disappear if we wait long enough, but apparently not yet...)

Robin.
On Tue, Mar 23, 2021 at 01:32:18PM +0000, Will Deacon wrote:
> On Tue, Mar 23, 2021 at 12:08:56PM +0000, Robin Murphy wrote:
> > On 2021-03-23 07:34, Yang Yingliang wrote:
> > > When copy over 128 bytes, src/dst is added after each ldp/stp instruction, it will cost more time. To improve this, we only add src/dst after load or store 64 bytes.
> >
> > This breaks the required behaviour for copy_*_user(), since the fault handler expects the base address to be up-to-date at all times. Say you're copying 128 bytes and fault on the 4th store, it should return 80 bytes not copied; the code below would return 128 bytes not copied, even though 48 bytes have actually been written to the destination.
> >
> > We've had a couple of tries at updating this code (because the whole template is frankly a bit terrible, and a long way from the well-optimised code it was derived from), but getting the fault-handling behaviour right without making the handler itself ludicrously complex has proven tricky. And then it got bumped down the priority list while the uaccess behaviour in general was in flux - now that the dust has largely settled on that I should probably try to find time to pick this up again...
>
> I think the v5 from Oli was pretty close, but it didn't get any review:
>
> https://lore.kernel.org/r/20200914151800.2270-1-oli.swede@arm.com

These are still unread in my inbox as I was planning to look at them again. However, I think we discussed a few options on how to proceed but no concrete plans:

1. Merge Oli's patches as they are, with some potential complexity issues as fixing the user copy accuracy was non-trivial. I think the latest version uses a two-stage approach: when taking a fault, it falls back to byte-by-byte with the expectation that it faults again and we can then report the correct fault address.

2. Only use Cortex Strings for in-kernel memcpy() while the uaccess routines are some simple loops that align the uaccess part only (unlike Cortex Strings, which usually aligns the source).

3. Similar to 2 but with Cortex Strings imported automatically with some script to make it easier to keep the routines up to date.

If having non-optimal (but good enough) uaccess routines is acceptable, I'd go for (2) with a plan to move to (3) at the next Cortex Strings update. I also need to look again at option (1) to see how complex it is, but given the time one spends on importing a new Cortex Strings library, I don't think (1) scales well in the long term. We could, however, go for (1) now and look at (3) with the next update.
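As an editorial sketch of the two-stage idea in option (1) above (this is not Oli's actual patch; the names and the way the fault boundary is passed in are invented for illustration), the fallback path recopies the remainder of a faulting block one byte at a time so that the second fault pinpoints the exact boundary:

```c
#include <stddef.h>

/*
 * Sketch of "retry byte-by-byte after a fault". 'first_bad' stands in for
 * the first inaccessible source byte, which real fixup code would discover
 * by actually faulting again; nothing here is kernel API.
 */
size_t bytewise_retry(char *dst, const char *src, size_t remaining,
		      const char *first_bad)
{
	size_t done = 0;

	while (done < remaining && src + done != first_bad) {
		dst[done] = src[done];	/* single-byte steps: the fault lands exactly */
		done++;
	}
	return remaining - done;	/* exact bytes-not-copied to report */
}
```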
From: Robin Murphy
> Sent: 23 March 2021 12:09
>
> On 2021-03-23 07:34, Yang Yingliang wrote:
> > When copy over 128 bytes, src/dst is added after each ldp/stp instruction, it will cost more time. To improve this, we only add src/dst after load or store 64 bytes.
>
> This breaks the required behaviour for copy_*_user(), since the fault handler expects the base address to be up-to-date at all times. Say you're copying 128 bytes and fault on the 4th store, it should return 80 bytes not copied; the code below would return 128 bytes not copied, even though 48 bytes have actually been written to the destination.

Are there any non-superscalar arm64 CPUs (that anyone cares about)?

If the CPU can execute multiple instructions in one clock then it is usually possible to get the loop control (almost) free.

You might need to unroll once to interleave read/write but any more may be pointless. So something like:

	a = *src++;
	do {
		b = *src++;
		*dst++ = a;
		a = *src++;
		*dst++ = b;
	} while (src != lim);
	*dst++ = b;

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
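For reference, here is an editorial, compilable C rendering of the software-pipelined loop sketched above. It is not a drop-in replacement: it assumes an odd word count of at least three, and the tail store flushes the last word loaded inside the loop rather than repeating the previous one.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Software-pipelined word copy in the spirit of the sketch above.
 * Assumes 'words' is odd and >= 3, and that both buffers are valid for
 * 'words' uint64_t accesses; a real routine would need prologue/epilogue
 * handling for other sizes and for misalignment.
 */
void copy_words_pipelined(uint64_t *dst, const uint64_t *src, size_t words)
{
	const uint64_t *lim = src + words;
	uint64_t a, b;

	a = *src++;		/* prime the pipeline */
	do {
		b = *src++;	/* next load overlaps the previous store */
		*dst++ = a;
		a = *src++;
		*dst++ = b;
	} while (src != lim);
	*dst++ = a;		/* flush the last word loaded in the loop */
}
```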
On 2021-03-24 16:38, David Laight wrote:
> From: Robin Murphy
>> Sent: 23 March 2021 12:09
>>
>> On 2021-03-23 07:34, Yang Yingliang wrote:
>>> When copy over 128 bytes, src/dst is added after each ldp/stp instruction, it will cost more time. To improve this, we only add src/dst after load or store 64 bytes.
>>
>> This breaks the required behaviour for copy_*_user(), since the fault handler expects the base address to be up-to-date at all times. Say you're copying 128 bytes and fault on the 4th store, it should return 80 bytes not copied; the code below would return 128 bytes not copied, even though 48 bytes have actually been written to the destination.
>
> Are there any non-superscalar arm64 CPUs (that anyone cares about)?
>
> If the CPU can execute multiple instructions in one clock then it is usually possible to get the loop control (almost) free.
>
> You might need to unroll once to interleave read/write but any more may be pointless.

Nah, the whole point is that using post-increment addressing is crap in the first place because it introduces register dependencies between each access that could be avoided entirely if we could use offset addressing (and especially crap when we don't even *have* a post-index addressing mode for the unprivileged load/store instructions used in copy_*_user() and have to simulate it with extra instructions that throw off the code alignment).

We already have code that's tuned to work well across our microarchitectures[1]; the issue is that butchering it to satisfy the additional requirements of copy_*_user() with a common template has hobbled regular memcpy() performance. I intend to have a crack at fixing that properly tomorrow ;)

Robin.

[1] https://github.com/ARM-software/optimized-routines

> So something like:
>
> 	a = *src++;
> 	do {
> 		b = *src++;
> 		*dst++ = a;
> 		a = *src++;
> 		*dst++ = b;
> 	} while (src != lim);
> 	*dst++ = b;
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
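The addressing-mode point can be illustrated with a rough C analogue (editorial only; the real issue is at the instruction level, where offset-addressed ldp/stp need just one base update per 64-byte block):

```c
#include <stdint.h>

/*
 * One 64-byte block copied two ways. The first form threads every access
 * through the pointer produced by the previous increment, mirroring the
 * serialising dependency of post-indexed ldp/stp. The second uses fixed
 * offsets from one base and advances the base once, which is what offset
 * addressing lets the hardware overlap freely.
 */
void block_postindex(uint64_t **dst, const uint64_t **src)
{
	for (int i = 0; i < 8; i++)
		*(*dst)++ = *(*src)++;	/* address chain: each step depends on the last */
}

void block_offset(uint64_t **dst, const uint64_t **src)
{
	for (int i = 0; i < 8; i++)
		(*dst)[i] = (*src)[i];	/* independent, offset-addressed accesses */

	*src += 8;			/* one base update per 64-byte block */
	*dst += 8;
}
```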
diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S
index 488df234c49a..c3cd6f84c9c0 100644
--- a/arch/arm64/lib/copy_template.S
+++ b/arch/arm64/lib/copy_template.S
@@ -152,29 +152,33 @@ D_h .req x14
  .p2align L1_CACHE_SHIFT
 .Lcpy_body_large:
  /* pre-get 64 bytes data. */
- ldp1 A_l, A_h, src, #16
- ldp1 B_l, B_h, src, #16
- ldp1 C_l, C_h, src, #16
- ldp1 D_l, D_h, src, #16
+ ldp2 A_l, A_h, src, #0, #8
+ ldp2 B_l, B_h, src, #16, #24
+ ldp2 C_l, C_h, src, #32, #40
+ ldp2 D_l, D_h, src, #48, #56
+ add src, src, #64
 1:
  /*
   * interlace the load of next 64 bytes data block with store of the last
   * loaded 64 bytes data.
   */
- stp1 A_l, A_h, dst, #16
- ldp1 A_l, A_h, src, #16
- stp1 B_l, B_h, dst, #16
- ldp1 B_l, B_h, src, #16
- stp1 C_l, C_h, dst, #16
- ldp1 C_l, C_h, src, #16
- stp1 D_l, D_h, dst, #16
- ldp1 D_l, D_h, src, #16
+ stp2 A_l, A_h, dst, #0, #8
+ ldp2 A_l, A_h, src, #0, #8
+ stp2 B_l, B_h, dst, #16, #24
+ ldp2 B_l, B_h, src, #16, #24
+ stp2 C_l, C_h, dst, #32, #40
+ ldp2 C_l, C_h, src, #32, #40
+ stp2 D_l, D_h, dst, #48, #56
+ ldp2 D_l, D_h, src, #48, #56
+ add src, src, #64
+ add dst, dst, #64
  subs count, count, #64
  b.ge 1b
- stp1 A_l, A_h, dst, #16
- stp1 B_l, B_h, dst, #16
- stp1 C_l, C_h, dst, #16
- stp1 D_l, D_h, dst, #16
+ stp2 A_l, A_h, dst, #0, #8
+ stp2 B_l, B_h, dst, #16, #24
+ stp2 C_l, C_h, dst, #32, #40
+ stp2 D_l, D_h, dst, #48, #56
+ add dst, dst, #64

  tst count, #0x3f
  b.ne .Ltail63
When copying more than 128 bytes, src/dst is currently updated after each ldp/stp instruction, which costs extra time. To improve this, only update src/dst after each 64 bytes loaded or stored.

Copy 4096 bytes cost on Kunpeng920 (ms):

Without this patch:
memcpy: 143.85 copy_from_user: 172.69 copy_to_user: 199.23

With this patch:
memcpy: 107.12 copy_from_user: 157.50 copy_to_user: 198.85

It's about a 25% improvement in memcpy().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 arch/arm64/lib/copy_template.S | 36 +++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 16 deletions(-)
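For what it's worth, the quoted figure checks out against the memcpy numbers above: (143.85 - 107.12) / 143.85 is roughly 25.5%, while the copy_from_user()/copy_to_user() timings change far less.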