Message ID | 20230509022207.3700-1-zhang_fei_0403@163.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Palmer Dabbelt |
Series | [1/2] RISC-V: lib: Improve memset assembler formatting |
Context | Check | Description |
---|---|---|
conchuod/cover_letter | success | Single patches do not need cover letters |
conchuod/tree_selection | success | Guessed tree name to be for-next at HEAD ac9a78681b92 |
conchuod/fixes_present | success | Fixes tag not required for -next series |
conchuod/maintainers_pattern | success | MAINTAINERS pattern errors before the patch: 6 and now 6 |
conchuod/verify_signedoff | success | Signed-off-by tag matches author and committer |
conchuod/kdoc | success | Errors and warnings before: 0 this patch: 0 |
conchuod/build_rv64_clang_allmodconfig | success | Errors and warnings before: 14 this patch: 14 |
conchuod/module_param | success | Was 0 now: 0 |
conchuod/build_rv64_gcc_allmodconfig | success | Errors and warnings before: 28 this patch: 28 |
conchuod/build_rv32_defconfig | success | Build OK |
conchuod/dtb_warn_rv64 | success | Errors and warnings before: 3 this patch: 3 |
conchuod/header_inline | success | No static functions without inline keyword in header files |
conchuod/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 183 lines checked |
conchuod/source_inline | success | Was 0 now: 0 |
conchuod/build_rv64_nommu_k210_defconfig | success | Build OK |
conchuod/verify_fixes | success | No Fixes tag |
conchuod/build_rv64_nommu_virt_defconfig | success | Build OK |
On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> From: zhangfei <zhangfei@nj.iscas.ac.cn>
>
> > >  5:
> > > -	sb	a1, 0(t0)
> > > -	addi	t0, t0, 1
> > > -	bltu	t0, a3, 5b
> > > +	sb	a1, 0(t0)
> > > +	sb	a1, -1(a3)
> > > +	li	a4, 2
> > > +	bgeu	a4, a2, 6f
> > > +
> > > +	sb	a1, 1(t0)
> > > +	sb	a1, 2(t0)
> > > +	sb	a1, -2(a3)
> > > +	sb	a1, -3(a3)
> > > +	li	a4, 6
> > > +	bgeu	a4, a2, 6f
> > > +
> > > +	sb	a1, 3(t0)
> > > +	sb	a1, -4(a3)
> > > +	li	a4, 8
> > > +	bgeu	a4, a2, 6f
> >
> > Why is this check here?
>
> Hi,
>
> I filled head and tail with minimal branching. Each conditional ensures that
> all the subsequently used offsets are well-defined and in the dest region.

I know. You trimmed my comment, so I'll quote myself, here

"""
After the check of a2 against 6 above we know that offsets 6(t0) and -7(a3)
are safe. Are we trying to avoid too many redundant stores with these
additional checks?
"""

So, again. Why the additional check against 8 above and, the one you
trimmed, checking 10?

>
> Although this approach may result in redundant storage, compared to byte by
> byte storage, it allows storage instructions to be executed in parallel and
> reduces the number of jumps.

I understood that when I read the code, but text like this should go in the
commit message to avoid people having to think their way through stuff.

>
> I used the code linked below for performance testing and commented out the
> memset call for the arm architecture in the code to ensure it runs properly
> on the risc-v platform.
>
> [1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
>
> The testing platform selected was a RISC-V SiFive U74. The test data is as
> follows:
>
> Before optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
>
> Medium memset (bytes/ns):
> memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
>
> After optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
> Medium memset (bytes/ns):
> memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
>
> From the results, it can be seen that memset has significantly improved its
> performance with a data volume of around 8B, from 0.18 bytes/ns to 0.27
> bytes/ns.

And these benchmark results belong in the cover letter, which this series
is missing.

Thanks,
drew
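For readers following the thread, the head/tail trick zhangfei describes can
be sketched in C as below. This is illustrative only, not code from the
series: the function name set_small and the 1..15 byte range are assumptions
for the example, and the overlapping stores are intentional.

#include <stddef.h>

/*
 * Illustrative sketch of the "fill head and tail with minimal branching"
 * idea for the small-size byte path (assumed here to be 1..15 bytes).
 * Stores are allowed to overlap, trading redundant writes for fewer
 * branches; every store stays inside [dst, dst + len).
 */
static void set_small(unsigned char *dst, unsigned char c, size_t len)
{
	unsigned char *end = dst + len;

	if (len == 0)
		return;
	dst[0] = c;		/* like sb a1, 0(t0)  */
	end[-1] = c;		/* like sb a1, -1(a3) */
	if (len <= 2)
		return;
	dst[1] = c;
	dst[2] = c;
	end[-2] = c;
	end[-3] = c;
	if (len <= 6)
		return;
	dst[3] = c;
	end[-4] = c;
	/* further checks (8, 10, ...) continue the same pattern */
}

Each threshold guarantees the bytes written so far already cover the whole
region for that length (e.g. for len <= 2 the two stores above hit byte 0 and
byte len-1), which is the point drew's question about the extra checks is
probing.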
On Tue, May 09, 2023 at 10:22:05AM +0800, zhangfei wrote:
> From: Andrew Jones <ajones@ventanamicro.com>
>
> Aligning the first operand of each instruction with a tab is a
> typical style which improves readability. Apply it to memset.S.
> While there, we also make a small grammar change to a comment.
>
> No functional change intended.
>
> Signed-off-by: Andrew Jones <ajones@ventanamicro.com>

Please pick up Conor's r-b on this reposting.

Thanks,
drew

> ---
>  arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
>  1 file changed, 72 insertions(+), 71 deletions(-)
>
> diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> index 34c5360c6705..e613c5c27998 100644
> --- a/arch/riscv/lib/memset.S
> +++ b/arch/riscv/lib/memset.S
> @@ -3,111 +3,112 @@
>   * Copyright (C) 2013 Regents of the University of California
>   */
>
> -
>  #include <linux/linkage.h>
>  #include <asm/asm.h>
>
>  /* void *memset(void *, int, size_t) */
>  ENTRY(__memset)
>  WEAK(memset)
> -	move t0, a0 /* Preserve return value */
> +	move	t0, a0	/* Preserve return value */
>
>  	/* Defer to byte-oriented fill for small sizes */
> -	sltiu a3, a2, 16
> -	bnez a3, 4f
> +	sltiu	a3, a2, 16
> +	bnez	a3, 4f
>
>  	/*
>  	 * Round to nearest XLEN-aligned address
> -	 * greater than or equal to start address
> +	 * greater than or equal to the start address.
>  	 */
> -	addi a3, t0, SZREG-1
> -	andi a3, a3, ~(SZREG-1)
> -	beq a3, t0, 2f /* Skip if already aligned */
> +	addi	a3, t0, SZREG-1
> +	andi	a3, a3, ~(SZREG-1)
> +	beq	a3, t0, 2f	/* Skip if already aligned */
> +
>  	/* Handle initial misalignment */
> -	sub a4, a3, t0
> +	sub	a4, a3, t0
>  1:
> -	sb a1, 0(t0)
> -	addi t0, t0, 1
> -	bltu t0, a3, 1b
> -	sub a2, a2, a4 /* Update count */
> +	sb	a1, 0(t0)
> +	addi	t0, t0, 1
> +	bltu	t0, a3, 1b
> +	sub	a2, a2, a4	/* Update count */
>
>  2: /* Duff's device with 32 XLEN stores per iteration */
>  	/* Broadcast value into all bytes */
> -	andi a1, a1, 0xff
> -	slli a3, a1, 8
> -	or a1, a3, a1
> -	slli a3, a1, 16
> -	or a1, a3, a1
> +	andi	a1, a1, 0xff
> +	slli	a3, a1, 8
> +	or	a1, a3, a1
> +	slli	a3, a1, 16
> +	or	a1, a3, a1
>  #ifdef CONFIG_64BIT
> -	slli a3, a1, 32
> -	or a1, a3, a1
> +	slli	a3, a1, 32
> +	or	a1, a3, a1
>  #endif
>
>  	/* Calculate end address */
> -	andi a4, a2, ~(SZREG-1)
> -	add a3, t0, a4
> +	andi	a4, a2, ~(SZREG-1)
> +	add	a3, t0, a4
>
> -	andi a4, a4, 31*SZREG /* Calculate remainder */
> -	beqz a4, 3f /* Shortcut if no remainder */
> -	neg a4, a4
> -	addi a4, a4, 32*SZREG /* Calculate initial offset */
> +	andi	a4, a4, 31*SZREG	/* Calculate remainder */
> +	beqz	a4, 3f			/* Shortcut if no remainder */
> +	neg	a4, a4
> +	addi	a4, a4, 32*SZREG	/* Calculate initial offset */
>
>  	/* Adjust start address with offset */
> -	sub t0, t0, a4
> +	sub	t0, t0, a4
>
>  	/* Jump into loop body */
>  	/* Assumes 32-bit instruction lengths */
> -	la a5, 3f
> +	la	a5, 3f
>  #ifdef CONFIG_64BIT
> -	srli a4, a4, 1
> +	srli	a4, a4, 1
>  #endif
> -	add a5, a5, a4
> -	jr a5
> +	add	a5, a5, a4
> +	jr	a5
>  3:
> -	REG_S a1, 0(t0)
> -	REG_S a1, SZREG(t0)
> -	REG_S a1, 2*SZREG(t0)
> -	REG_S a1, 3*SZREG(t0)
> -	REG_S a1, 4*SZREG(t0)
> -	REG_S a1, 5*SZREG(t0)
> -	REG_S a1, 6*SZREG(t0)
> -	REG_S a1, 7*SZREG(t0)
> -	REG_S a1, 8*SZREG(t0)
> -	REG_S a1, 9*SZREG(t0)
> -	REG_S a1, 10*SZREG(t0)
> -	REG_S a1, 11*SZREG(t0)
> -	REG_S a1, 12*SZREG(t0)
> -	REG_S a1, 13*SZREG(t0)
> -	REG_S a1, 14*SZREG(t0)
> -	REG_S a1, 15*SZREG(t0)
> -	REG_S a1, 16*SZREG(t0)
> -	REG_S a1, 17*SZREG(t0)
> -	REG_S a1, 18*SZREG(t0)
> -	REG_S a1, 19*SZREG(t0)
> -	REG_S a1, 20*SZREG(t0)
> -	REG_S a1, 21*SZREG(t0)
> -	REG_S a1, 22*SZREG(t0)
> -	REG_S a1, 23*SZREG(t0)
> -	REG_S a1, 24*SZREG(t0)
> -	REG_S a1, 25*SZREG(t0)
> -	REG_S a1, 26*SZREG(t0)
> -	REG_S a1, 27*SZREG(t0)
> -	REG_S a1, 28*SZREG(t0)
> -	REG_S a1, 29*SZREG(t0)
> -	REG_S a1, 30*SZREG(t0)
> -	REG_S a1, 31*SZREG(t0)
> -	addi t0, t0, 32*SZREG
> -	bltu t0, a3, 3b
> -	andi a2, a2, SZREG-1 /* Update count */
> +	REG_S	a1, 0(t0)
> +	REG_S	a1, SZREG(t0)
> +	REG_S	a1, 2*SZREG(t0)
> +	REG_S	a1, 3*SZREG(t0)
> +	REG_S	a1, 4*SZREG(t0)
> +	REG_S	a1, 5*SZREG(t0)
> +	REG_S	a1, 6*SZREG(t0)
> +	REG_S	a1, 7*SZREG(t0)
> +	REG_S	a1, 8*SZREG(t0)
> +	REG_S	a1, 9*SZREG(t0)
> +	REG_S	a1, 10*SZREG(t0)
> +	REG_S	a1, 11*SZREG(t0)
> +	REG_S	a1, 12*SZREG(t0)
> +	REG_S	a1, 13*SZREG(t0)
> +	REG_S	a1, 14*SZREG(t0)
> +	REG_S	a1, 15*SZREG(t0)
> +	REG_S	a1, 16*SZREG(t0)
> +	REG_S	a1, 17*SZREG(t0)
> +	REG_S	a1, 18*SZREG(t0)
> +	REG_S	a1, 19*SZREG(t0)
> +	REG_S	a1, 20*SZREG(t0)
> +	REG_S	a1, 21*SZREG(t0)
> +	REG_S	a1, 22*SZREG(t0)
> +	REG_S	a1, 23*SZREG(t0)
> +	REG_S	a1, 24*SZREG(t0)
> +	REG_S	a1, 25*SZREG(t0)
> +	REG_S	a1, 26*SZREG(t0)
> +	REG_S	a1, 27*SZREG(t0)
> +	REG_S	a1, 28*SZREG(t0)
> +	REG_S	a1, 29*SZREG(t0)
> +	REG_S	a1, 30*SZREG(t0)
> +	REG_S	a1, 31*SZREG(t0)
> +
> +	addi	t0, t0, 32*SZREG
> +	bltu	t0, a3, 3b
> +	andi	a2, a2, SZREG-1	/* Update count */
>
>  4:
>  	/* Handle trailing misalignment */
> -	beqz a2, 6f
> -	add a3, t0, a2
> +	beqz	a2, 6f
> +	add	a3, t0, a2
>  5:
> -	sb a1, 0(t0)
> -	addi t0, t0, 1
> -	bltu t0, a3, 5b
> +	sb	a1, 0(t0)
> +	addi	t0, t0, 1
> +	bltu	t0, a3, 5b
>  6:
>  	ret
>  END(__memset)
> --
> 2.33.0
>
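As an aside for anyone reading the quoted diff above, the "broadcast value
into all bytes" sequence corresponds to the following C. This is shown only
as an illustration; the helper name broadcast_byte is made up and is not
part of the patch.

#include <stdint.h>

/*
 * Replicate the low byte of c into every byte of a register-wide word
 * (RV64 shown); mirrors the andi/slli/or sequence in memset.S.
 */
static uint64_t broadcast_byte(uint64_t c)
{
	c &= 0xff;	/* andi a1, a1, 0xff             */
	c |= c << 8;	/* 2 bytes set                   */
	c |= c << 16;	/* 4 bytes set                   */
	c |= c << 32;	/* 8 bytes set (64-bit case only) */
	return c;
}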
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
index 34c5360c6705..e613c5c27998 100644
--- a/arch/riscv/lib/memset.S
+++ b/arch/riscv/lib/memset.S
@@ -3,111 +3,112 @@
  * Copyright (C) 2013 Regents of the University of California
  */
 
-
 #include <linux/linkage.h>
 #include <asm/asm.h>
 
 /* void *memset(void *, int, size_t) */
 ENTRY(__memset)
 WEAK(memset)
-	move t0, a0 /* Preserve return value */
+	move	t0, a0	/* Preserve return value */
 
 	/* Defer to byte-oriented fill for small sizes */
-	sltiu a3, a2, 16
-	bnez a3, 4f
+	sltiu	a3, a2, 16
+	bnez	a3, 4f
 
 	/*
 	 * Round to nearest XLEN-aligned address
-	 * greater than or equal to start address
+	 * greater than or equal to the start address.
 	 */
-	addi a3, t0, SZREG-1
-	andi a3, a3, ~(SZREG-1)
-	beq a3, t0, 2f /* Skip if already aligned */
+	addi	a3, t0, SZREG-1
+	andi	a3, a3, ~(SZREG-1)
+	beq	a3, t0, 2f	/* Skip if already aligned */
+
 	/* Handle initial misalignment */
-	sub a4, a3, t0
+	sub	a4, a3, t0
 1:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 1b
-	sub a2, a2, a4 /* Update count */
+	sb	a1, 0(t0)
+	addi	t0, t0, 1
+	bltu	t0, a3, 1b
+	sub	a2, a2, a4	/* Update count */
 
 2: /* Duff's device with 32 XLEN stores per iteration */
 	/* Broadcast value into all bytes */
-	andi a1, a1, 0xff
-	slli a3, a1, 8
-	or a1, a3, a1
-	slli a3, a1, 16
-	or a1, a3, a1
+	andi	a1, a1, 0xff
+	slli	a3, a1, 8
+	or	a1, a3, a1
+	slli	a3, a1, 16
+	or	a1, a3, a1
 #ifdef CONFIG_64BIT
-	slli a3, a1, 32
-	or a1, a3, a1
+	slli	a3, a1, 32
+	or	a1, a3, a1
 #endif
 
 	/* Calculate end address */
-	andi a4, a2, ~(SZREG-1)
-	add a3, t0, a4
+	andi	a4, a2, ~(SZREG-1)
+	add	a3, t0, a4
 
-	andi a4, a4, 31*SZREG /* Calculate remainder */
-	beqz a4, 3f /* Shortcut if no remainder */
-	neg a4, a4
-	addi a4, a4, 32*SZREG /* Calculate initial offset */
+	andi	a4, a4, 31*SZREG	/* Calculate remainder */
+	beqz	a4, 3f			/* Shortcut if no remainder */
+	neg	a4, a4
+	addi	a4, a4, 32*SZREG	/* Calculate initial offset */
 
 	/* Adjust start address with offset */
-	sub t0, t0, a4
+	sub	t0, t0, a4
 
 	/* Jump into loop body */
 	/* Assumes 32-bit instruction lengths */
-	la a5, 3f
+	la	a5, 3f
 #ifdef CONFIG_64BIT
-	srli a4, a4, 1
+	srli	a4, a4, 1
 #endif
-	add a5, a5, a4
-	jr a5
+	add	a5, a5, a4
+	jr	a5
 3:
-	REG_S a1, 0(t0)
-	REG_S a1, SZREG(t0)
-	REG_S a1, 2*SZREG(t0)
-	REG_S a1, 3*SZREG(t0)
-	REG_S a1, 4*SZREG(t0)
-	REG_S a1, 5*SZREG(t0)
-	REG_S a1, 6*SZREG(t0)
-	REG_S a1, 7*SZREG(t0)
-	REG_S a1, 8*SZREG(t0)
-	REG_S a1, 9*SZREG(t0)
-	REG_S a1, 10*SZREG(t0)
-	REG_S a1, 11*SZREG(t0)
-	REG_S a1, 12*SZREG(t0)
-	REG_S a1, 13*SZREG(t0)
-	REG_S a1, 14*SZREG(t0)
-	REG_S a1, 15*SZREG(t0)
-	REG_S a1, 16*SZREG(t0)
-	REG_S a1, 17*SZREG(t0)
-	REG_S a1, 18*SZREG(t0)
-	REG_S a1, 19*SZREG(t0)
-	REG_S a1, 20*SZREG(t0)
-	REG_S a1, 21*SZREG(t0)
-	REG_S a1, 22*SZREG(t0)
-	REG_S a1, 23*SZREG(t0)
-	REG_S a1, 24*SZREG(t0)
-	REG_S a1, 25*SZREG(t0)
-	REG_S a1, 26*SZREG(t0)
-	REG_S a1, 27*SZREG(t0)
-	REG_S a1, 28*SZREG(t0)
-	REG_S a1, 29*SZREG(t0)
-	REG_S a1, 30*SZREG(t0)
-	REG_S a1, 31*SZREG(t0)
-	addi t0, t0, 32*SZREG
-	bltu t0, a3, 3b
-	andi a2, a2, SZREG-1 /* Update count */
+	REG_S	a1, 0(t0)
+	REG_S	a1, SZREG(t0)
+	REG_S	a1, 2*SZREG(t0)
+	REG_S	a1, 3*SZREG(t0)
+	REG_S	a1, 4*SZREG(t0)
+	REG_S	a1, 5*SZREG(t0)
+	REG_S	a1, 6*SZREG(t0)
+	REG_S	a1, 7*SZREG(t0)
+	REG_S	a1, 8*SZREG(t0)
+	REG_S	a1, 9*SZREG(t0)
+	REG_S	a1, 10*SZREG(t0)
+	REG_S	a1, 11*SZREG(t0)
+	REG_S	a1, 12*SZREG(t0)
+	REG_S	a1, 13*SZREG(t0)
+	REG_S	a1, 14*SZREG(t0)
+	REG_S	a1, 15*SZREG(t0)
+	REG_S	a1, 16*SZREG(t0)
+	REG_S	a1, 17*SZREG(t0)
+	REG_S	a1, 18*SZREG(t0)
+	REG_S	a1, 19*SZREG(t0)
+	REG_S	a1, 20*SZREG(t0)
+	REG_S	a1, 21*SZREG(t0)
+	REG_S	a1, 22*SZREG(t0)
+	REG_S	a1, 23*SZREG(t0)
+	REG_S	a1, 24*SZREG(t0)
+	REG_S	a1, 25*SZREG(t0)
+	REG_S	a1, 26*SZREG(t0)
+	REG_S	a1, 27*SZREG(t0)
+	REG_S	a1, 28*SZREG(t0)
+	REG_S	a1, 29*SZREG(t0)
+	REG_S	a1, 30*SZREG(t0)
+	REG_S	a1, 31*SZREG(t0)
+
+	addi	t0, t0, 32*SZREG
+	bltu	t0, a3, 3b
+	andi	a2, a2, SZREG-1	/* Update count */
 
 4:
 	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, t0, a2
+	beqz	a2, 6f
+	add	a3, t0, a2
 5:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 5b
+	sb	a1, 0(t0)
+	addi	t0, t0, 1
+	bltu	t0, a3, 5b
 6:
 	ret
 END(__memset)
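For reference, the "Duff's device with 32 XLEN stores per iteration"
structure in the patch above can be approximated in C as follows. This is
only a sketch, not the kernel code: it uses an 8-way unroll instead of 32
for brevity, and the assembly implements the computed entry by jumping into
the unrolled body (scaling the remainder by the store-instruction length)
rather than with a switch. The helper name set_words is made up.

#include <stddef.h>
#include <stdint.h>

/*
 * Store nwords copies of v, 8 stores per loop iteration, entering the
 * unrolled body at the point that handles the remainder first.
 */
static void set_words(uintptr_t *p, uintptr_t v, size_t nwords)
{
	size_t rem = nwords % 8;
	size_t iters = nwords / 8 + (rem != 0);

	if (nwords == 0)
		return;

	switch (rem) {
	case 0: do {	*p++ = v;	/* each case falls through */
	case 7:		*p++ = v;
	case 6:		*p++ = v;
	case 5:		*p++ = v;
	case 4:		*p++ = v;
	case 3:		*p++ = v;
	case 2:		*p++ = v;
	case 1:		*p++ = v;
		} while (--iters);
	}
}

The first (possibly partial) pass writes rem words, and every later pass
writes a full 8, which is exactly what the computed jump into label 3 in
memset.S achieves without a branch per word.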