[1/2] RISC-V: Probe for unaligned access speed

Message ID	20230623222016.3742145-2-evan@rivosinc.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org> From: Evan Green <evan@rivosinc.com> To: Palmer Dabbelt <palmer@rivosinc.com> Subject: [PATCH 1/2] RISC-V: Probe for unaligned access speed Date: Fri, 23 Jun 2023 15:20:15 -0700 Message-Id: <20230623222016.3742145-2-evan@rivosinc.com> In-Reply-To: <20230623222016.3742145-1-evan@rivosinc.com> References: <20230623222016.3742145-1-evan@rivosinc.com> MIME-Version: 1.0 Precedence: list Cc: linux-doc@vger.kernel.org, Yangyu Chen <cyy@cyyself.name>, Conor Dooley <conor.dooley@microchip.com>, Guo Ren <guoren@kernel.org>, Jisheng Zhang <jszhang@kernel.org>, linux-riscv@lists.infradead.org, Jonathan Corbet <corbet@lwn.net>, Xianting Tian <xianting.tian@linux.alibaba.com>, Masahiro Yamada <masahiroy@kernel.org>, Greentime Hu <greentime.hu@sifive.com>, Simon Hosie <shosie@rivosinc.com>, Li Zhengyu <lizhengyu3@huawei.com>, Andrew Jones <ajones@ventanamicro.com>, Albert Ou <aou@eecs.berkeley.edu>, Alexandre Ghiti <alexghiti@rivosinc.com>, Ley Foon Tan <leyfoon.tan@starfivetech.com>, Paul Walmsley <paul.walmsley@sifive.com>, Heiko Stuebner <heiko.stuebner@vrull.eu>, Anup Patel <apatel@ventanamicro.com>, linux-kernel@vger.kernel.org, Sia Jee Heng <jeeheng.sia@starfivetech.com>, Evan Green <evan@rivosinc.com>, Palmer Dabbelt <palmer@dabbelt.com>, Andy Chiu <andy.chiu@sifive.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org> Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org
Series	[1/2] RISC-V: Probe for unaligned access speed \| expand [1/2] RISC-V: Probe for unaligned access speed [2/2] RISC-V: alternative: Remove feature_probe_func

Context	Check	Description
conchuod/cover_letter	success	Single patches do not need cover letters
conchuod/tree_selection	success	Guessed tree name to be for-next at HEAD 4681dacadeef
conchuod/fixes_present	success	Fixes tag not required for -next series
conchuod/maintainers_pattern	success	MAINTAINERS pattern errors before the patch: 6 and now 6
conchuod/verify_signedoff	success	Signed-off-by tag matches author and committer
conchuod/kdoc	success	Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig	success	Errors and warnings before: 8 this patch: 8
conchuod/module_param	success	Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig	success	Errors and warnings before: 9 this patch: 9
conchuod/build_rv32_defconfig	success	Build OK
conchuod/dtb_warn_rv64	success	Errors and warnings before: 20 this patch: 20
conchuod/header_inline	success	No static functions without inline keyword in header files
conchuod/checkpatch	warning	CHECK: Blank lines aren't necessary after an open brace '{' CHECK: Consider using #include <linux/cpufeature.h> instead of <asm/cpufeature.h> WARNING: 'THead' may be misspelled - perhaps 'Thread'? WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
conchuod/build_rv64_nommu_k210_defconfig	success	Build OK
conchuod/verify_fixes	success	No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig	success	Build OK

Evan Green June 23, 2023, 10:20 p.m. UTC

Rather than deferring misaligned access speed determinations to a vendor
function, let's probe them and find out how fast they are. If we
determine that a misaligned word access is faster than N byte accesses,
mark the hardware's misaligned access as "fast".

Fix the documentation as well to reflect this bar. Previously the only
SoC that returned "fast" was the THead C906. The change to the
documentation is more a clarification, since the C906 is fast in the
sense of the corrected documentation.

Signed-off-by: Evan Green <evan@rivosinc.com>
---

 Documentation/riscv/hwprobe.rst     |  8 +--
 arch/riscv/include/asm/cpufeature.h |  2 +
 arch/riscv/kernel/Makefile          |  1 +
 arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
 arch/riscv/kernel/copy-noalign.h    | 13 +++++
 arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
 arch/riscv/kernel/smpboot.c         |  2 +
 7 files changed, 171 insertions(+), 4 deletions(-)
 create mode 100644 arch/riscv/kernel/copy-noalign.S
 create mode 100644 arch/riscv/kernel/copy-noalign.h

David Laight June 25, 2023, 9:42 p.m. UTC | #1

From: Evan Green
> Sent: 23 June 2023 23:20
> 
> Rather than deferring misaligned access speed determinations to a vendor
> function, let's probe them and find out how fast they are. If we
> determine that a misaligned word access is faster than N byte accesses,
> mark the hardware's misaligned access as "fast".
> 
> Fix the documentation as well to reflect this bar. Previously the only
> SoC that returned "fast" was the THead C906. The change to the
> documentation is more a clarification, since the C906 is fast in the
> sense of the corrected documentation.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> ---
...
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> index 19165ebd82ba..710325751766 100644
> --- a/Documentation/riscv/hwprobe.rst
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -88,12 +88,12 @@ The following keys are defined:
>      always extremely slow.
> 
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> -    in hardware, but are slower than the cooresponding aligned accesses
> -    sequences.
> +    in hardware, but are slower than N byte accesses, where N is the native
> +    word size.
> 
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> -    in hardware and are faster than the cooresponding aligned accesses
> -    sequences.
> +    in hardware and are faster than N byte accesses, where N is the native
> +    word size.

I think I'd just say 'faster/slower than using byte accesses' (ie no N).

There are two obvious FAST cases:
1) the misaligned access takes an extra clock - worth aligning copies.
2) the misaligned access is pretty much as fast as an aligned one.

Even if you find it hard to distinguish them you should probably
allow for them both.

x86 (on Intel (non-atom) cpu) is definitely in the latter camp.
Misaligned copies are measurable slower - but not enough to
ever worry about.
I think that misaligned transfers get spilt into 8 byte accesses
(pretty irrelevant in the kernel) and then accesses that cross
cache line boundaries are split on the boundary.
With pipelined writes and two reads/clock it doesn't often
make a measurable difference.
That is definitely what I see for uncached accesses to PCIe space.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Conor Dooley June 26, 2023, 2:14 p.m. UTC | #2

Hey Evan,

Some minor nitpickery comments & one actual one.

On Fri, Jun 23, 2023 at 03:20:15PM -0700, Evan Green wrote:
> Rather than deferring misaligned access speed determinations to a vendor

Could you pick a consistent word to use? You've got "unaligned",
"misaligned" and "noalign" sprinkled through out the series.

> function, let's probe them and find out how fast they are. If we
> determine that a misaligned word access is faster than N byte accesses,
> mark the hardware's misaligned access as "fast".
> 
> Fix the documentation as well to reflect this bar. Previously the only
> SoC that returned "fast" was the THead C906. The change to the
> documentation is more a clarification, since the C906 is fast in the
> sense of the corrected documentation.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> ---
> 
>  Documentation/riscv/hwprobe.rst     |  8 +--
>  arch/riscv/include/asm/cpufeature.h |  2 +
>  arch/riscv/kernel/Makefile          |  1 +
>  arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
>  arch/riscv/kernel/copy-noalign.h    | 13 +++++
>  arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
>  arch/riscv/kernel/smpboot.c         |  2 +
>  7 files changed, 171 insertions(+), 4 deletions(-)
>  create mode 100644 arch/riscv/kernel/copy-noalign.S
>  create mode 100644 arch/riscv/kernel/copy-noalign.h
> 
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> index 19165ebd82ba..710325751766 100644
> --- a/Documentation/riscv/hwprobe.rst
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -88,12 +88,12 @@ The following keys are defined:
>      always extremely slow.
>  
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> -    in hardware, but are slower than the cooresponding aligned accesses
> -    sequences.

Nice, fixed the typo by removing the offender ;)

> +    in hardware, but are slower than N byte accesses, where N is the native
> +    word size.
>  
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> -    in hardware and are faster than the cooresponding aligned accesses
> -    sequences.
> +    in hardware and are faster than N byte accesses, where N is the native
> +    word size.
>  
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
>      not supported at all and will generate a misaligned address fault.
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index 23fed53b8815..b8e917176616 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
>  /* Per-cpu ISA extensions. */
>  extern struct riscv_isainfo hart_isa[NR_CPUS];
>  
> +void check_misaligned_access(int cpu);
> +
>  #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index a42951911067..f934d7ab7840 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -34,6 +34,7 @@ extra-y += vmlinux.lds
>  obj-y	+= head.o
>  obj-y	+= soc.o
>  obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
> +obj-y	+= copy-noalign.o
>  obj-y	+= cpu.o
>  obj-y	+= cpufeature.o
>  obj-y	+= entry.o
> diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
> new file mode 100644
> index 000000000000..3807fc2324b2
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.S
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2023 Rivos Inc. */
> +
> +#include <linux/linkage.h>
> +#include <asm/asm.h>
> +
> +	.text
> +
> +/* void __copy_words_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using word loads and stores. */
> +/* Note: The size is truncated to a multiple of 8 * SZREG */
> +ENTRY(__copy_words_unaligned)
> +	andi a4, a2, ~((8*SZREG)-1)
> +	beqz a4, 2f
> +	add a3, a1, a4
> +1:
> +	REG_L a4,       0(a1)
> +	REG_L a5,   SZREG(a1)
> +	REG_L a6, 2*SZREG(a1)
> +	REG_L a7, 3*SZREG(a1)
> +	REG_L t0, 4*SZREG(a1)
> +	REG_L t1, 5*SZREG(a1)
> +	REG_L t2, 6*SZREG(a1)
> +	REG_L t3, 7*SZREG(a1)
> +	REG_S a4,       0(a0)
> +	REG_S a5,   SZREG(a0)
> +	REG_S a6, 2*SZREG(a0)
> +	REG_S a7, 3*SZREG(a0)
> +	REG_S t0, 4*SZREG(a0)
> +	REG_S t1, 5*SZREG(a0)
> +	REG_S t2, 6*SZREG(a0)
> +	REG_S t3, 7*SZREG(a0)
> +	addi a0, a0, 8*SZREG
> +	addi a1, a1, 8*SZREG
> +	bltu a1, a3, 1b
> +
> +2:
> +	ret
> +END(__copy_words_unaligned)
> +
> +/* void __copy_bytes_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using only byte accesses. */
> +/* Note: The size is truncated to a multiple of 8 */
> +ENTRY(__copy_bytes_unaligned)
> +	andi a4, a2, ~(8-1)
> +	beqz a4, 2f
> +	add a3, a1, a4

Could you align operands for ASM please, to make this a little easier to
read?

> +1:
> +	lb a4, 0(a1)
> +	lb a5, 1(a1)
> +	lb a6, 2(a1)
> +	lb a7, 3(a1)
> +	lb t0, 4(a1)
> +	lb t1, 5(a1)
> +	lb t2, 6(a1)
> +	lb t3, 7(a1)
> +	sb a4, 0(a0)
> +	sb a5, 1(a0)
> +	sb a6, 2(a0)
> +	sb a7, 3(a0)
> +	sb t0, 4(a0)
> +	sb t1, 5(a0)
> +	sb t2, 6(a0)
> +	sb t3, 7(a0)
> +	addi a0, a0, 8
> +	addi a1, a1, 8
> +	bltu a1, a3, 1b
> +
> +2:
> +	ret
> +END(__copy_bytes_unaligned)
> diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
> new file mode 100644
> index 000000000000..99fbb9c763e0
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Rivos, Inc.
> + */
> +#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
> +#define __RISCV_KERNEL_COPY_NOALIGN_H
> +
> +#include <linux/types.h>
> +
> +void __copy_words_unaligned(void *dst, const void *src, size_t size);
> +void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
> +
> +#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index bdcf460ea53d..3f7200dcc00c 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -19,11 +19,21 @@
>  #include <asm/cacheflush.h>
>  #include <asm/cpufeature.h>
>  #include <asm/hwcap.h>
> +#include <asm/hwprobe.h>
>  #include <asm/patch.h>
>  #include <asm/processor.h>
>  #include <asm/vector.h>
>  
> +#include "copy-noalign.h"
> +
>  #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
> +#define MISALIGNED_ACCESS_JIFFIES_LG2 1
> +#define MISALIGNED_BUFFER_SIZE 0x4000
> +#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
> +

I think this blank line is misplaced, it should go after NUM_ALPHA_EXTS
instead of (or as well as) here.

> +#define MISALIGNED_COPY_MBS(_count) \
> +	((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
> +	 (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
>  
>  unsigned long elf_hwcap __read_mostly;
>  
> @@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
>  	return hwcap;
>  }
>  
> +void check_misaligned_access(int cpu)
> +{
> +	unsigned long j0, j1;
> +	struct page *page;
> +	void *dst;
> +	void *src;
> +	long word_copies = 0;
> +	long byte_copies = 0;
> +	long speed = RISCV_HWPROBE_MISALIGNED_SLOW;

Is this not a change from current behaviour, that may actually lead to
incorrect reporting. Presently, only T-Head stuff sets a speed, so
hwprobe falls back to UNKNOWN for everything else. With this, we will
get slow set, for anything failing the test.
Slow is defined as "Misaligned accesses are supported in hardware, but
are slower than the cooresponding aligned accesses sequences (sic)", but
you have no way of knowing, based on the test you are performing, whether
the hardware supports it or if it is emulated by firmware.
Perhaps that is not relevant to userspace, but wanted to know your
thoughts.

> +
> +	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
> +	if (!page) {
> +		pr_warn("Can't alloc pages to measure memcpy performance");
> +		return;
> +	}
> +
> +	/* Make a misaligned destination buffer. */
> +	dst = (void *)((unsigned long)page_address(page) | 0x1);
> +	/* Misalign src as well, but differently (off by 1 + 2 = 3). */
> +	src = dst + (MISALIGNED_BUFFER_SIZE / 2);
> +	src += 2;
> +	/* Do a warmup. */
> +	__copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> +	preempt_disable();
> +	j0 = jiffies;
> +	while ((j1 = jiffies) == j0)
> +		cpu_relax();
> +
> +	while (time_before(jiffies,
> +			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {

Does this not fit in 100 chars?

> +
> +		__copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> +		word_copies += 1;
> +	}
> +
> +	__copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> +	j0 = jiffies;
> +	while ((j1 = jiffies) == j0)
> +		cpu_relax();
> +
> +	while (time_before(jiffies,
> +			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {

Ditto here, no?

Cheers,
Conor.

Jessica Clarke June 26, 2023, 9:42 p.m. UTC | #3

On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> 
> Rather than deferring misaligned access speed determinations to a vendor
> function, let's probe them and find out how fast they are. If we
> determine that a misaligned word access is faster than N byte accesses,
> mark the hardware's misaligned access as "fast".

How sure are you that your measurements can be extrapolated and aren’t
an artefact of the testing process? For example, off the top of my head:

* The first run will potentially be penalised by data cache misses,
  untrained prefetchers, TLB misses, branch predictors, etc. compared
  with later runs. You have one warmup, but who knows how many
  iterations it will take to converge?

* The code being benchmarked isn’t the code being run, so differences
  in access patterns, loop unrolling, loop alignment, etc. may cause the
  real code to behave differently (and perhaps change which is better).

The non-determinism that could in theory result from this also seems
like a not great idea to have.

Jess

> Fix the documentation as well to reflect this bar. Previously the only
> SoC that returned "fast" was the THead C906. The change to the
> documentation is more a clarification, since the C906 is fast in the
> sense of the corrected documentation.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> ---
> 
> Documentation/riscv/hwprobe.rst     |  8 +--
> arch/riscv/include/asm/cpufeature.h |  2 +
> arch/riscv/kernel/Makefile          |  1 +
> arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
> arch/riscv/kernel/copy-noalign.h    | 13 +++++
> arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
> arch/riscv/kernel/smpboot.c         |  2 +
> 7 files changed, 171 insertions(+), 4 deletions(-)
> create mode 100644 arch/riscv/kernel/copy-noalign.S
> create mode 100644 arch/riscv/kernel/copy-noalign.h
> 
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> index 19165ebd82ba..710325751766 100644
> --- a/Documentation/riscv/hwprobe.rst
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -88,12 +88,12 @@ The following keys are defined:
>     always extremely slow.
> 
>   * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> -    in hardware, but are slower than the cooresponding aligned accesses
> -    sequences.
> +    in hardware, but are slower than N byte accesses, where N is the native
> +    word size.
> 
>   * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> -    in hardware and are faster than the cooresponding aligned accesses
> -    sequences.
> +    in hardware and are faster than N byte accesses, where N is the native
> +    word size.
> 
>   * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
>     not supported at all and will generate a misaligned address fault.
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index 23fed53b8815..b8e917176616 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
> /* Per-cpu ISA extensions. */
> extern struct riscv_isainfo hart_isa[NR_CPUS];
> 
> +void check_misaligned_access(int cpu);
> +
> #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index a42951911067..f934d7ab7840 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -34,6 +34,7 @@ extra-y += vmlinux.lds
> obj-y += head.o
> obj-y += soc.o
> obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
> +obj-y += copy-noalign.o
> obj-y += cpu.o
> obj-y += cpufeature.o
> obj-y += entry.o
> diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
> new file mode 100644
> index 000000000000..3807fc2324b2
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.S
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2023 Rivos Inc. */
> +
> +#include <linux/linkage.h>
> +#include <asm/asm.h>
> +
> + .text
> +
> +/* void __copy_words_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using word loads and stores. */
> +/* Note: The size is truncated to a multiple of 8 * SZREG */
> +ENTRY(__copy_words_unaligned)
> + andi a4, a2, ~((8*SZREG)-1)
> + beqz a4, 2f
> + add a3, a1, a4
> +1:
> + REG_L a4,       0(a1)
> + REG_L a5,   SZREG(a1)
> + REG_L a6, 2*SZREG(a1)
> + REG_L a7, 3*SZREG(a1)
> + REG_L t0, 4*SZREG(a1)
> + REG_L t1, 5*SZREG(a1)
> + REG_L t2, 6*SZREG(a1)
> + REG_L t3, 7*SZREG(a1)
> + REG_S a4,       0(a0)
> + REG_S a5,   SZREG(a0)
> + REG_S a6, 2*SZREG(a0)
> + REG_S a7, 3*SZREG(a0)
> + REG_S t0, 4*SZREG(a0)
> + REG_S t1, 5*SZREG(a0)
> + REG_S t2, 6*SZREG(a0)
> + REG_S t3, 7*SZREG(a0)
> + addi a0, a0, 8*SZREG
> + addi a1, a1, 8*SZREG
> + bltu a1, a3, 1b
> +
> +2:
> + ret
> +END(__copy_words_unaligned)
> +
> +/* void __copy_bytes_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using only byte accesses. */
> +/* Note: The size is truncated to a multiple of 8 */
> +ENTRY(__copy_bytes_unaligned)
> + andi a4, a2, ~(8-1)
> + beqz a4, 2f
> + add a3, a1, a4
> +1:
> + lb a4, 0(a1)
> + lb a5, 1(a1)
> + lb a6, 2(a1)
> + lb a7, 3(a1)
> + lb t0, 4(a1)
> + lb t1, 5(a1)
> + lb t2, 6(a1)
> + lb t3, 7(a1)
> + sb a4, 0(a0)
> + sb a5, 1(a0)
> + sb a6, 2(a0)
> + sb a7, 3(a0)
> + sb t0, 4(a0)
> + sb t1, 5(a0)
> + sb t2, 6(a0)
> + sb t3, 7(a0)
> + addi a0, a0, 8
> + addi a1, a1, 8
> + bltu a1, a3, 1b
> +
> +2:
> + ret
> +END(__copy_bytes_unaligned)
> diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
> new file mode 100644
> index 000000000000..99fbb9c763e0
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Rivos, Inc.
> + */
> +#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
> +#define __RISCV_KERNEL_COPY_NOALIGN_H
> +
> +#include <linux/types.h>
> +
> +void __copy_words_unaligned(void *dst, const void *src, size_t size);
> +void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
> +
> +#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index bdcf460ea53d..3f7200dcc00c 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -19,11 +19,21 @@
> #include <asm/cacheflush.h>
> #include <asm/cpufeature.h>
> #include <asm/hwcap.h>
> +#include <asm/hwprobe.h>
> #include <asm/patch.h>
> #include <asm/processor.h>
> #include <asm/vector.h>
> 
> +#include "copy-noalign.h"
> +
> #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
> +#define MISALIGNED_ACCESS_JIFFIES_LG2 1
> +#define MISALIGNED_BUFFER_SIZE 0x4000
> +#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
> +
> +#define MISALIGNED_COPY_MBS(_count) \
> + ((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
> + (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
> 
> unsigned long elf_hwcap __read_mostly;
> 
> @@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
> return hwcap;
> }
> 
> +void check_misaligned_access(int cpu)
> +{
> + unsigned long j0, j1;
> + struct page *page;
> + void *dst;
> + void *src;
> + long word_copies = 0;
> + long byte_copies = 0;
> + long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
> +
> + page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
> + if (!page) {
> + pr_warn("Can't alloc pages to measure memcpy performance");
> + return;
> + }
> +
> + /* Make a misaligned destination buffer. */
> + dst = (void *)((unsigned long)page_address(page) | 0x1);
> + /* Misalign src as well, but differently (off by 1 + 2 = 3). */
> + src = dst + (MISALIGNED_BUFFER_SIZE / 2);
> + src += 2;
> + /* Do a warmup. */
> + __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + preempt_disable();
> + j0 = jiffies;
> + while ((j1 = jiffies) == j0)
> + cpu_relax();
> +
> + while (time_before(jiffies,
> +   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
> +
> + __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + word_copies += 1;
> + }
> +
> + __copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + j0 = jiffies;
> + while ((j1 = jiffies) == j0)
> + cpu_relax();
> +
> + while (time_before(jiffies,
> +   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
> + __copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + byte_copies += 1;
> + }
> +
> + preempt_enable();
> + if (word_copies >= byte_copies)
> + speed = RISCV_HWPROBE_MISALIGNED_FAST;
> +
> + pr_info("cpu%d: Unaligned word copy %ld MB/s, byte copy %ld MB/s, misaligned accesses are %s\n",
> + cpu,
> + MISALIGNED_COPY_MBS(word_copies),
> + MISALIGNED_COPY_MBS(byte_copies),
> + (speed == RISCV_HWPROBE_MISALIGNED_FAST) ? "fast" : "slow");
> +
> + per_cpu(misaligned_access_speed, cpu) = speed;
> + __free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
> +}
> +
> +static int check_misaligned_access0(void)
> +{
> + check_misaligned_access(0);
> + return 0;
> +}
> +
> +arch_initcall(check_misaligned_access0);
> +
> #ifdef CONFIG_RISCV_ALTERNATIVE
> /*
>  * Alternative patch sites consider 48 bits when determining when to patch
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index bb0b76e1a6d4..e34a71b4786b 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -26,6 +26,7 @@
> #include <linux/sched/task_stack.h>
> #include <linux/sched/mm.h>
> #include <asm/cpu_ops.h>
> +#include <asm/cpufeature.h>
> #include <asm/irq.h>
> #include <asm/mmu_context.h>
> #include <asm/numa.h>
> @@ -244,6 +245,7 @@ asmlinkage __visible void smp_callin(void)
> notify_cpu_starting(curr_cpuid);
> numa_add_cpu(curr_cpuid);
> set_cpu_online(curr_cpuid, 1);
> + check_misaligned_access(curr_cpuid);
> probe_vendor_features(curr_cpuid);
> 
> if (has_vector()) {
> -- 
> 2.34.1
> 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

Evan Green June 27, 2023, 7:11 p.m. UTC | #4

On Mon, Jun 26, 2023 at 7:15 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> Hey Evan,
>
> Some minor nitpickery comments & one actual one.
>
> On Fri, Jun 23, 2023 at 03:20:15PM -0700, Evan Green wrote:
> > Rather than deferring misaligned access speed determinations to a vendor
>
> Could you pick a consistent word to use? You've got "unaligned",
> "misaligned" and "noalign" sprinkled through out the series.
>
> > function, let's probe them and find out how fast they are. If we
> > determine that a misaligned word access is faster than N byte accesses,
> > mark the hardware's misaligned access as "fast".
> >
> > Fix the documentation as well to reflect this bar. Previously the only
> > SoC that returned "fast" was the THead C906. The change to the
> > documentation is more a clarification, since the C906 is fast in the
> > sense of the corrected documentation.
> >
> > Signed-off-by: Evan Green <evan@rivosinc.com>
> > ---
> >
> >  Documentation/riscv/hwprobe.rst     |  8 +--
> >  arch/riscv/include/asm/cpufeature.h |  2 +
> >  arch/riscv/kernel/Makefile          |  1 +
> >  arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
> >  arch/riscv/kernel/copy-noalign.h    | 13 +++++
> >  arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
> >  arch/riscv/kernel/smpboot.c         |  2 +
> >  7 files changed, 171 insertions(+), 4 deletions(-)
> >  create mode 100644 arch/riscv/kernel/copy-noalign.S
> >  create mode 100644 arch/riscv/kernel/copy-noalign.h
> >
> > diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> > index 19165ebd82ba..710325751766 100644
> > --- a/Documentation/riscv/hwprobe.rst
> > +++ b/Documentation/riscv/hwprobe.rst
> > @@ -88,12 +88,12 @@ The following keys are defined:
> >      always extremely slow.
> >
> >    * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> > -    in hardware, but are slower than the cooresponding aligned accesses
> > -    sequences.
>
> Nice, fixed the typo by removing the offender ;)
>
> > +    in hardware, but are slower than N byte accesses, where N is the native
> > +    word size.
> >
> >    * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> > -    in hardware and are faster than the cooresponding aligned accesses
> > -    sequences.
> > +    in hardware and are faster than N byte accesses, where N is the native
> > +    word size.
> >
> >    * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
> >      not supported at all and will generate a misaligned address fault.
> > diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> > index 23fed53b8815..b8e917176616 100644
> > --- a/arch/riscv/include/asm/cpufeature.h
> > +++ b/arch/riscv/include/asm/cpufeature.h
> > @@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
> >  /* Per-cpu ISA extensions. */
> >  extern struct riscv_isainfo hart_isa[NR_CPUS];
> >
> > +void check_misaligned_access(int cpu);
> > +
> >  #endif
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index a42951911067..f934d7ab7840 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -34,6 +34,7 @@ extra-y += vmlinux.lds
> >  obj-y        += head.o
> >  obj-y        += soc.o
> >  obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
> > +obj-y        += copy-noalign.o
> >  obj-y        += cpu.o
> >  obj-y        += cpufeature.o
> >  obj-y        += entry.o
> > diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
> > new file mode 100644
> > index 000000000000..3807fc2324b2
> > --- /dev/null
> > +++ b/arch/riscv/kernel/copy-noalign.S
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright (C) 2023 Rivos Inc. */
> > +
> > +#include <linux/linkage.h>
> > +#include <asm/asm.h>
> > +
> > +     .text
> > +
> > +/* void __copy_words_unaligned(void *, const void *, size_t) */
> > +/* Performs a memcpy without aligning buffers, using word loads and stores. */
> > +/* Note: The size is truncated to a multiple of 8 * SZREG */
> > +ENTRY(__copy_words_unaligned)
> > +     andi a4, a2, ~((8*SZREG)-1)
> > +     beqz a4, 2f
> > +     add a3, a1, a4
> > +1:
> > +     REG_L a4,       0(a1)
> > +     REG_L a5,   SZREG(a1)
> > +     REG_L a6, 2*SZREG(a1)
> > +     REG_L a7, 3*SZREG(a1)
> > +     REG_L t0, 4*SZREG(a1)
> > +     REG_L t1, 5*SZREG(a1)
> > +     REG_L t2, 6*SZREG(a1)
> > +     REG_L t3, 7*SZREG(a1)
> > +     REG_S a4,       0(a0)
> > +     REG_S a5,   SZREG(a0)
> > +     REG_S a6, 2*SZREG(a0)
> > +     REG_S a7, 3*SZREG(a0)
> > +     REG_S t0, 4*SZREG(a0)
> > +     REG_S t1, 5*SZREG(a0)
> > +     REG_S t2, 6*SZREG(a0)
> > +     REG_S t3, 7*SZREG(a0)
> > +     addi a0, a0, 8*SZREG
> > +     addi a1, a1, 8*SZREG
> > +     bltu a1, a3, 1b
> > +
> > +2:
> > +     ret
> > +END(__copy_words_unaligned)
> > +
> > +/* void __copy_bytes_unaligned(void *, const void *, size_t) */
> > +/* Performs a memcpy without aligning buffers, using only byte accesses. */
> > +/* Note: The size is truncated to a multiple of 8 */
> > +ENTRY(__copy_bytes_unaligned)
> > +     andi a4, a2, ~(8-1)
> > +     beqz a4, 2f
> > +     add a3, a1, a4
>
> Could you align operands for ASM please, to make this a little easier to
> read?
>
> > +1:
> > +     lb a4, 0(a1)
> > +     lb a5, 1(a1)
> > +     lb a6, 2(a1)
> > +     lb a7, 3(a1)
> > +     lb t0, 4(a1)
> > +     lb t1, 5(a1)
> > +     lb t2, 6(a1)
> > +     lb t3, 7(a1)
> > +     sb a4, 0(a0)
> > +     sb a5, 1(a0)
> > +     sb a6, 2(a0)
> > +     sb a7, 3(a0)
> > +     sb t0, 4(a0)
> > +     sb t1, 5(a0)
> > +     sb t2, 6(a0)
> > +     sb t3, 7(a0)
> > +     addi a0, a0, 8
> > +     addi a1, a1, 8
> > +     bltu a1, a3, 1b
> > +
> > +2:
> > +     ret
> > +END(__copy_bytes_unaligned)
> > diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
> > new file mode 100644
> > index 000000000000..99fbb9c763e0
> > --- /dev/null
> > +++ b/arch/riscv/kernel/copy-noalign.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 Rivos, Inc.
> > + */
> > +#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
> > +#define __RISCV_KERNEL_COPY_NOALIGN_H
> > +
> > +#include <linux/types.h>
> > +
> > +void __copy_words_unaligned(void *dst, const void *src, size_t size);
> > +void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
> > +
> > +#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
> > diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> > index bdcf460ea53d..3f7200dcc00c 100644
> > --- a/arch/riscv/kernel/cpufeature.c
> > +++ b/arch/riscv/kernel/cpufeature.c
> > @@ -19,11 +19,21 @@
> >  #include <asm/cacheflush.h>
> >  #include <asm/cpufeature.h>
> >  #include <asm/hwcap.h>
> > +#include <asm/hwprobe.h>
> >  #include <asm/patch.h>
> >  #include <asm/processor.h>
> >  #include <asm/vector.h>
> >
> > +#include "copy-noalign.h"
> > +
> >  #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
> > +#define MISALIGNED_ACCESS_JIFFIES_LG2 1
> > +#define MISALIGNED_BUFFER_SIZE 0x4000
> > +#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
> > +
>
> I think this blank line is misplaced, it should go after NUM_ALPHA_EXTS
> instead of (or as well as) here.
>
> > +#define MISALIGNED_COPY_MBS(_count) \
> > +     ((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
> > +      (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
> >
> >  unsigned long elf_hwcap __read_mostly;
> >
> > @@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
> >       return hwcap;
> >  }
> >
> > +void check_misaligned_access(int cpu)
> > +{
> > +     unsigned long j0, j1;
> > +     struct page *page;
> > +     void *dst;
> > +     void *src;
> > +     long word_copies = 0;
> > +     long byte_copies = 0;
> > +     long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
>
> Is this not a change from current behaviour, that may actually lead to
> incorrect reporting. Presently, only T-Head stuff sets a speed, so
> hwprobe falls back to UNKNOWN for everything else. With this, we will
> get slow set, for anything failing the test.
> Slow is defined as "Misaligned accesses are supported in hardware, but
> are slower than the cooresponding aligned accesses sequences (sic)", but
> you have no way of knowing, based on the test you are performing, whether
> the hardware supports it or if it is emulated by firmware.
> Perhaps that is not relevant to userspace, but wanted to know your
> thoughts.
>

Hm, that's true. EMULATED was an easy one when we were planning to get
this info from the DT. It also might be an easy one in the future, if
we get an SBI call that allows the kernel to take over misaligned trap
handling. We'd then be able to do a misaligned access and see if our
trap handler got called.

One option is to leave the value alone if we fail the FAST test
(rather than changing it from UNKNOWN to SLOW). This isn't great
though, as it effectively makes UNKNOWN synonymous with SLOW, but in a
way where usermode can't tell the difference between "I truly don't
know" and "I tried the fast test and it failed".

The alternative, as it is now, may mislabel some emulated systems as
slow until the new SBI call shows up. I'm not sure how bad this is in
practice. We could add a subsequent performance bar below which we
guess "emulated". This probably matches what usermode will use that
value for anyway (a synonym for "very slow"), but it's basically the
same problem with reversed polarity (we mislabel some slow systems as
emulated). I'm open to suggestions!

-Evan

> > +
> > +     page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
> > +     if (!page) {
> > +             pr_warn("Can't alloc pages to measure memcpy performance");
> > +             return;
> > +     }
> > +
> > +     /* Make a misaligned destination buffer. */
> > +     dst = (void *)((unsigned long)page_address(page) | 0x1);
> > +     /* Misalign src as well, but differently (off by 1 + 2 = 3). */
> > +     src = dst + (MISALIGNED_BUFFER_SIZE / 2);
> > +     src += 2;
> > +     /* Do a warmup. */
> > +     __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> > +     preempt_disable();
> > +     j0 = jiffies;
> > +     while ((j1 = jiffies) == j0)
> > +             cpu_relax();
> > +
> > +     while (time_before(jiffies,
> > +                        j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
>
> Does this not fit in 100 chars?
>
> > +
> > +             __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> > +             word_copies += 1;
> > +     }
> > +
> > +     __copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> > +     j0 = jiffies;
> > +     while ((j1 = jiffies) == j0)
> > +             cpu_relax();
> > +
> > +     while (time_before(jiffies,
> > +                        j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
>
> Ditto here, no?
>
> Cheers,
> Conor.

Evan Green June 27, 2023, 7:11 p.m. UTC | #5

On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
>
> On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> >
> > Rather than deferring misaligned access speed determinations to a vendor
> > function, let's probe them and find out how fast they are. If we
> > determine that a misaligned word access is faster than N byte accesses,
> > mark the hardware's misaligned access as "fast".
>
> How sure are you that your measurements can be extrapolated and aren’t
> an artefact of the testing process? For example, off the top of my head:
>
> * The first run will potentially be penalised by data cache misses,
>   untrained prefetchers, TLB misses, branch predictors, etc. compared
>   with later runs. You have one warmup, but who knows how many
>   iterations it will take to converge?

I'd expect the cache penalties to be reasonably covered by a single
warmup. You're right about branch prediction, which is why I tried to
use a large-ish buffer size, minimize the ratio of conditionals to
loads/stores, and do the test for a decent number of iterations (on my
THead, about 1800 and 400 for words and bytes).

When I ran the test a handful of times, I did see variation on the
order of ~5%. But the comparison of the two numbers doesn't seem to be
anywhere near that margin (THead C906 was ~4x faster doing misaligned
word accesses, others with slow misaligned accesses also reporting
numbers not anywhere close to each other).

>
> * The code being benchmarked isn’t the code being run, so differences
>   in access patterns, loop unrolling, loop alignment, etc. may cause the
>   real code to behave differently (and perhaps change which is better).

I'm not trying to make statements about memcpy specifically, but
(only) about misaligned accesses, which is why I tried to write loops
that isolated that element as much as possible.

>
> The non-determinism that could in theory result from this also seems
> like a not great idea to have.

This is fair, if we have machines where this waffles from boot to boot
that's not great. In theory if misaligned word accesses come out to
being almost exactly equal to N byte accesses, then it doesn't matter
which you choose, though of course it could still make a difference in
practice. The alternative though of providing no info just pushes the
same problem out into userspace, which seems worse.
-Evan

David Laight June 29, 2023, 12:05 p.m. UTC | #6

From: Evan Green
> Sent: 27 June 2023 20:12
> 
> On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> >
> > On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> > >
> > > Rather than deferring misaligned access speed determinations to a vendor
> > > function, let's probe them and find out how fast they are. If we
> > > determine that a misaligned word access is faster than N byte accesses,
> > > mark the hardware's misaligned access as "fast".
> >
> > How sure are you that your measurements can be extrapolated and aren’t
> > an artefact of the testing process? For example, off the top of my head:
> >
> > * The first run will potentially be penalised by data cache misses,
> >   untrained prefetchers, TLB misses, branch predictors, etc. compared
> >   with later runs. You have one warmup, but who knows how many
> >   iterations it will take to converge?
> 
> I'd expect the cache penalties to be reasonably covered by a single
> warmup. You're right about branch prediction, which is why I tried to
> use a large-ish buffer size, minimize the ratio of conditionals to
> loads/stores, and do the test for a decent number of iterations (on my
> THead, about 1800 and 400 for words and bytes).
> 
> When I ran the test a handful of times, I did see variation on the
> order of ~5%. But the comparison of the two numbers doesn't seem to be
> anywhere near that margin (THead C906 was ~4x faster doing misaligned
> word accesses, others with slow misaligned accesses also reporting
> numbers not anywhere close to each other).

Isn't the EMULATED case so much slower than anything else that
it is even pretty obvious from a single access?
(Possibly the 2nd access to avoid 'cold cache'.)

One of the things that can perturb measurements is hardware
interrupts. That can be mitigated by counting clocks for a few
(10 is plenty) iterations of a short request and taking the
fastest value.
For short hot-cache code sequences you can actually compare the
actual clock counts with theoretical minimum values.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Conor Dooley June 29, 2023, 2:03 p.m. UTC | #7

On Tue, Jun 27, 2023 at 12:11:25PM -0700, Evan Green wrote:
> On Mon, Jun 26, 2023 at 7:15 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > +void check_misaligned_access(int cpu)
> > > +{
> > > +     unsigned long j0, j1;
> > > +     struct page *page;
> > > +     void *dst;
> > > +     void *src;
> > > +     long word_copies = 0;
> > > +     long byte_copies = 0;
> > > +     long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
> >
> > Is this not a change from current behaviour, that may actually lead to
> > incorrect reporting. Presently, only T-Head stuff sets a speed, so
> > hwprobe falls back to UNKNOWN for everything else. With this, we will
> > get slow set, for anything failing the test.
> > Slow is defined as "Misaligned accesses are supported in hardware, but
> > are slower than the cooresponding aligned accesses sequences (sic)", but
> > you have no way of knowing, based on the test you are performing, whether
> > the hardware supports it or if it is emulated by firmware.
> > Perhaps that is not relevant to userspace, but wanted to know your
> > thoughts.
> >
> 
> Hm, that's true. EMULATED was an easy one when we were planning to get
> this info from the DT. It also might be an easy one in the future, if
> we get an SBI call that allows the kernel to take over misaligned trap
> handling. We'd then be able to do a misaligned access and see if our
> trap handler got called.
> 
> One option is to leave the value alone if we fail the FAST test
> (rather than changing it from UNKNOWN to SLOW). This isn't great
> though, as it effectively makes UNKNOWN synonymous with SLOW, but in a
> way where usermode can't tell the difference between "I truly don't
> know" and "I tried the fast test and it failed".
> 
> The alternative, as it is now, may mislabel some emulated systems as
> slow until the new SBI call shows up.

Make that "mislabel some emulated systems forever", existing systems
don't magically grow support for new extensions unfortunately.

Realistically though, does it matter to userspace if it is slow because
the hardware is slow, or if the emulation is slow, since there's not
really a way for userspace to tell from the syscall by how much it is
slower.
It can probably guess that emulation is worse, given how crap the
speed I see on mpfs is.

I'd rather we did say slow, rather than people start to interpret
UNKNOWN as slow.

> I'm not sure how bad this is in
> practice. We could add a subsequent performance bar below which we
> guess "emulated".

Nah, I don't really think that that is required.

> This probably matches what usermode will use that
> value for anyway (a synonym for "very slow"), but it's basically the
> same problem with reversed polarity (we mislabel some slow systems as
> emulated). I'm open to suggestions!

I think I just agreed with you, give or take. If it is fast, say fast.
If it is slow, we say it is slow. If we know it is emulated, then we can
report it being emulated. Is it too late to remove the "hardware" from
the syscall documentation, IOW s/supported in hardware/supported/?

Please actually describe the assumptions/subtleties in the commit
message though, so that the rationale for stuff is in the history :)

Cheers,
Conor.

Evan Green June 29, 2023, 11:09 p.m. UTC | #8

On Thu, Jun 29, 2023 at 5:05 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Evan Green
> > Sent: 27 June 2023 20:12
> >
> > On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> > >
> > > On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> > > >
> > > > Rather than deferring misaligned access speed determinations to a vendor
> > > > function, let's probe them and find out how fast they are. If we
> > > > determine that a misaligned word access is faster than N byte accesses,
> > > > mark the hardware's misaligned access as "fast".
> > >
> > > How sure are you that your measurements can be extrapolated and aren’t
> > > an artefact of the testing process? For example, off the top of my head:
> > >
> > > * The first run will potentially be penalised by data cache misses,
> > >   untrained prefetchers, TLB misses, branch predictors, etc. compared
> > >   with later runs. You have one warmup, but who knows how many
> > >   iterations it will take to converge?
> >
> > I'd expect the cache penalties to be reasonably covered by a single
> > warmup. You're right about branch prediction, which is why I tried to
> > use a large-ish buffer size, minimize the ratio of conditionals to
> > loads/stores, and do the test for a decent number of iterations (on my
> > THead, about 1800 and 400 for words and bytes).
> >
> > When I ran the test a handful of times, I did see variation on the
> > order of ~5%. But the comparison of the two numbers doesn't seem to be
> > anywhere near that margin (THead C906 was ~4x faster doing misaligned
> > word accesses, others with slow misaligned accesses also reporting
> > numbers not anywhere close to each other).
>
> Isn't the EMULATED case so much slower than anything else that
> it is even pretty obvious from a single access?
> (Possibly the 2nd access to avoid 'cold cache'.)
>
> One of the things that can perturb measurements is hardware
> interrupts. That can be mitigated by counting clocks for a few
> (10 is plenty) iterations of a short request and taking the
> fastest value.
> For short hot-cache code sequences you can actually compare the
> actual clock counts with theoretical minimum values.

Yeah, one thing I could do is disable interrupts, measure the cycle
count of doing an individual iteration, do this N times, and take the
minimum value as the time to compare. In the end I'll then have two
numbers to compare, like I do in this patch. In theory the variance on
that should be really tight. N will have to depend on the overall
amount of time I'm taking so as not to shut interrupts off for very
long. Let me experiment with this and see how the results look.
-Evan

Evan Green June 29, 2023, 11:10 p.m. UTC | #9

On Thu, Jun 29, 2023 at 7:03 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> On Tue, Jun 27, 2023 at 12:11:25PM -0700, Evan Green wrote:
> > On Mon, Jun 26, 2023 at 7:15 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > > +void check_misaligned_access(int cpu)
> > > > +{
> > > > +     unsigned long j0, j1;
> > > > +     struct page *page;
> > > > +     void *dst;
> > > > +     void *src;
> > > > +     long word_copies = 0;
> > > > +     long byte_copies = 0;
> > > > +     long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
> > >
> > > Is this not a change from current behaviour, that may actually lead to
> > > incorrect reporting. Presently, only T-Head stuff sets a speed, so
> > > hwprobe falls back to UNKNOWN for everything else. With this, we will
> > > get slow set, for anything failing the test.
> > > Slow is defined as "Misaligned accesses are supported in hardware, but
> > > are slower than the cooresponding aligned accesses sequences (sic)", but
> > > you have no way of knowing, based on the test you are performing, whether
> > > the hardware supports it or if it is emulated by firmware.
> > > Perhaps that is not relevant to userspace, but wanted to know your
> > > thoughts.
> > >
> >
> > Hm, that's true. EMULATED was an easy one when we were planning to get
> > this info from the DT. It also might be an easy one in the future, if
> > we get an SBI call that allows the kernel to take over misaligned trap
> > handling. We'd then be able to do a misaligned access and see if our
> > trap handler got called.
> >
> > One option is to leave the value alone if we fail the FAST test
> > (rather than changing it from UNKNOWN to SLOW). This isn't great
> > though, as it effectively makes UNKNOWN synonymous with SLOW, but in a
> > way where usermode can't tell the difference between "I truly don't
> > know" and "I tried the fast test and it failed".
> >
> > The alternative, as it is now, may mislabel some emulated systems as
> > slow until the new SBI call shows up.
>
> Make that "mislabel some emulated systems forever", existing systems
> don't magically grow support for new extensions unfortunately.

Right.

>
> Realistically though, does it matter to userspace if it is slow because
> the hardware is slow, or if the emulation is slow, since there's not
> really a way for userspace to tell from the syscall by how much it is
> slower.
> It can probably guess that emulation is worse, given how crap the
> speed I see on mpfs is.
>
> I'd rather we did say slow, rather than people start to interpret
> UNKNOWN as slow.

I think I agree.

>
> > I'm not sure how bad this is in
> > practice. We could add a subsequent performance bar below which we
> > guess "emulated".
>
> Nah, I don't really think that that is required.
>
> > This probably matches what usermode will use that
> > value for anyway (a synonym for "very slow"), but it's basically the
> > same problem with reversed polarity (we mislabel some slow systems as
> > emulated). I'm open to suggestions!
>
> I think I just agreed with you, give or take. If it is fast, say fast.
> If it is slow, we say it is slow. If we know it is emulated, then we can
> report it being emulated. Is it too late to remove the "hardware" from
> the syscall documentation, IOW s/supported in hardware/supported/?
>
> Please actually describe the assumptions/subtleties in the commit
> message though, so that the rationale for stuff is in the history :)

Will do. I pondered an alternative of creating a "gray zone" where if
misaligned words and bytes come out close to each other (which I don't
expect them to), we leave the setting of UNKNOWN alone. But I'm not
sure this really solves anything, it just moves the "waffle point"
around, so I couldn't convince myself it was valuable.
-Evan

David Laight June 30, 2023, 8:29 a.m. UTC | #10

...
> Yeah, one thing I could do is disable interrupts, measure the cycle
> count of doing an individual iteration, do this N times, and take the
> minimum value as the time to compare. In the end I'll then have two
> numbers to compare, like I do in this patch. In theory the variance on
> that should be really tight. N will have to depend on the overall
> amount of time I'm taking so as not to shut interrupts off for very
> long. Let me experiment with this and see how the results look.
> -Evan

I doubt you'll need many iterations or a long test.

You can do tests in userspace without disabling pre-emption
or interrupts - the large/silly values they generate are
easily ignored.

I suspect you'll get enough info from something like:
	unsigned long x[2];
	volatile unsigned long *p = (void *)((unsigned char *)x + 1);
	full_cpu_barrier()
	start = rdtsc();
	full_cpu_barrier();
	*p; *p; *p; *p; *p; *p; *p; *p;
	*p; *p; *p; *p; *p; *p; *p; *p;
	full_cpu_barrier()
	elapsed = rdtsc() - start;
Once the i-cache is loaded it should be pretty constant.
For aligned addresses I'd expect each extra '*p' to be
one more clock.
With hardware support for misaligned transfers at most
2 clocks (test on x86 and it will be 1 clock).
The emulated version will be 100s or 1000s.
	
I'm not sure how much of a cpu barrier you need.
Definitely needs to wait for all memory accesses
and the rdtsc().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

[1/2] RISC-V: Probe for unaligned access speed

Checks

Commit Message

Comments

Patch