[v4,2/5] riscv: Add checksum library

Message ID	20230911-optimize_checksum-v4-2-77cc2ad9e9d7@rivosinc.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org> From: Charlie Jenkins <charlie@rivosinc.com> Date: Mon, 11 Sep 2023 15:57:12 -0700 Subject: [PATCH v4 2/5] riscv: Add checksum library MIME-Version: 1.0 Message-Id: <20230911-optimize_checksum-v4-2-77cc2ad9e9d7@rivosinc.com> References: <20230911-optimize_checksum-v4-0-77cc2ad9e9d7@rivosinc.com> In-Reply-To: <20230911-optimize_checksum-v4-0-77cc2ad9e9d7@rivosinc.com> To: Charlie Jenkins <charlie@rivosinc.com>, Palmer Dabbelt <palmer@dabbelt.com>, Conor Dooley <conor@kernel.org>, Samuel Holland <samuel.holland@sifive.com>, David Laight <David.Laight@aculab.com>, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org Cc: Paul Walmsley <paul.walmsley@sifive.com>, Albert Ou <aou@eecs.berkeley.edu> Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org> Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org
Series	riscv: Add fine-tuned checksum functions \| expand [v4,0/5] riscv: Add fine-tuned checksum functions [v4,1/5] riscv: Checksum header [v4,2/5] riscv: Add checksum library [v4,3/5] riscv: Vector checksum header [v4,4/5] riscv: Vector checksum library [v4,5/5] riscv: Test checksum functions

Context	Check	Description
conchuod/cover_letter	success	Series has a cover letter
conchuod/tree_selection	success	Guessed tree name to be for-next at HEAD 0bb80ecc33a8
conchuod/fixes_present	success	Fixes tag not required for -next series
conchuod/maintainers_pattern	success	MAINTAINERS pattern errors before the patch: 5 and now 5
conchuod/verify_signedoff	success	Signed-off-by tag matches author and committer
conchuod/kdoc	success	Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig	fail	Failed to build the tree with this patch.
conchuod/module_param	success	Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig	success	Errors and warnings before: 379 this patch: 379
conchuod/build_rv32_defconfig	fail	Build failed
conchuod/dtb_warn_rv64	success	Errors and warnings before: 25 this patch: 25
conchuod/header_inline	success	No static functions without inline keyword in header files
conchuod/checkpatch	warning	CHECK: extern prototypes should be avoided in .h files WARNING: Avoid line continuations in quoted strings WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: else is not generally useful after a break or return WARNING: unnecessary whitespace before a quoted newline
conchuod/build_rv64_nommu_k210_defconfig	success	Build OK
conchuod/verify_fixes	success	No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig	success	Build OK

Charlie Jenkins Sept. 11, 2023, 10:57 p.m. UTC

Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
will load from the buffer in groups of 32 bits, and when compiled for
64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
csum_ipv6_magic (64-bit version) for an x86 chip as well as running
the riscv generated code in QEMU, discovered that summing in a
tree-like structure is about 4% faster than doing 64-bit reads.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/riscv/include/asm/checksum.h |  11 ++
 arch/riscv/lib/Makefile           |   1 +
 arch/riscv/lib/csum.c             | 210 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 222 insertions(+)

David Laight Sept. 12, 2023, 8:45 a.m. UTC | #1

From: Charlie Jenkins
> Sent: 11 September 2023 23:57
> 
> Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> will load from the buffer in groups of 32 bits, and when compiled for
> 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> the riscv generated code in QEMU, discovered that summing in a
> tree-like structure is about 4% faster than doing 64-bit reads.
> 
...
> +	sum   = saddr->s6_addr32[0];
> +	sum  += saddr->s6_addr32[1];
> +	sum1  = saddr->s6_addr32[2];
> +	sum1 += saddr->s6_addr32[3];
> +
> +	sum2  = daddr->s6_addr32[0];
> +	sum2 += daddr->s6_addr32[1];
> +	sum3  = daddr->s6_addr32[2];
> +	sum3 += daddr->s6_addr32[3];
> +
> +	sum4  = csum;
> +	sum4 += ulen;
> +	sum4 += uproto;
> +
> +	sum  += sum1;
> +	sum2 += sum3;
> +
> +	sum += sum2;
> +	sum += sum4;

Have you got gcc to compile that as-is?

Whenever I've tried to get a 'tree add' compiled so that the
early adds can be executed in parallel gcc always pessimises
it to a linear sequence of adds.

But I agree that adding 32bit values to a 64bit register
may be no slower than trying to do an 'add carry' sequence
that is guaranteed to only do one add/clock.
(And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)

IIRC RISCV doesn't have a carry flag, so the adc sequence
is hard - probably takes two extra instructions per value.
Although with parallel execute it may not matter.
Consider:
	val = buf[offset];
	sum += val;
	carry += sum < val;
	val = buf[offset1];
	sum += val;
	...
the compare and 'carry +=' can be executed at the same time
as the following two instructions.
You do then a final sum += carry; sum += sum < carry;

Assuming all instructions are 1 clock and any read delays
get filled with other instructions (by source or hardware
instruction re-ordering) even without parallel execute
that is 4 clocks for 64 bits, which is much the same as the
2 clocks for 32 bits.

Remember that all the 32bit values can summed first as
they won't overflow.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Charlie Jenkins Sept. 13, 2023, 3:09 a.m. UTC | #2

On Tue, Sep 12, 2023 at 08:45:38AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 11 September 2023 23:57
> > 
> > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > will load from the buffer in groups of 32 bits, and when compiled for
> > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > the riscv generated code in QEMU, discovered that summing in a
> > tree-like structure is about 4% faster than doing 64-bit reads.
> > 
> ...
> > +	sum   = saddr->s6_addr32[0];
> > +	sum  += saddr->s6_addr32[1];
> > +	sum1  = saddr->s6_addr32[2];
> > +	sum1 += saddr->s6_addr32[3];
> > +
> > +	sum2  = daddr->s6_addr32[0];
> > +	sum2 += daddr->s6_addr32[1];
> > +	sum3  = daddr->s6_addr32[2];
> > +	sum3 += daddr->s6_addr32[3];
> > +
> > +	sum4  = csum;
> > +	sum4 += ulen;
> > +	sum4 += uproto;
> > +
> > +	sum  += sum1;
> > +	sum2 += sum3;
> > +
> > +	sum += sum2;
> > +	sum += sum4;
> 
> Have you got gcc to compile that as-is?
> 
> Whenever I've tried to get a 'tree add' compiled so that the
> early adds can be executed in parallel gcc always pessimises
> it to a linear sequence of adds.
> 
> But I agree that adding 32bit values to a 64bit register
> may be no slower than trying to do an 'add carry' sequence
> that is guaranteed to only do one add/clock.
> (And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
> 
> IIRC RISCV doesn't have a carry flag, so the adc sequence
> is hard - probably takes two extra instructions per value.
> Although with parallel execute it may not matter.
> Consider:
> 	val = buf[offset];
> 	sum += val;
> 	carry += sum < val;
> 	val = buf[offset1];
> 	sum += val;
> 	...
> the compare and 'carry +=' can be executed at the same time
> as the following two instructions.
> You do then a final sum += carry; sum += sum < carry;
> 
> Assuming all instructions are 1 clock and any read delays
> get filled with other instructions (by source or hardware
> instruction re-ordering) even without parallel execute
> that is 4 clocks for 64 bits, which is much the same as the
> 2 clocks for 32 bits.
> 
> Remember that all the 32bit values can summed first as
> they won't overflow.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Yeah it does seem like the tree-add does just do a linear add. All three
of them were pretty much the same on riscv so I used the version that
did best on x86 with the knowledge that my QEMU setup does not
accurately represent real hardware.

I don't quite understand how doing the carry in the middle of each
stage, even though it can be executed at the same time, would be faster
than just doing a single overflow check at the end. I can just revert
back to the non-tree add version since there is no improvement on riscv.
I can also revert back to the default version that uses carry += sum < val
as well.

- Charlie

David Laight Sept. 13, 2023, 8:47 a.m. UTC | #3

From: Charlie Jenkins
> Sent: 13 September 2023 04:10
> 
> On Tue, Sep 12, 2023 at 08:45:38AM +0000, David Laight wrote:
> > From: Charlie Jenkins
> > > Sent: 11 September 2023 23:57
> > >
> > > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > > will load from the buffer in groups of 32 bits, and when compiled for
> > > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > > the riscv generated code in QEMU, discovered that summing in a
> > > tree-like structure is about 4% faster than doing 64-bit reads.
> > >
> > ...
> > > +	sum   = saddr->s6_addr32[0];
> > > +	sum  += saddr->s6_addr32[1];
> > > +	sum1  = saddr->s6_addr32[2];
> > > +	sum1 += saddr->s6_addr32[3];
> > > +
> > > +	sum2  = daddr->s6_addr32[0];
> > > +	sum2 += daddr->s6_addr32[1];
> > > +	sum3  = daddr->s6_addr32[2];
> > > +	sum3 += daddr->s6_addr32[3];
> > > +
> > > +	sum4  = csum;
> > > +	sum4 += ulen;
> > > +	sum4 += uproto;
> > > +
> > > +	sum  += sum1;
> > > +	sum2 += sum3;
> > > +
> > > +	sum += sum2;
> > > +	sum += sum4;
> >
> > Have you got gcc to compile that as-is?
> >
> > Whenever I've tried to get a 'tree add' compiled so that the
> > early adds can be executed in parallel gcc always pessimises
> > it to a linear sequence of adds.
> >
> > But I agree that adding 32bit values to a 64bit register
> > may be no slower than trying to do an 'add carry' sequence
> > that is guaranteed to only do one add/clock.
> > (And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
> >
> > IIRC RISCV doesn't have a carry flag, so the adc sequence
> > is hard - probably takes two extra instructions per value.
> > Although with parallel execute it may not matter.
> > Consider:
> > 	val = buf[offset];
> > 	sum += val;
> > 	carry += sum < val;
> > 	val = buf[offset1];
> > 	sum += val;
> > 	...
> > the compare and 'carry +=' can be executed at the same time
> > as the following two instructions.
> > You do then a final sum += carry; sum += sum < carry;
> >
> > Assuming all instructions are 1 clock and any read delays
> > get filled with other instructions (by source or hardware
> > instruction re-ordering) even without parallel execute
> > that is 4 clocks for 64 bits, which is much the same as the
> > 2 clocks for 32 bits.
> >
> > Remember that all the 32bit values can summed first as
> > they won't overflow.
> >
> > 	David

> Yeah it does seem like the tree-add does just do a linear add. All three
> of them were pretty much the same on riscv so I used the version that
> did best on x86 with the knowledge that my QEMU setup does not
> accurately represent real hardware.

The problem there is that any measurement on x86 has pretty much
no relevance to what any RISCV cpu might do.
The multiple execution units and out of order execution on x86
are far different from anything any RISCV cpu is likely to have
for many years.
You might get nearer running on one of the Atom cpu - but it won't
really match.
There are too many fundamental differences between the architectures.

All you can do is to find and read the instruction timings for
a target physical cpu and look for things like:
- Whether arithmetic results are available next clock.
  (It probably is)
- How many clocks it takes for read data to be available.
  I suspect the cpu will stall if the data is needed.
  A block of sequential reads is one way to avoid the stall.
  On x86 the instruction that needs the data is just deferred
  until it is available, the following instructions execute
  (provided their input are all available).
- Clock delays for taken/not taken predicted/not predicted branches.
  
> I don't quite understand how doing the carry in the middle of each
> stage, even though it can be executed at the same time, would be faster
> than just doing a single overflow check at the end.

You need to do half as many reads and adds.

> I can just revert
> back to the non-tree add version since there is no improvement on riscv.

The 'tree' version is only likely to be faster on cpu (like x86)
that can (at least sometimes) do two memory reads in one clock
and can do two adds and two read in the same clock.
Even then, without out of order execution, it is hard to get right.

Oh, you might want to try getting the default csum_fold() to
be the faster 'arc' version rather than adding your own version.

	David

> I can also revert back to the default version that uses carry += sum < val
> as well.
> 
> - Charlie

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Charlie Jenkins Sept. 13, 2023, 11:18 p.m. UTC | #4

On Wed, Sep 13, 2023 at 08:47:49AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 13 September 2023 04:10
> > 
> > On Tue, Sep 12, 2023 at 08:45:38AM +0000, David Laight wrote:
> > > From: Charlie Jenkins
> > > > Sent: 11 September 2023 23:57
> > > >
> > > > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > > > will load from the buffer in groups of 32 bits, and when compiled for
> > > > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > > > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > > > the riscv generated code in QEMU, discovered that summing in a
> > > > tree-like structure is about 4% faster than doing 64-bit reads.
> > > >
> > > ...
> > > > +	sum   = saddr->s6_addr32[0];
> > > > +	sum  += saddr->s6_addr32[1];
> > > > +	sum1  = saddr->s6_addr32[2];
> > > > +	sum1 += saddr->s6_addr32[3];
> > > > +
> > > > +	sum2  = daddr->s6_addr32[0];
> > > > +	sum2 += daddr->s6_addr32[1];
> > > > +	sum3  = daddr->s6_addr32[2];
> > > > +	sum3 += daddr->s6_addr32[3];
> > > > +
> > > > +	sum4  = csum;
> > > > +	sum4 += ulen;
> > > > +	sum4 += uproto;
> > > > +
> > > > +	sum  += sum1;
> > > > +	sum2 += sum3;
> > > > +
> > > > +	sum += sum2;
> > > > +	sum += sum4;
> > >
> > > Have you got gcc to compile that as-is?
> > >
> > > Whenever I've tried to get a 'tree add' compiled so that the
> > > early adds can be executed in parallel gcc always pessimises
> > > it to a linear sequence of adds.
> > >
> > > But I agree that adding 32bit values to a 64bit register
> > > may be no slower than trying to do an 'add carry' sequence
> > > that is guaranteed to only do one add/clock.
> > > (And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
> > >
> > > IIRC RISCV doesn't have a carry flag, so the adc sequence
> > > is hard - probably takes two extra instructions per value.
> > > Although with parallel execute it may not matter.
> > > Consider:
> > > 	val = buf[offset];
> > > 	sum += val;
> > > 	carry += sum < val;
> > > 	val = buf[offset1];
> > > 	sum += val;
> > > 	...
> > > the compare and 'carry +=' can be executed at the same time
> > > as the following two instructions.
> > > You do then a final sum += carry; sum += sum < carry;
> > >
> > > Assuming all instructions are 1 clock and any read delays
> > > get filled with other instructions (by source or hardware
> > > instruction re-ordering) even without parallel execute
> > > that is 4 clocks for 64 bits, which is much the same as the
> > > 2 clocks for 32 bits.
> > >
> > > Remember that all the 32bit values can summed first as
> > > they won't overflow.
> > >
> > > 	David
> 
> > Yeah it does seem like the tree-add does just do a linear add. All three
> > of them were pretty much the same on riscv so I used the version that
> > did best on x86 with the knowledge that my QEMU setup does not
> > accurately represent real hardware.
> 
> The problem there is that any measurement on x86 has pretty much
> no relevance to what any RISCV cpu might do.
> The multiple execution units and out of order execution on x86
> are far different from anything any RISCV cpu is likely to have
> for many years.
> You might get nearer running on one of the Atom cpu - but it won't
> really match.
> There are too many fundamental differences between the architectures.
> 
> All you can do is to find and read the instruction timings for
> a target physical cpu and look for things like:
> - Whether arithmetic results are available next clock.
>   (It probably is)
> - How many clocks it takes for read data to be available.
>   I suspect the cpu will stall if the data is needed.
>   A block of sequential reads is one way to avoid the stall.
>   On x86 the instruction that needs the data is just deferred
>   until it is available, the following instructions execute
>   (provided their input are all available).
> - Clock delays for taken/not taken predicted/not predicted branches.
>   
> > I don't quite understand how doing the carry in the middle of each
> > stage, even though it can be executed at the same time, would be faster
> > than just doing a single overflow check at the end.
> 
> You need to do half as many reads and adds.
> 
I missed that you were talking about 64-bit reads. I was talking to
somebody about this a couple weeks ago and they mentioned a counter
example that showed that adding the carry after was not the same as
adding it in the middle. Even though addition is commutative, I wasn't
sure if the overflow checking was. I can't rememeber what the counter
example was, but I have a feeling it was flawed.
> > I can just revert
> > back to the non-tree add version since there is no improvement on riscv.
> 
> The 'tree' version is only likely to be faster on cpu (like x86)
> that can (at least sometimes) do two memory reads in one clock
> and can do two adds and two read in the same clock.
> Even then, without out of order execution, it is hard to get right.
> 
> Oh, you might want to try getting the default csum_fold() to
> be the faster 'arc' version rather than adding your own version.
> 
> 	David
> 
> > I can also revert back to the default version that uses carry += sum < val
> > as well.
> > 
> > - Charlie
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Charlie Jenkins Sept. 14, 2023, 12:41 a.m. UTC | #5

On Wed, Sep 13, 2023 at 07:18:18PM -0400, Charlie Jenkins wrote:
> On Wed, Sep 13, 2023 at 08:47:49AM +0000, David Laight wrote:
> > From: Charlie Jenkins
> > > Sent: 13 September 2023 04:10
> > > 
> > > On Tue, Sep 12, 2023 at 08:45:38AM +0000, David Laight wrote:
> > > > From: Charlie Jenkins
> > > > > Sent: 11 September 2023 23:57
> > > > >
> > > > > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > > > > will load from the buffer in groups of 32 bits, and when compiled for
> > > > > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > > > > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > > > > the riscv generated code in QEMU, discovered that summing in a
> > > > > tree-like structure is about 4% faster than doing 64-bit reads.
> > > > >
> > > > ...
> > > > > +	sum   = saddr->s6_addr32[0];
> > > > > +	sum  += saddr->s6_addr32[1];
> > > > > +	sum1  = saddr->s6_addr32[2];
> > > > > +	sum1 += saddr->s6_addr32[3];
> > > > > +
> > > > > +	sum2  = daddr->s6_addr32[0];
> > > > > +	sum2 += daddr->s6_addr32[1];
> > > > > +	sum3  = daddr->s6_addr32[2];
> > > > > +	sum3 += daddr->s6_addr32[3];
> > > > > +
> > > > > +	sum4  = csum;
> > > > > +	sum4 += ulen;
> > > > > +	sum4 += uproto;
> > > > > +
> > > > > +	sum  += sum1;
> > > > > +	sum2 += sum3;
> > > > > +
> > > > > +	sum += sum2;
> > > > > +	sum += sum4;
> > > >
> > > > Have you got gcc to compile that as-is?
> > > >
> > > > Whenever I've tried to get a 'tree add' compiled so that the
> > > > early adds can be executed in parallel gcc always pessimises
> > > > it to a linear sequence of adds.
> > > >
> > > > But I agree that adding 32bit values to a 64bit register
> > > > may be no slower than trying to do an 'add carry' sequence
> > > > that is guaranteed to only do one add/clock.
> > > > (And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)
> > > >
> > > > IIRC RISCV doesn't have a carry flag, so the adc sequence
> > > > is hard - probably takes two extra instructions per value.
> > > > Although with parallel execute it may not matter.
> > > > Consider:
> > > > 	val = buf[offset];
> > > > 	sum += val;
> > > > 	carry += sum < val;
> > > > 	val = buf[offset1];
> > > > 	sum += val;
> > > > 	...
> > > > the compare and 'carry +=' can be executed at the same time
> > > > as the following two instructions.
> > > > You do then a final sum += carry; sum += sum < carry;
> > > >
> > > > Assuming all instructions are 1 clock and any read delays
> > > > get filled with other instructions (by source or hardware
> > > > instruction re-ordering) even without parallel execute
> > > > that is 4 clocks for 64 bits, which is much the same as the
> > > > 2 clocks for 32 bits.
> > > >
> > > > Remember that all the 32bit values can summed first as
> > > > they won't overflow.
> > > >
> > > > 	David
> > 
> > > Yeah it does seem like the tree-add does just do a linear add. All three
> > > of them were pretty much the same on riscv so I used the version that
> > > did best on x86 with the knowledge that my QEMU setup does not
> > > accurately represent real hardware.
> > 
> > The problem there is that any measurement on x86 has pretty much
> > no relevance to what any RISCV cpu might do.
> > The multiple execution units and out of order execution on x86
> > are far different from anything any RISCV cpu is likely to have
> > for many years.
> > You might get nearer running on one of the Atom cpu - but it won't
> > really match.
> > There are too many fundamental differences between the architectures.
> > 
> > All you can do is to find and read the instruction timings for
> > a target physical cpu and look for things like:
> > - Whether arithmetic results are available next clock.
> >   (It probably is)
> > - How many clocks it takes for read data to be available.
> >   I suspect the cpu will stall if the data is needed.
> >   A block of sequential reads is one way to avoid the stall.
> >   On x86 the instruction that needs the data is just deferred
> >   until it is available, the following instructions execute
> >   (provided their input are all available).
> > - Clock delays for taken/not taken predicted/not predicted branches.
> >   
> > > I don't quite understand how doing the carry in the middle of each
> > > stage, even though it can be executed at the same time, would be faster
> > > than just doing a single overflow check at the end.
> > 
> > You need to do half as many reads and adds.
> > 
> I missed that you were talking about 64-bit reads. I was talking to
> somebody about this a couple weeks ago and they mentioned a counter
> example that showed that adding the carry after was not the same as
> adding it in the middle. Even though addition is commutative, I wasn't
> sure if the overflow checking was. I can't rememeber what the counter
> example was, but I have a feeling it was flawed.

Sorry to double respond to this. It seems like it is the same. However
it seems like it is still slower. After cleaning up my benchmarking more,
it seems like the best way to go is to use the 32-bit adds and
accumulate the overflow in the upper 32 bits.

> > > I can just revert
> > > back to the non-tree add version since there is no improvement on riscv.
> > 
> > The 'tree' version is only likely to be faster on cpu (like x86)
> > that can (at least sometimes) do two memory reads in one clock
> > and can do two adds and two read in the same clock.
> > Even then, without out of order execution, it is hard to get right.
> > 
> > Oh, you might want to try getting the default csum_fold() to
> > be the faster 'arc' version rather than adding your own version.
I do like this idea. I can extract out the changes into the default
version.

- Charlie
> > 
> > 	David
> > 
> > > I can also revert back to the default version that uses carry += sum < val
> > > as well.
> > > 
> > > - Charlie
> > 
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
> >

Conor Dooley Sept. 14, 2023, 12:25 p.m. UTC | #6

On Mon, Sep 11, 2023 at 03:57:12PM -0700, Charlie Jenkins wrote:
> Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> will load from the buffer in groups of 32 bits, and when compiled for
> 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> the riscv generated code in QEMU, discovered that summing in a
> tree-like structure is about 4% faster than doing 64-bit reads.
> 
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> ---
>  arch/riscv/include/asm/checksum.h |  11 ++
>  arch/riscv/lib/Makefile           |   1 +
>  arch/riscv/lib/csum.c             | 210 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 222 insertions(+)
> 
> diff --git a/arch/riscv/include/asm/checksum.h b/arch/riscv/include/asm/checksum.h
> index 0d7fc8275a5e..a09a4053fb87 100644
> --- a/arch/riscv/include/asm/checksum.h
> +++ b/arch/riscv/include/asm/checksum.h
> @@ -16,6 +16,14 @@ typedef unsigned int csum_t;
>  typedef unsigned long csum_t;
>  #endif
>  
> +/* Default version is sufficient for 32 bit */
> +#ifdef CONFIG_64BIT
> +#define _HAVE_ARCH_IPV6_CSUM
> +__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> +			const struct in6_addr *daddr,
> +			__u32 len, __u8 proto, __wsum sum);
> +#endif
> +
>  /*
>   *	Fold a partial checksum without adding pseudo headers
>   */
> @@ -90,6 +98,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  
>  #define ip_fast_csum ip_fast_csum
>  
> +extern unsigned int do_csum(const unsigned char *buff, int len);
> +#define do_csum do_csum
> +
>  #include <asm-generic/checksum.h>
>  
>  #endif // __ASM_RISCV_CHECKSUM_H
> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> index 26cb2502ecf8..2aa1a4ad361f 100644
> --- a/arch/riscv/lib/Makefile
> +++ b/arch/riscv/lib/Makefile
> @@ -6,6 +6,7 @@ lib-y			+= memmove.o
>  lib-y			+= strcmp.o
>  lib-y			+= strlen.o
>  lib-y			+= strncmp.o
> +lib-y			+= csum.o
>  lib-$(CONFIG_MMU)	+= uaccess.o
>  lib-$(CONFIG_64BIT)	+= tishift.o
>  lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
> diff --git a/arch/riscv/lib/csum.c b/arch/riscv/lib/csum.c
> new file mode 100644
> index 000000000000..47d98c51bab2
> --- /dev/null
> +++ b/arch/riscv/lib/csum.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * IP checksum library
> + *
> + * Influenced by arch/arm64/lib/csum.c
> + * Copyright (C) 2023 Rivos Inc.
> + */
> +#include <linux/bitops.h>
> +#include <linux/compiler.h>
> +#include <linux/kasan-checks.h>
> +#include <linux/kernel.h>
> +
> +#include <net/checksum.h>
> +
> +/* Default version is sufficient for 32 bit */
> +#ifndef CONFIG_32BIT
> +__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> +			const struct in6_addr *daddr,
> +			__u32 len, __u8 proto, __wsum csum)
> +{
> +	/*
> +	 * Inform the compiler/processor that the operation we are performing is
> +	 * "Commutative and Associative" by summing parts of the checksum in a
> +	 * tree-like structure (Section 2(A) of "Computing the Internet
> +	 * Checksum"). Furthermore, defer the overflow until the end of the
> +	 * computation which is shown to be valid in Section 2(C)(1) of the
> +	 * same handbook.
> +	 */
> +	unsigned long sum, sum1, sum2, sum3, sum4, ulen, uproto;
> +
> +	uproto = htonl(proto);
> +	ulen = htonl(len);
> +
> +	sum   = saddr->s6_addr32[0];
> +	sum  += saddr->s6_addr32[1];
> +	sum1  = saddr->s6_addr32[2];
> +	sum1 += saddr->s6_addr32[3];
> +
> +	sum2  = daddr->s6_addr32[0];
> +	sum2 += daddr->s6_addr32[1];
> +	sum3  = daddr->s6_addr32[2];
> +	sum3 += daddr->s6_addr32[3];
> +
> +	sum4  = csum;
> +	sum4 += ulen;
> +	sum4 += uproto;
> +
> +	sum  += sum1;
> +	sum2 += sum3;
> +
> +	sum += sum2;
> +	sum += sum4;
> +
> +	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> +	    IS_ENABLED(CONFIG_RISCV_ALTERNATIVE)) {
> +		csum_t fold_temp;
> +
> +		/*
> +		 * Zbb is likely available when the kernel is compiled with Zbb
> +		 * support, so nop when Zbb is available and jump when Zbb is
> +		 * not available.
> +		 */
> +		asm_volatile_goto(ALTERNATIVE("j %l[no_zbb]", "nop", 0,
> +					      RISCV_ISA_EXT_ZBB, 1)
> +				  :
> +				  :
> +				  :
> +				  : no_zbb);
> +		asm(".option push					\n\
> +		.option arch,+zbb					\n\
> +			rori	%[fold_temp], %[sum], 32		\n\
> +			add	%[sum], %[fold_temp], %[sum]		\n\
> +			srli	%[sum], %[sum], 32			\n\
> +			not	%[fold_temp], %[sum]			\n\
> +			roriw	%[sum], %[sum], 16			\n\
> +			subw	%[sum], %[fold_temp], %[sum]		\n\
> +		.option pop"
> +		: [sum] "+r" (sum), [fold_temp] "=&r" (fold_temp));
> +		return (__force __sum16)(sum >> 16);
> +	}
> +no_zbb:
> +	sum += (sum >> 32) | (sum << 32);
> +	sum >>= 32;
> +	return csum_fold((__force __wsum)sum);
> +}
> +EXPORT_SYMBOL(csum_ipv6_magic);
> +#endif // !CONFIG_32BIT
> +
> +#ifdef CONFIG_32BIT
> +#define OFFSET_MASK 3
> +#elif CONFIG_64BIT
> +#define OFFSET_MASK 7
> +#endif
> +
> +/*
> + * Perform a checksum on an arbitrary memory address.
> + * Algorithm accounts for buff being misaligned.
> + * If buff is not aligned, will over-read bytes but not use the bytes that it
> + * shouldn't. The same thing will occur on the tail-end of the read.
> + */
> +unsigned int __no_sanitize_address do_csum(const unsigned char *buff, int len)
> +{
> +	unsigned int offset, shift;
> +	csum_t csum, data;
> +	const csum_t *ptr;
> +
> +	if (unlikely(len <= 0))
> +		return 0;
> +	/*
> +	 * To align the address, grab the whole first byte in buff.
> +	 * Since it is inside of a same byte, it will never cross pages or cache
> +	 * lines.
> +	 * Directly call KASAN with the alignment we will be using.
> +	 */
> +	offset = (csum_t)buff & OFFSET_MASK;
> +	kasan_check_read(buff, len);
> +	ptr = (const csum_t *)(buff - offset);
> +	len = len + offset - sizeof(csum_t);
> +
> +	/*
> +	 * Clear the most signifant bits that were over-read if buff was not
> +	 * aligned.
> +	 */
> +	shift = offset * 8;
> +	data = *ptr;
> +#ifdef __LITTLE_ENDIAN
> +	data = (data >> shift) << shift;
> +#else
> +	data = (data << shift) >> shift;
> +#endif
> +	/*
> +	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> +	 * faster than doing 32-bit reads on architectures that support larger
> +	 * reads.
> +	 */
> +	while (len > 0) {
> +		csum += data;

arch/riscv/lib/csum.c:137:3: warning: variable 'csum' is uninitialized when used here [-Wuninitialized]
                csum += data;
                ^~~~
arch/riscv/lib/csum.c:104:13: note: initialize the variable 'csum' to silence this warning
        csum_t csum, data;
                   ^
                    = 0
> +		csum += csum < data;
> +		len -= sizeof(csum_t);
> +		ptr += 1;
> +		data = *ptr;
> +	}
> +
> +	/*
> +	 * Perform alignment (and over-read) bytes on the tail if any bytes
> +	 * leftover.
> +	 */
> +	shift = len * -8;
> +#ifdef __LITTLE_ENDIAN
> +	data = (data << shift) >> shift;
> +#else
> +	data = (data >> shift) << shift;
> +#endif
> +	csum += data;
> +	csum += csum < data;
> +
> +	if (!riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
> +		goto no_zbb;

I think this is missing a change for IS_ENABLED(CONFIG_RISCV_ISA_ZBB)?
arch/riscv/lib/csum.c:180:1: warning: unknown option, expected 'push', 'pop', 'rvc', 'norvc', 'relax' or 'norelax' [-Winline-asm]
                .option arch,+zbb                               \n\
^
<inline asm>:2:11: note: instantiated into assembly here
                .option arch,+zbb                               
                        ^
arch/riscv/lib/csum.c:181:1: error: instruction requires the following: 'Zbb' (Basic Bit-Manipulation) or 'Zbkb' (Bitmanip instructions for Cryptography)
                        rori    %[fold_temp], %[csum], 32       \n\
^
<inline asm>:3:4: note: instantiated into assembly here
                        rori    a2, a0, 32      
                        ^
arch/riscv/lib/csum.c:184:1: error: instruction requires the following: 'Zbb' (Basic Bit-Manipulation) or 'Zbkb' (Bitmanip instructions for Cryptography)
                        roriw   %[fold_temp], %[csum], 16       \n\
^
<inline asm>:6:4: note: instantiated into assembly here
                        roriw   a2, a0, 16      
                        ^
arch/riscv/lib/csum.c:188:1: error: instruction requires the following: 'Zbb' (Basic Bit-Manipulation) or 'Zbkb' (Bitmanip instructions for Cryptography)
                        rev8    %[csum], %[csum]                \n\
^
<inline asm>:10:4: note: instantiated into assembly here
                        rev8    a0, a0          
                        ^
2 warnings and 3 errors generated.

clang before 17 doesn't support `.option arch`, so the guard is required
around any code using it. You've got the guard on the other `.option
arch` user above, so that one seems to have escaped notice ;)

Going forward, it'd be good to test this stuff with non-latest clang to
make sure you appropriately consider the !Zbb cases.


> +
> +	unsigned int fold_temp;
> +
> +	if (IS_ENABLED(CONFIG_32BIT)) {
> +		asm_volatile_goto(".option push			\n\
> +		.option arch,+zbb				\n\
> +			rori	%[fold_temp], %[csum], 16	\n\
> +			andi	%[offset], %[offset], 1		\n\
> +			add	%[csum], %[fold_temp], %[csum]	\n\
> +			beq	%[offset], zero, %l[end]	\n\
> +			rev8	%[csum], %[csum]		\n\
> +			zext.h	%[csum], %[csum]		\n\
> +		.option pop"
> +			: [csum] "+r" (csum), [fold_temp] "=&r" (fold_temp)
> +			: [offset] "r" (offset)
> +			:
> +			: end);
> +
> +		return csum;
> +	} else {
> +		asm_volatile_goto(".option push			\n\
> +		.option arch,+zbb				\n\
> +			rori	%[fold_temp], %[csum], 32	\n\
> +			add	%[csum], %[fold_temp], %[csum]	\n\
> +			srli	%[csum], %[csum], 32		\n\
> +			roriw	%[fold_temp], %[csum], 16	\n\
> +			addw	%[csum], %[fold_temp], %[csum]	\n\
> +			andi	%[offset], %[offset], 1		\n\
> +			beq	%[offset], zero, %l[end]	\n\
> +			rev8	%[csum], %[csum]		\n\
> +			srli	%[csum], %[csum], 32		\n\
> +			zext.h	%[csum], %[csum]		\n\
> +		.option pop"
> +			: [csum] "+r" (csum), [fold_temp] "=&r" (fold_temp)
> +			: [offset] "r" (offset)
> +			:
> +			: end);
> +
> +		return csum;
> +	}
> +end:
> +		return csum >> 16;
> +no_zbb:
> +#ifndef CONFIG_32BIT

These can also be moved to IS_ENABLED() FYI, since there's no 32-bit
stuff here that'd break the build for 64-bit. Ditto elsewhere where
you've got similar stuff.

Cheers,
Conor.

> +		csum += (csum >> 32) | (csum << 32);
> +		csum >>= 32;
> +#endif
> +	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
> +	if (offset & 1)
> +		return (unsigned short)swab32(csum);
> +	return csum >> 16;
> +}
> 
> -- 
> 2.42.0
>

Charlie Jenkins Sept. 14, 2023, 5:58 p.m. UTC | #7

On Thu, Sep 14, 2023 at 01:25:23PM +0100, Conor Dooley wrote:
> On Mon, Sep 11, 2023 at 03:57:12PM -0700, Charlie Jenkins wrote:
> > Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> > will load from the buffer in groups of 32 bits, and when compiled for
> > 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> > csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> > the riscv generated code in QEMU, discovered that summing in a
> > tree-like structure is about 4% faster than doing 64-bit reads.
> > 
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > ---
> >  arch/riscv/include/asm/checksum.h |  11 ++
> >  arch/riscv/lib/Makefile           |   1 +
> >  arch/riscv/lib/csum.c             | 210 ++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 222 insertions(+)
> > 
> > diff --git a/arch/riscv/include/asm/checksum.h b/arch/riscv/include/asm/checksum.h
> > index 0d7fc8275a5e..a09a4053fb87 100644
> > --- a/arch/riscv/include/asm/checksum.h
> > +++ b/arch/riscv/include/asm/checksum.h
> > @@ -16,6 +16,14 @@ typedef unsigned int csum_t;
> >  typedef unsigned long csum_t;
> >  #endif
> >  
> > +/* Default version is sufficient for 32 bit */
> > +#ifdef CONFIG_64BIT
> > +#define _HAVE_ARCH_IPV6_CSUM
> > +__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> > +			const struct in6_addr *daddr,
> > +			__u32 len, __u8 proto, __wsum sum);
> > +#endif
> > +
> >  /*
> >   *	Fold a partial checksum without adding pseudo headers
> >   */
> > @@ -90,6 +98,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> >  
> >  #define ip_fast_csum ip_fast_csum
> >  
> > +extern unsigned int do_csum(const unsigned char *buff, int len);
> > +#define do_csum do_csum
> > +
> >  #include <asm-generic/checksum.h>
> >  
> >  #endif // __ASM_RISCV_CHECKSUM_H
> > diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> > index 26cb2502ecf8..2aa1a4ad361f 100644
> > --- a/arch/riscv/lib/Makefile
> > +++ b/arch/riscv/lib/Makefile
> > @@ -6,6 +6,7 @@ lib-y			+= memmove.o
> >  lib-y			+= strcmp.o
> >  lib-y			+= strlen.o
> >  lib-y			+= strncmp.o
> > +lib-y			+= csum.o
> >  lib-$(CONFIG_MMU)	+= uaccess.o
> >  lib-$(CONFIG_64BIT)	+= tishift.o
> >  lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
> > diff --git a/arch/riscv/lib/csum.c b/arch/riscv/lib/csum.c
> > new file mode 100644
> > index 000000000000..47d98c51bab2
> > --- /dev/null
> > +++ b/arch/riscv/lib/csum.c
> > @@ -0,0 +1,210 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * IP checksum library
> > + *
> > + * Influenced by arch/arm64/lib/csum.c
> > + * Copyright (C) 2023 Rivos Inc.
> > + */
> > +#include <linux/bitops.h>
> > +#include <linux/compiler.h>
> > +#include <linux/kasan-checks.h>
> > +#include <linux/kernel.h>
> > +
> > +#include <net/checksum.h>
> > +
> > +/* Default version is sufficient for 32 bit */
> > +#ifndef CONFIG_32BIT
> > +__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> > +			const struct in6_addr *daddr,
> > +			__u32 len, __u8 proto, __wsum csum)
> > +{
> > +	/*
> > +	 * Inform the compiler/processor that the operation we are performing is
> > +	 * "Commutative and Associative" by summing parts of the checksum in a
> > +	 * tree-like structure (Section 2(A) of "Computing the Internet
> > +	 * Checksum"). Furthermore, defer the overflow until the end of the
> > +	 * computation which is shown to be valid in Section 2(C)(1) of the
> > +	 * same handbook.
> > +	 */
> > +	unsigned long sum, sum1, sum2, sum3, sum4, ulen, uproto;
> > +
> > +	uproto = htonl(proto);
> > +	ulen = htonl(len);
> > +
> > +	sum   = saddr->s6_addr32[0];
> > +	sum  += saddr->s6_addr32[1];
> > +	sum1  = saddr->s6_addr32[2];
> > +	sum1 += saddr->s6_addr32[3];
> > +
> > +	sum2  = daddr->s6_addr32[0];
> > +	sum2 += daddr->s6_addr32[1];
> > +	sum3  = daddr->s6_addr32[2];
> > +	sum3 += daddr->s6_addr32[3];
> > +
> > +	sum4  = csum;
> > +	sum4 += ulen;
> > +	sum4 += uproto;
> > +
> > +	sum  += sum1;
> > +	sum2 += sum3;
> > +
> > +	sum += sum2;
> > +	sum += sum4;
> > +
> > +	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> > +	    IS_ENABLED(CONFIG_RISCV_ALTERNATIVE)) {
> > +		csum_t fold_temp;
> > +
> > +		/*
> > +		 * Zbb is likely available when the kernel is compiled with Zbb
> > +		 * support, so nop when Zbb is available and jump when Zbb is
> > +		 * not available.
> > +		 */
> > +		asm_volatile_goto(ALTERNATIVE("j %l[no_zbb]", "nop", 0,
> > +					      RISCV_ISA_EXT_ZBB, 1)
> > +				  :
> > +				  :
> > +				  :
> > +				  : no_zbb);
> > +		asm(".option push					\n\
> > +		.option arch,+zbb					\n\
> > +			rori	%[fold_temp], %[sum], 32		\n\
> > +			add	%[sum], %[fold_temp], %[sum]		\n\
> > +			srli	%[sum], %[sum], 32			\n\
> > +			not	%[fold_temp], %[sum]			\n\
> > +			roriw	%[sum], %[sum], 16			\n\
> > +			subw	%[sum], %[fold_temp], %[sum]		\n\
> > +		.option pop"
> > +		: [sum] "+r" (sum), [fold_temp] "=&r" (fold_temp));
> > +		return (__force __sum16)(sum >> 16);
> > +	}
> > +no_zbb:
> > +	sum += (sum >> 32) | (sum << 32);
> > +	sum >>= 32;
> > +	return csum_fold((__force __wsum)sum);
> > +}
> > +EXPORT_SYMBOL(csum_ipv6_magic);
> > +#endif // !CONFIG_32BIT
> > +
> > +#ifdef CONFIG_32BIT
> > +#define OFFSET_MASK 3
> > +#elif CONFIG_64BIT
> > +#define OFFSET_MASK 7
> > +#endif
> > +
> > +/*
> > + * Perform a checksum on an arbitrary memory address.
> > + * Algorithm accounts for buff being misaligned.
> > + * If buff is not aligned, will over-read bytes but not use the bytes that it
> > + * shouldn't. The same thing will occur on the tail-end of the read.
> > + */
> > +unsigned int __no_sanitize_address do_csum(const unsigned char *buff, int len)
> > +{
> > +	unsigned int offset, shift;
> > +	csum_t csum, data;
> > +	const csum_t *ptr;
> > +
> > +	if (unlikely(len <= 0))
> > +		return 0;
> > +	/*
> > +	 * To align the address, grab the whole first byte in buff.
> > +	 * Since it is inside of a same byte, it will never cross pages or cache
> > +	 * lines.
> > +	 * Directly call KASAN with the alignment we will be using.
> > +	 */
> > +	offset = (csum_t)buff & OFFSET_MASK;
> > +	kasan_check_read(buff, len);
> > +	ptr = (const csum_t *)(buff - offset);
> > +	len = len + offset - sizeof(csum_t);
> > +
> > +	/*
> > +	 * Clear the most signifant bits that were over-read if buff was not
> > +	 * aligned.
> > +	 */
> > +	shift = offset * 8;
> > +	data = *ptr;
> > +#ifdef __LITTLE_ENDIAN
> > +	data = (data >> shift) << shift;
> > +#else
> > +	data = (data << shift) >> shift;
> > +#endif
> > +	/*
> > +	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> > +	 * faster than doing 32-bit reads on architectures that support larger
> > +	 * reads.
> > +	 */
> > +	while (len > 0) {
> > +		csum += data;
> 
> arch/riscv/lib/csum.c:137:3: warning: variable 'csum' is uninitialized when used here [-Wuninitialized]
>                 csum += data;
>                 ^~~~
> arch/riscv/lib/csum.c:104:13: note: initialize the variable 'csum' to silence this warning
>         csum_t csum, data;
>                    ^
>                     = 0
> > +		csum += csum < data;
> > +		len -= sizeof(csum_t);
> > +		ptr += 1;
> > +		data = *ptr;
> > +	}
> > +
> > +	/*
> > +	 * Perform alignment (and over-read) bytes on the tail if any bytes
> > +	 * leftover.
> > +	 */
> > +	shift = len * -8;
> > +#ifdef __LITTLE_ENDIAN
> > +	data = (data << shift) >> shift;
> > +#else
> > +	data = (data >> shift) << shift;
> > +#endif
> > +	csum += data;
> > +	csum += csum < data;
> > +
> > +	if (!riscv_has_extension_likely(RISCV_ISA_EXT_ZBB))
> > +		goto no_zbb;
> 
> I think this is missing a change for IS_ENABLED(CONFIG_RISCV_ISA_ZBB)?
> arch/riscv/lib/csum.c:180:1: warning: unknown option, expected 'push', 'pop', 'rvc', 'norvc', 'relax' or 'norelax' [-Winline-asm]
>                 .option arch,+zbb                               \n\
> ^
> <inline asm>:2:11: note: instantiated into assembly here
>                 .option arch,+zbb                               
>                         ^
> arch/riscv/lib/csum.c:181:1: error: instruction requires the following: 'Zbb' (Basic Bit-Manipulation) or 'Zbkb' (Bitmanip instructions for Cryptography)
>                         rori    %[fold_temp], %[csum], 32       \n\
> ^
> <inline asm>:3:4: note: instantiated into assembly here
>                         rori    a2, a0, 32      
>                         ^
> arch/riscv/lib/csum.c:184:1: error: instruction requires the following: 'Zbb' (Basic Bit-Manipulation) or 'Zbkb' (Bitmanip instructions for Cryptography)
>                         roriw   %[fold_temp], %[csum], 16       \n\
> ^
> <inline asm>:6:4: note: instantiated into assembly here
>                         roriw   a2, a0, 16      
>                         ^
> arch/riscv/lib/csum.c:188:1: error: instruction requires the following: 'Zbb' (Basic Bit-Manipulation) or 'Zbkb' (Bitmanip instructions for Cryptography)
>                         rev8    %[csum], %[csum]                \n\
> ^
> <inline asm>:10:4: note: instantiated into assembly here
>                         rev8    a0, a0          
>                         ^
> 2 warnings and 3 errors generated.
> 
> clang before 17 doesn't support `.option arch`, so the guard is required
> around any code using it. You've got the guard on the other `.option
> arch` user above, so that one seems to have escaped notice ;)
> 
> Going forward, it'd be good to test this stuff with non-latest clang to
> make sure you appropriately consider the !Zbb cases.
> 
Yes this was an oversight to drop the guard.
> 
> > +
> > +	unsigned int fold_temp;
> > +
> > +	if (IS_ENABLED(CONFIG_32BIT)) {
> > +		asm_volatile_goto(".option push			\n\
> > +		.option arch,+zbb				\n\
> > +			rori	%[fold_temp], %[csum], 16	\n\
> > +			andi	%[offset], %[offset], 1		\n\
> > +			add	%[csum], %[fold_temp], %[csum]	\n\
> > +			beq	%[offset], zero, %l[end]	\n\
> > +			rev8	%[csum], %[csum]		\n\
> > +			zext.h	%[csum], %[csum]		\n\
> > +		.option pop"
> > +			: [csum] "+r" (csum), [fold_temp] "=&r" (fold_temp)
> > +			: [offset] "r" (offset)
> > +			:
> > +			: end);
> > +
> > +		return csum;
> > +	} else {
> > +		asm_volatile_goto(".option push			\n\
> > +		.option arch,+zbb				\n\
> > +			rori	%[fold_temp], %[csum], 32	\n\
> > +			add	%[csum], %[fold_temp], %[csum]	\n\
> > +			srli	%[csum], %[csum], 32		\n\
> > +			roriw	%[fold_temp], %[csum], 16	\n\
> > +			addw	%[csum], %[fold_temp], %[csum]	\n\
> > +			andi	%[offset], %[offset], 1		\n\
> > +			beq	%[offset], zero, %l[end]	\n\
> > +			rev8	%[csum], %[csum]		\n\
> > +			srli	%[csum], %[csum], 32		\n\
> > +			zext.h	%[csum], %[csum]		\n\
> > +		.option pop"
> > +			: [csum] "+r" (csum), [fold_temp] "=&r" (fold_temp)
> > +			: [offset] "r" (offset)
> > +			:
> > +			: end);
> > +
> > +		return csum;
> > +	}
> > +end:
> > +		return csum >> 16;
> > +no_zbb:
> > +#ifndef CONFIG_32BIT
> 
> These can also be moved to IS_ENABLED() FYI, since there's no 32-bit
> stuff here that'd break the build for 64-bit. Ditto elsewhere where
> you've got similar stuff.
> 
> Cheers,
> Conor.
This is an ifndef, so 32-bit compilation would throw a warning about
shifting by 32 bits if IS_ENABLED was used instead.

- Charlie
> 
> > +		csum += (csum >> 32) | (csum << 32);
> > +		csum >>= 32;
> > +#endif
> > +	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
> > +	if (offset & 1)
> > +		return (unsigned short)swab32(csum);
> > +	return csum >> 16;
> > +}
> > 
> > -- 
> > 2.42.0
> >

Conor Dooley Sept. 14, 2023, 6:02 p.m. UTC | #8

> > > +#ifndef CONFIG_32BIT
> > 
> > These can also be moved to IS_ENABLED() FYI, since there's no 32-bit
> > stuff here that'd break the build for 64-bit. Ditto elsewhere where
> > you've got similar stuff.
> > 
> > Cheers,
> > Conor.
> This is an ifndef, so 32-bit compilation would throw a warning about
> shifting by 32 bits if IS_ENABLED was used instead.

 Fair enough. I did accidentally invert things in my mail, I did notice
 the n, I just thought it did the elimination beforehand those checks,
 sorry for the noise.

Charlie Jenkins Sept. 14, 2023, 11:30 p.m. UTC | #9

On Thu, Sep 14, 2023 at 07:02:29PM +0100, Conor Dooley wrote:
> > > > +#ifndef CONFIG_32BIT
> > > 
> > > These can also be moved to IS_ENABLED() FYI, since there's no 32-bit
> > > stuff here that'd break the build for 64-bit. Ditto elsewhere where
> > > you've got similar stuff.
> > > 
> > > Cheers,
> > > Conor.
> > This is an ifndef, so 32-bit compilation would throw a warning about
> > shifting by 32 bits if IS_ENABLED was used instead.
> 
>  Fair enough. I did accidentally invert things in my mail, I did notice
>  the n, I just thought it did the elimination beforehand those checks,
>  sorry for the noise.

It appears that LLVM is smart enough to not attempt to compile code that
will never execute but GCC is not. Pretty interesting because it allows the
".option arch" code to be encased in IS_ENABLED because it is only not
supported on LLVM, but the shifting code needs to be in ifdef because
otherwise gcc will complain.

[v4,2/5] riscv: Add checksum library

Checks

Commit Message

Comments

Patch