mbox series

[GIT,PULL] CRC updates for 6.14

Message ID 20250119225118.GA15398@sol.localdomain (mailing list archive)
State New
Headers show
Series [GIT,PULL] CRC updates for 6.14 | expand

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git tags/crc-for-linus

Message

Eric Biggers Jan. 19, 2025, 10:51 p.m. UTC
The following changes since commit 40384c840ea1944d7c5a392e8975ed088ecf0b37:

  Linux 6.13-rc1 (2024-12-01 14:28:56 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git tags/crc-for-linus

for you to fetch changes up to 72914faebaabd77d8a471af4662ca0b938011c49:

  MAINTAINERS: add entry for CRC library (2024-12-09 22:09:44 -0800)

----------------------------------------------------------------

- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
  directly accessible via the library API, instead of requiring the
  crypto API.  This is much simpler and more efficient.

- Convert some users such as ext4 to use the CRC32 library API instead
  of the crypto API.  More conversions like this will come later.

- Add a KUnit test that tests and benchmarks multiple CRC variants.
  Remove older, less-comprehensive tests that are made redundant by
  this.

- Add an entry to MAINTAINERS for the kernel's CRC library code.  I'm
  volunteering to maintain it.  I have additional cleanups and
  optimizations planned for future cycles.

These patches have been in linux-next since -rc1.

----------------------------------------------------------------
Eric Biggers (31):
      lib/crc32: drop leading underscores from __crc32c_le_base
      lib/crc32: improve support for arch-specific overrides
      lib/crc32: expose whether the lib is really optimized at runtime
      crypto: crc32 - don't unnecessarily register arch algorithms
      arm/crc32: expose CRC32 functions through lib
      loongarch/crc32: expose CRC32 functions through lib
      mips/crc32: expose CRC32 functions through lib
      powerpc/crc32: expose CRC32 functions through lib
      s390/crc32: expose CRC32 functions through lib
      sparc/crc32: expose CRC32 functions through lib
      x86/crc32: update prototype for crc_pcl()
      x86/crc32: update prototype for crc32_pclmul_le_16()
      x86/crc32: expose CRC32 functions through lib
      bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
      lib/crc32: make crc32c() go directly to lib
      ext4: switch to using the crc32c library
      jbd2: switch to using the crc32c library
      f2fs: switch to using the crc32 library
      scsi: target: iscsi: switch to using the crc32c library
      lib/crc-t10dif: stop wrapping the crypto API
      lib/crc-t10dif: add support for arch overrides
      crypto: crct10dif - expose arch-optimized lib function
      x86/crc-t10dif: expose CRC-T10DIF function through lib
      arm/crc-t10dif: expose CRC-T10DIF function through lib
      arm64/crc-t10dif: expose CRC-T10DIF function through lib
      powerpc/crc-t10dif: expose CRC-T10DIF function through lib
      lib/crc_kunit.c: add KUnit test suite for CRC library functions
      lib/crc16_kunit: delete obsolete crc16_kunit.c
      lib/crc32test: delete obsolete crc32test.c
      powerpc/crc: delete obsolete crc-vpmsum_test.c
      MAINTAINERS: add entry for CRC library

 MAINTAINERS                                        |  11 +
 arch/arm/Kconfig                                   |   2 +
 arch/arm/configs/milbeaut_m10v_defconfig           |   1 -
 arch/arm/configs/multi_v7_defconfig                |   1 -
 arch/arm/crypto/Kconfig                            |  25 -
 arch/arm/crypto/Makefile                           |   4 -
 arch/arm/crypto/crc32-ce-glue.c                    | 247 ------
 arch/arm/crypto/crct10dif-ce-glue.c                | 124 ---
 arch/arm/lib/Makefile                              |   6 +
 .../crct10dif-ce-core.S => lib/crc-t10dif-core.S}  |   0
 arch/arm/lib/crc-t10dif-glue.c                     |  80 ++
 .../{crypto/crc32-ce-core.S => lib/crc32-core.S}   |   5 +-
 arch/arm/lib/crc32-glue.c                          | 123 +++
 arch/arm64/Kconfig                                 |   2 +
 arch/arm64/configs/defconfig                       |   1 -
 arch/arm64/crypto/Kconfig                          |  10 -
 arch/arm64/crypto/Makefile                         |   3 -
 arch/arm64/crypto/crct10dif-ce-glue.c              | 132 ----
 arch/arm64/lib/Makefile                            |   6 +-
 .../crct10dif-ce-core.S => lib/crc-t10dif-core.S}  |   0
 arch/arm64/lib/crc-t10dif-glue.c                   |  81 ++
 arch/arm64/lib/crc32-glue.c                        |  25 +-
 arch/loongarch/Kconfig                             |   1 +
 arch/loongarch/configs/loongson3_defconfig         |   1 -
 arch/loongarch/crypto/Kconfig                      |   9 -
 arch/loongarch/crypto/Makefile                     |   2 -
 arch/loongarch/crypto/crc32-loongarch.c            | 300 --------
 arch/loongarch/lib/Makefile                        |   2 +
 arch/loongarch/lib/crc32-loongarch.c               | 135 ++++
 arch/m68k/configs/amiga_defconfig                  |   1 -
 arch/m68k/configs/apollo_defconfig                 |   1 -
 arch/m68k/configs/atari_defconfig                  |   1 -
 arch/m68k/configs/bvme6000_defconfig               |   1 -
 arch/m68k/configs/hp300_defconfig                  |   1 -
 arch/m68k/configs/mac_defconfig                    |   1 -
 arch/m68k/configs/multi_defconfig                  |   1 -
 arch/m68k/configs/mvme147_defconfig                |   1 -
 arch/m68k/configs/mvme16x_defconfig                |   1 -
 arch/m68k/configs/q40_defconfig                    |   1 -
 arch/m68k/configs/sun3_defconfig                   |   1 -
 arch/m68k/configs/sun3x_defconfig                  |   1 -
 arch/mips/Kconfig                                  |   5 +-
 arch/mips/configs/eyeq5_defconfig                  |   1 -
 arch/mips/configs/eyeq6_defconfig                  |   1 -
 arch/mips/configs/generic/32r6.config              |   2 -
 arch/mips/configs/generic/64r6.config              |   1 -
 arch/mips/crypto/Kconfig                           |   9 -
 arch/mips/crypto/Makefile                          |   2 -
 arch/mips/crypto/crc32-mips.c                      | 354 ---------
 arch/mips/lib/Makefile                             |   2 +
 arch/mips/lib/crc32-mips.c                         | 192 +++++
 arch/powerpc/Kconfig                               |   2 +
 arch/powerpc/configs/powernv_defconfig             |   2 -
 arch/powerpc/configs/ppc64_defconfig               |   3 -
 arch/powerpc/crypto/Kconfig                        |  33 -
 arch/powerpc/crypto/Makefile                       |   5 -
 arch/powerpc/crypto/crc-vpmsum_test.c              | 133 ----
 arch/powerpc/crypto/crc32c-vpmsum_glue.c           | 173 -----
 arch/powerpc/lib/Makefile                          |   6 +
 .../crc-t10dif-glue.c}                             |  69 +-
 arch/powerpc/lib/crc32-glue.c                      |  92 +++
 arch/powerpc/{crypto => lib}/crc32-vpmsum_core.S   |   0
 arch/powerpc/{crypto => lib}/crc32c-vpmsum_asm.S   |   0
 .../powerpc/{crypto => lib}/crct10dif-vpmsum_asm.S |   0
 arch/riscv/Kconfig                                 |   1 +
 arch/riscv/lib/Makefile                            |   3 +-
 arch/riscv/lib/{crc32.c => crc32-riscv.c}          |  25 +-
 arch/s390/Kconfig                                  |   1 +
 arch/s390/configs/debug_defconfig                  |   2 -
 arch/s390/configs/defconfig                        |   1 -
 arch/s390/crypto/Kconfig                           |  12 -
 arch/s390/crypto/Makefile                          |   2 -
 arch/s390/crypto/crc32-vx.c                        | 306 --------
 arch/s390/lib/Makefile                             |   3 +
 arch/s390/lib/crc32-glue.c                         |  92 +++
 arch/s390/{crypto => lib}/crc32-vx.h               |   0
 arch/s390/{crypto => lib}/crc32be-vx.c             |   0
 arch/s390/{crypto => lib}/crc32le-vx.c             |   0
 arch/sparc/Kconfig                                 |   1 +
 arch/sparc/crypto/Kconfig                          |  10 -
 arch/sparc/crypto/Makefile                         |   4 -
 arch/sparc/crypto/crc32c_glue.c                    | 184 -----
 arch/sparc/lib/Makefile                            |   2 +
 arch/sparc/lib/crc32_glue.c                        |  93 +++
 arch/sparc/{crypto => lib}/crc32c_asm.S            |   2 +-
 arch/x86/Kconfig                                   |   2 +
 arch/x86/crypto/Kconfig                            |  32 -
 arch/x86/crypto/Makefile                           |  10 -
 arch/x86/crypto/crc32-pclmul_glue.c                | 202 -----
 arch/x86/crypto/crc32c-intel_glue.c                | 250 ------
 arch/x86/crypto/crct10dif-pclmul_glue.c            | 143 ----
 arch/x86/lib/Makefile                              |   7 +
 arch/x86/lib/crc-t10dif-glue.c                     |  51 ++
 arch/x86/lib/crc32-glue.c                          | 124 +++
 .../crc32-pclmul_asm.S => lib/crc32-pclmul.S}      |  19 +-
 .../crc32c-3way.S}                                 |  63 +-
 arch/x86/{crypto => lib}/crct10dif-pcl-asm_64.S    |   0
 crypto/Kconfig                                     |   1 +
 crypto/Makefile                                    |   3 +-
 crypto/crc32_generic.c                             |   8 +-
 crypto/crc32c_generic.c                            |  12 +-
 crypto/crct10dif_common.c                          |  82 --
 crypto/crct10dif_generic.c                         |  82 +-
 drivers/target/iscsi/Kconfig                       |   4 +-
 drivers/target/iscsi/iscsi_target.c                | 153 ++--
 drivers/target/iscsi/iscsi_target_login.c          |  50 --
 drivers/target/iscsi/iscsi_target_login.h          |   1 -
 drivers/target/iscsi/iscsi_target_nego.c           |  21 +-
 fs/bcachefs/Kconfig                                |   1 +
 fs/ext4/Kconfig                                    |   3 +-
 fs/ext4/ext4.h                                     |  25 +-
 fs/ext4/super.c                                    |  15 -
 fs/f2fs/Kconfig                                    |   3 +-
 fs/f2fs/f2fs.h                                     |  20 +-
 fs/f2fs/super.c                                    |  15 -
 fs/jbd2/Kconfig                                    |   2 -
 fs/jbd2/journal.c                                  |  30 +-
 include/linux/crc-t10dif.h                         |  28 +-
 include/linux/crc32.h                              |  50 +-
 include/linux/crc32c.h                             |   7 +-
 include/linux/jbd2.h                               |  33 +-
 include/target/iscsi/iscsi_target_core.h           |   3 -
 lib/Kconfig                                        | 121 ++-
 lib/Kconfig.debug                                  |  29 +-
 lib/Makefile                                       |   4 +-
 lib/crc-t10dif.c                                   | 156 ++--
 lib/crc16_kunit.c                                  | 155 ----
 lib/crc32.c                                        |  24 +-
 lib/crc32test.c                                    | 852 ---------------------
 lib/crc_kunit.c                                    | 435 +++++++++++
 lib/libcrc32c.c                                    |  74 --
 tools/testing/selftests/arm64/fp/kernel-test.c     |   3 +-
 132 files changed, 2035 insertions(+), 4555 deletions(-)
 delete mode 100644 arch/arm/crypto/crc32-ce-glue.c
 delete mode 100644 arch/arm/crypto/crct10dif-ce-glue.c
 rename arch/arm/{crypto/crct10dif-ce-core.S => lib/crc-t10dif-core.S} (100%)
 create mode 100644 arch/arm/lib/crc-t10dif-glue.c
 rename arch/arm/{crypto/crc32-ce-core.S => lib/crc32-core.S} (98%)
 create mode 100644 arch/arm/lib/crc32-glue.c
 delete mode 100644 arch/arm64/crypto/crct10dif-ce-glue.c
 rename arch/arm64/{crypto/crct10dif-ce-core.S => lib/crc-t10dif-core.S} (100%)
 create mode 100644 arch/arm64/lib/crc-t10dif-glue.c
 delete mode 100644 arch/loongarch/crypto/crc32-loongarch.c
 create mode 100644 arch/loongarch/lib/crc32-loongarch.c
 delete mode 100644 arch/mips/crypto/crc32-mips.c
 create mode 100644 arch/mips/lib/crc32-mips.c
 delete mode 100644 arch/powerpc/crypto/crc-vpmsum_test.c
 delete mode 100644 arch/powerpc/crypto/crc32c-vpmsum_glue.c
 rename arch/powerpc/{crypto/crct10dif-vpmsum_glue.c => lib/crc-t10dif-glue.c} (50%)
 create mode 100644 arch/powerpc/lib/crc32-glue.c
 rename arch/powerpc/{crypto => lib}/crc32-vpmsum_core.S (100%)
 rename arch/powerpc/{crypto => lib}/crc32c-vpmsum_asm.S (100%)
 rename arch/powerpc/{crypto => lib}/crct10dif-vpmsum_asm.S (100%)
 rename arch/riscv/lib/{crc32.c => crc32-riscv.c} (91%)
 delete mode 100644 arch/s390/crypto/crc32-vx.c
 create mode 100644 arch/s390/lib/crc32-glue.c
 rename arch/s390/{crypto => lib}/crc32-vx.h (100%)
 rename arch/s390/{crypto => lib}/crc32be-vx.c (100%)
 rename arch/s390/{crypto => lib}/crc32le-vx.c (100%)
 delete mode 100644 arch/sparc/crypto/crc32c_glue.c
 create mode 100644 arch/sparc/lib/crc32_glue.c
 rename arch/sparc/{crypto => lib}/crc32c_asm.S (92%)
 delete mode 100644 arch/x86/crypto/crc32-pclmul_glue.c
 delete mode 100644 arch/x86/crypto/crc32c-intel_glue.c
 delete mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c
 create mode 100644 arch/x86/lib/crc-t10dif-glue.c
 create mode 100644 arch/x86/lib/crc32-glue.c
 rename arch/x86/{crypto/crc32-pclmul_asm.S => lib/crc32-pclmul.S} (95%)
 rename arch/x86/{crypto/crc32c-pcl-intel-asm_64.S => lib/crc32c-3way.S} (92%)
 rename arch/x86/{crypto => lib}/crct10dif-pcl-asm_64.S (100%)
 delete mode 100644 crypto/crct10dif_common.c
 delete mode 100644 lib/crc16_kunit.c
 delete mode 100644 lib/crc32test.c
 create mode 100644 lib/crc_kunit.c
 delete mode 100644 lib/libcrc32c.c

Comments

Linus Torvalds Jan. 23, 2025, 4:13 a.m. UTC | #1
On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
>
> - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
>   directly accessible via the library API, instead of requiring the
>   crypto API.  This is much simpler and more efficient.

I'm not a fan of the crazy crypto interfaces for simple hashes that
only complicate and slow things down, so I'm all in favor of this and
have pulled it.

HOWEVER.

I'm also very much not a fan of asking users pointless questions.

What does this patch-set ask users idiotic questions like

  CRC-T10DIF implementation
  > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
    2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)

and

  CRC32 implementation
  > 1. Arch-optimized, with fallback to slice-by-8
(CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
    2. Arch-optimized, with fallback to slice-by-1
(CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
    3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
    4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
    5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
    6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)

because *nobody* wants to see that completely pointless noise.

Pick the best one. Don't ask the user to pick the best one.

If you have some really strong argument for why users need to be able
to override the sane choice, make the question it at *least* depend on
EXPERT.

And honestly, I don't see how there could possibly ever be any point.
If there is an arch-optimized version, just use it.

And if the "optimized" version is crap and worse than some generic
one, it just needs to be removed.

None of this "make the user make the choice because kernel developers
can't deal with the responsibility of just saying what is best".

             Linus
pr-tracker-bot@kernel.org Jan. 23, 2025, 4:49 a.m. UTC | #2
The pull request you sent on Sun, 19 Jan 2025 14:51:18 -0800:

> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git tags/crc-for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/37b33c68b00089a574ebd0a856a5d554eb3001b7

Thank you!
Eric Biggers Jan. 23, 2025, 5:16 a.m. UTC | #3
Hi Linus,

On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> >   directly accessible via the library API, instead of requiring the
> >   crypto API.  This is much simpler and more efficient.
> 
> I'm not a fan of the crazy crypto interfaces for simple hashes that
> only complicate and slow things down, so I'm all in favor of this and
> have pulled it.
> 
> HOWEVER.
> 
> I'm also very much not a fan of asking users pointless questions.
> 
> What does this patch-set ask users idiotic questions like
> 
>   CRC-T10DIF implementation
>   > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
>     2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)
> 
> and
> 
>   CRC32 implementation
>   > 1. Arch-optimized, with fallback to slice-by-8
> (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
>     2. Arch-optimized, with fallback to slice-by-1
> (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
>     3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
>     4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
>     5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
>     6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)
> 
> because *nobody* wants to see that completely pointless noise.
> 
> Pick the best one. Don't ask the user to pick the best one.
> 
> If you have some really strong argument for why users need to be able
> to override the sane choice, make the question it at *least* depend on
> EXPERT.
> 
> And honestly, I don't see how there could possibly ever be any point.
> If there is an arch-optimized version, just use it.
> 
> And if the "optimized" version is crap and worse than some generic
> one, it just needs to be removed.
> 
> None of this "make the user make the choice because kernel developers
> can't deal with the responsibility of just saying what is best".

Yes, I agree, and the kconfig options are already on my list of things to clean
up.  Thanks for giving your thoughts on how to do it.  To be clarify, this
initial set of changes removed the existing arch-specific CRC32 and CRC-T10DIF
options (on x86 that was CRYPTO_CRC32C_INTEL, CRYPTO_CRC32_PCLMUL, and
CRYPTO_CRCT10DIF_PCLMUL) and added the equivalent functionality to two choices
in lib, one of which already existed.  So for now the changes to the options
were just meant to consolidate them, not add to or remove from them per se.

I do think that to support kernel size minimization efforts we should continue
to allow omitting the arch-specific CRC code.  One of the CRC options, usually
CONFIG_CRC32, gets built into almost every kernel.  Some options already group
together multiple CRC variants (e.g. there are three different CRC32's), and
each can need multiple implementations targeting different instruction set
extensions (e.g. both PCLMULQDQ and VPCLMULQDQ on x86).  So it does add up.

But it makes sense to make the code be included by default, and make the choice
to omit it be conditional on CONFIG_EXPERT.

I'm also thinking of just doing a single option that affects all enabled CRC
variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS and
CRC_T10DIF_OPTIMIZATIONS.  Let me know if you think that would be reasonable.

As you probably noticed, the other problem is that CRC32 has 4 generic
implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.

Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.

It's not straightforward to choose between slice by 1 and slice by 4/8, though.
When benchmarking slice-by-n, a higher n will always be faster in
microbenchmarks (up to about n=16), but the required table size also increases
accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
uses a 8192-byte table.  This table is accessed randomly, which is really bad on
the dcache, and can be really bad for performance in real world scenarios where
the system is bottlenecked on memory.

I'm tentatively planning to just say that slice-by-4 is a good enough compromise
and have that be the only generic CRC32 implementation.

But I need to try an interleaved implementation too, since it's possible that
could give the best of both worlds.

- Eric
Eric Biggers Jan. 23, 2025, 7:46 a.m. UTC | #4
On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote:
> As you probably noticed, the other problem is that CRC32 has 4 generic
> implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> 
> Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> 
> It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> When benchmarking slice-by-n, a higher n will always be faster in
> microbenchmarks (up to about n=16), but the required table size also increases
> accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> the dcache, and can be really bad for performance in real world scenarios where
> the system is bottlenecked on memory.
> 
> I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> and have that be the only generic CRC32 implementation.
> 
> But I need to try an interleaved implementation too, since it's possible that
> could give the best of both worlds.

Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
only generic CRC32 implementation.  The generic code has become increasingly
irrelevant due to the arch-optimized code existing.  The arch-optimized code
tends to be 10 to 100 times faster on long messages.

The generic CRC32 code is still needed when the CPU features needed by the arch
code are unavailable.  But that's rare these days.  It's also still needed when
the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the
"regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too
short for the overhead of saving and restoring the vector registers to be worth
it -- typically < 64 bytes or so.  And it's still needed when the CRC is done in
a context where vector registers can't be used at all.

But those don't feel like very strong motivations for the huge tables anymore.
I think the huge tables were really intended for optimizing CRCs of long
messages back when CPUs didn't have any better way to do it.

- Eric
Geert Uytterhoeven Jan. 23, 2025, 8:16 a.m. UTC | #5
Hi Eric,

On Thu, Jan 23, 2025 at 6:16 AM Eric Biggers <ebiggers@kernel.org> wrote:
> On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> > On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> > >   directly accessible via the library API, instead of requiring the
> > >   crypto API.  This is much simpler and more efficient.
> >
> > I'm not a fan of the crazy crypto interfaces for simple hashes that
> > only complicate and slow things down, so I'm all in favor of this and
> > have pulled it.
> >
> > HOWEVER.
> >
> > I'm also very much not a fan of asking users pointless questions.
> >
> > What does this patch-set ask users idiotic questions like
> >
> >   CRC-T10DIF implementation
> >   > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
> >     2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)
> >
> > and
> >
> >   CRC32 implementation
> >   > 1. Arch-optimized, with fallback to slice-by-8
> > (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
> >     2. Arch-optimized, with fallback to slice-by-1
> > (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
> >     3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
> >     4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
> >     5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
> >     6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)
> >
> > because *nobody* wants to see that completely pointless noise.
> >
> > Pick the best one. Don't ask the user to pick the best one.
> >
> > If you have some really strong argument for why users need to be able
> > to override the sane choice, make the question it at *least* depend on
> > EXPERT.
> >
> > And honestly, I don't see how there could possibly ever be any point.
> > If there is an arch-optimized version, just use it.
> >
> > And if the "optimized" version is crap and worse than some generic
> > one, it just needs to be removed.
> >
> > None of this "make the user make the choice because kernel developers
> > can't deal with the responsibility of just saying what is best".
>
> Yes, I agree, and the kconfig options are already on my list of things to clean
> up.  Thanks for giving your thoughts on how to do it.  To be clarify, this
> initial set of changes removed the existing arch-specific CRC32 and CRC-T10DIF
> options (on x86 that was CRYPTO_CRC32C_INTEL, CRYPTO_CRC32_PCLMUL, and
> CRYPTO_CRCT10DIF_PCLMUL) and added the equivalent functionality to two choices
> in lib, one of which already existed.  So for now the changes to the options
> were just meant to consolidate them, not add to or remove from them per se.
>
> I do think that to support kernel size minimization efforts we should continue
> to allow omitting the arch-specific CRC code.  One of the CRC options, usually
> CONFIG_CRC32, gets built into almost every kernel.  Some options already group
> together multiple CRC variants (e.g. there are three different CRC32's), and
> each can need multiple implementations targeting different instruction set
> extensions (e.g. both PCLMULQDQ and VPCLMULQDQ on x86).  So it does add up.
>
> But it makes sense to make the code be included by default, and make the choice
> to omit it be conditional on CONFIG_EXPERT.
>
> I'm also thinking of just doing a single option that affects all enabled CRC
> variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS and
> CRC_T10DIF_OPTIMIZATIONS.  Let me know if you think that would be reasonable.
>
> As you probably noticed, the other problem is that CRC32 has 4 generic
> implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
>
> Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
>
> It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> When benchmarking slice-by-n, a higher n will always be faster in
> microbenchmarks (up to about n=16), but the required table size also increases
> accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> the dcache, and can be really bad for performance in real world scenarios where
> the system is bottlenecked on memory.
>
> I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> and have that be the only generic CRC32 implementation.

So I guess I want slice-by-1 on m68k. Or

    default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE

so I don't have to touch all defconfigs? ;-)

BTW, shouldn't all existing defconfigs that enable
CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
as the logic has changed (these symbols are now enabled based on
CRC32_IMPL*)?

Gr{oetje,eeting}s,

                        Geert
Geert Uytterhoeven Jan. 23, 2025, 8:19 a.m. UTC | #6
Hi Eric,

On Thu, Jan 23, 2025 at 9:16 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Thu, Jan 23, 2025 at 6:16 AM Eric Biggers <ebiggers@kernel.org> wrote:
> > On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> > > On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
> > > > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> > > >   directly accessible via the library API, instead of requiring the
> > > >   crypto API.  This is much simpler and more efficient.
> > >
> > > I'm not a fan of the crazy crypto interfaces for simple hashes that
> > > only complicate and slow things down, so I'm all in favor of this and
> > > have pulled it.
> > >
> > > HOWEVER.
> > >
> > > I'm also very much not a fan of asking users pointless questions.
> > >
> > > What does this patch-set ask users idiotic questions like
> > >
> > >   CRC-T10DIF implementation
> > >   > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
> > >     2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)
> > >
> > > and
> > >
> > >   CRC32 implementation
> > >   > 1. Arch-optimized, with fallback to slice-by-8
> > > (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
> > >     2. Arch-optimized, with fallback to slice-by-1
> > > (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
> > >     3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
> > >     4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
> > >     5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
> > >     6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)
> > >
> > > because *nobody* wants to see that completely pointless noise.
> > >
> > > Pick the best one. Don't ask the user to pick the best one.
> > >
> > > If you have some really strong argument for why users need to be able
> > > to override the sane choice, make the question it at *least* depend on
> > > EXPERT.
> > >
> > > And honestly, I don't see how there could possibly ever be any point.
> > > If there is an arch-optimized version, just use it.
> > >
> > > And if the "optimized" version is crap and worse than some generic
> > > one, it just needs to be removed.
> > >
> > > None of this "make the user make the choice because kernel developers
> > > can't deal with the responsibility of just saying what is best".
> >
> > Yes, I agree, and the kconfig options are already on my list of things to clean
> > up.  Thanks for giving your thoughts on how to do it.  To be clarify, this
> > initial set of changes removed the existing arch-specific CRC32 and CRC-T10DIF
> > options (on x86 that was CRYPTO_CRC32C_INTEL, CRYPTO_CRC32_PCLMUL, and
> > CRYPTO_CRCT10DIF_PCLMUL) and added the equivalent functionality to two choices
> > in lib, one of which already existed.  So for now the changes to the options
> > were just meant to consolidate them, not add to or remove from them per se.
> >
> > I do think that to support kernel size minimization efforts we should continue
> > to allow omitting the arch-specific CRC code.  One of the CRC options, usually
> > CONFIG_CRC32, gets built into almost every kernel.  Some options already group
> > together multiple CRC variants (e.g. there are three different CRC32's), and
> > each can need multiple implementations targeting different instruction set
> > extensions (e.g. both PCLMULQDQ and VPCLMULQDQ on x86).  So it does add up.
> >
> > But it makes sense to make the code be included by default, and make the choice
> > to omit it be conditional on CONFIG_EXPERT.
> >
> > I'm also thinking of just doing a single option that affects all enabled CRC
> > variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS and
> > CRC_T10DIF_OPTIMIZATIONS.  Let me know if you think that would be reasonable.
> >
> > As you probably noticed, the other problem is that CRC32 has 4 generic
> > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> >
> > Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> >
> > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > When benchmarking slice-by-n, a higher n will always be faster in
> > microbenchmarks (up to about n=16), but the required table size also increases
> > accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> > the dcache, and can be really bad for performance in real world scenarios where
> > the system is bottlenecked on memory.
> >
> > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > and have that be the only generic CRC32 implementation.
>
> So I guess I want slice-by-1 on m68k. Or
>
>     default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE
>
> so I don't have to touch all defconfigs? ;-)
>
> BTW, shouldn't all existing defconfigs that enable
> CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
> as the logic has changed (these symbols are now enabled based on
> CRC32_IMPL*)?

Oh, I just realized m68k used CONFIG_CRC32_SLICEBY8=y before, as that
was the default. So all these questions are not new, just churn because
of the changed logic?

Gr{oetje,eeting}s,

                        Geert
Eric Biggers Jan. 23, 2025, 8:22 a.m. UTC | #7
On Thu, Jan 23, 2025 at 09:16:21AM +0100, Geert Uytterhoeven wrote:
> Hi Eric,
> 
> On Thu, Jan 23, 2025 at 6:16 AM Eric Biggers <ebiggers@kernel.org> wrote:
> > On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> > > On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
> > > >
> > > > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> > > >   directly accessible via the library API, instead of requiring the
> > > >   crypto API.  This is much simpler and more efficient.
> > >
> > > I'm not a fan of the crazy crypto interfaces for simple hashes that
> > > only complicate and slow things down, so I'm all in favor of this and
> > > have pulled it.
> > >
> > > HOWEVER.
> > >
> > > I'm also very much not a fan of asking users pointless questions.
> > >
> > > What does this patch-set ask users idiotic questions like
> > >
> > >   CRC-T10DIF implementation
> > >   > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
> > >     2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)
> > >
> > > and
> > >
> > >   CRC32 implementation
> > >   > 1. Arch-optimized, with fallback to slice-by-8
> > > (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
> > >     2. Arch-optimized, with fallback to slice-by-1
> > > (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
> > >     3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
> > >     4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
> > >     5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
> > >     6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)
> > >
> > > because *nobody* wants to see that completely pointless noise.
> > >
> > > Pick the best one. Don't ask the user to pick the best one.
> > >
> > > If you have some really strong argument for why users need to be able
> > > to override the sane choice, make the question it at *least* depend on
> > > EXPERT.
> > >
> > > And honestly, I don't see how there could possibly ever be any point.
> > > If there is an arch-optimized version, just use it.
> > >
> > > And if the "optimized" version is crap and worse than some generic
> > > one, it just needs to be removed.
> > >
> > > None of this "make the user make the choice because kernel developers
> > > can't deal with the responsibility of just saying what is best".
> >
> > Yes, I agree, and the kconfig options are already on my list of things to clean
> > up.  Thanks for giving your thoughts on how to do it.  To be clarify, this
> > initial set of changes removed the existing arch-specific CRC32 and CRC-T10DIF
> > options (on x86 that was CRYPTO_CRC32C_INTEL, CRYPTO_CRC32_PCLMUL, and
> > CRYPTO_CRCT10DIF_PCLMUL) and added the equivalent functionality to two choices
> > in lib, one of which already existed.  So for now the changes to the options
> > were just meant to consolidate them, not add to or remove from them per se.
> >
> > I do think that to support kernel size minimization efforts we should continue
> > to allow omitting the arch-specific CRC code.  One of the CRC options, usually
> > CONFIG_CRC32, gets built into almost every kernel.  Some options already group
> > together multiple CRC variants (e.g. there are three different CRC32's), and
> > each can need multiple implementations targeting different instruction set
> > extensions (e.g. both PCLMULQDQ and VPCLMULQDQ on x86).  So it does add up.
> >
> > But it makes sense to make the code be included by default, and make the choice
> > to omit it be conditional on CONFIG_EXPERT.
> >
> > I'm also thinking of just doing a single option that affects all enabled CRC
> > variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS and
> > CRC_T10DIF_OPTIMIZATIONS.  Let me know if you think that would be reasonable.
> >
> > As you probably noticed, the other problem is that CRC32 has 4 generic
> > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> >
> > Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> >
> > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > When benchmarking slice-by-n, a higher n will always be faster in
> > microbenchmarks (up to about n=16), but the required table size also increases
> > accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> > the dcache, and can be really bad for performance in real world scenarios where
> > the system is bottlenecked on memory.
> >
> > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > and have that be the only generic CRC32 implementation.
> 
> So I guess I want slice-by-1 on m68k. Or
> 
>     default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE
> 
> so I don't have to touch all defconfigs? ;-)

As I mentioned in my next reply I'm actually leaning towards slice-by-1 only
now.

> BTW, shouldn't all existing defconfigs that enable
> CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
> as the logic has changed (these symbols are now enabled based on
> CRC32_IMPL*)?

Yes, though I doubt that anyone who was selecting a specific generic CRC32
implementation really put much thought into it.  And if we standardize on one
implementation then the choice will go away and it won't matter anyway.

- Eric
Eric Biggers Jan. 23, 2025, 8:26 a.m. UTC | #8
On Thu, Jan 23, 2025 at 09:19:53AM +0100, Geert Uytterhoeven wrote:
> Hi Eric,
> 
> On Thu, Jan 23, 2025 at 9:16 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> > On Thu, Jan 23, 2025 at 6:16 AM Eric Biggers <ebiggers@kernel.org> wrote:
> > > On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> > > > On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
> > > > > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> > > > >   directly accessible via the library API, instead of requiring the
> > > > >   crypto API.  This is much simpler and more efficient.
> > > >
> > > > I'm not a fan of the crazy crypto interfaces for simple hashes that
> > > > only complicate and slow things down, so I'm all in favor of this and
> > > > have pulled it.
> > > >
> > > > HOWEVER.
> > > >
> > > > I'm also very much not a fan of asking users pointless questions.
> > > >
> > > > What does this patch-set ask users idiotic questions like
> > > >
> > > >   CRC-T10DIF implementation
> > > >   > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
> > > >     2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)
> > > >
> > > > and
> > > >
> > > >   CRC32 implementation
> > > >   > 1. Arch-optimized, with fallback to slice-by-8
> > > > (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
> > > >     2. Arch-optimized, with fallback to slice-by-1
> > > > (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
> > > >     3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
> > > >     4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
> > > >     5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
> > > >     6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)
> > > >
> > > > because *nobody* wants to see that completely pointless noise.
> > > >
> > > > Pick the best one. Don't ask the user to pick the best one.
> > > >
> > > > If you have some really strong argument for why users need to be able
> > > > to override the sane choice, make the question it at *least* depend on
> > > > EXPERT.
> > > >
> > > > And honestly, I don't see how there could possibly ever be any point.
> > > > If there is an arch-optimized version, just use it.
> > > >
> > > > And if the "optimized" version is crap and worse than some generic
> > > > one, it just needs to be removed.
> > > >
> > > > None of this "make the user make the choice because kernel developers
> > > > can't deal with the responsibility of just saying what is best".
> > >
> > > Yes, I agree, and the kconfig options are already on my list of things to clean
> > > up.  Thanks for giving your thoughts on how to do it.  To be clarify, this
> > > initial set of changes removed the existing arch-specific CRC32 and CRC-T10DIF
> > > options (on x86 that was CRYPTO_CRC32C_INTEL, CRYPTO_CRC32_PCLMUL, and
> > > CRYPTO_CRCT10DIF_PCLMUL) and added the equivalent functionality to two choices
> > > in lib, one of which already existed.  So for now the changes to the options
> > > were just meant to consolidate them, not add to or remove from them per se.
> > >
> > > I do think that to support kernel size minimization efforts we should continue
> > > to allow omitting the arch-specific CRC code.  One of the CRC options, usually
> > > CONFIG_CRC32, gets built into almost every kernel.  Some options already group
> > > together multiple CRC variants (e.g. there are three different CRC32's), and
> > > each can need multiple implementations targeting different instruction set
> > > extensions (e.g. both PCLMULQDQ and VPCLMULQDQ on x86).  So it does add up.
> > >
> > > But it makes sense to make the code be included by default, and make the choice
> > > to omit it be conditional on CONFIG_EXPERT.
> > >
> > > I'm also thinking of just doing a single option that affects all enabled CRC
> > > variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS and
> > > CRC_T10DIF_OPTIMIZATIONS.  Let me know if you think that would be reasonable.
> > >
> > > As you probably noticed, the other problem is that CRC32 has 4 generic
> > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> > >
> > > Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> > >
> > > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > > When benchmarking slice-by-n, a higher n will always be faster in
> > > microbenchmarks (up to about n=16), but the required table size also increases
> > > accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > > uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> > > the dcache, and can be really bad for performance in real world scenarios where
> > > the system is bottlenecked on memory.
> > >
> > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > > and have that be the only generic CRC32 implementation.
> >
> > So I guess I want slice-by-1 on m68k. Or
> >
> >     default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE
> >
> > so I don't have to touch all defconfigs? ;-)
> >
> > BTW, shouldn't all existing defconfigs that enable
> > CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
> > as the logic has changed (these symbols are now enabled based on
> > CRC32_IMPL*)?
> 
> Oh, I just realized m68k used CONFIG_CRC32_SLICEBY8=y before, as that
> was the default. So all these questions are not new, just churn because
> of the changed logic?
> 
> Gr{oetje,eeting}s,

Yes, there was already a choice for CRC32 implementation before, with SLICEBY8
still being the default.  It just came up again because I changed the options in
the choice by adding the arch-specific options (which previously were in a
different location) to it.

- Eric
Theodore Ts'o Jan. 23, 2025, 2:07 p.m. UTC | #9
On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> 
> Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> only generic CRC32 implementation.  The generic code has become increasingly
> irrelevant due to the arch-optimized code existing.  The arch-optimized code
> tends to be 10 to 100 times faster on long messages.

Yeah, that's my intuition as well; I would think the CPU's that
don't have a CRC32 optimization instruction(s) would probably be the
most sensitive to dcache thrashing.

But given that Geert ran into this on m68k (I assume), maybe we could
have him benchmark the various crc32 generic implementation to see if
we is the best for him?  That is, assuming that he cares (which he
might not. :-).

						- Ted
Eric Biggers Jan. 23, 2025, 6:18 p.m. UTC | #10
On Thu, Jan 23, 2025 at 09:07:44AM -0500, Theodore Ts'o wrote:
> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > 
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > only generic CRC32 implementation.  The generic code has become increasingly
> > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > tends to be 10 to 100 times faster on long messages.
> 
> Yeah, that's my intuition as well; I would think the CPU's that
> don't have a CRC32 optimization instruction(s) would probably be the
> most sensitive to dcache thrashing.
> 
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various crc32 generic implementation to see if
> we is the best for him?  That is, assuming that he cares (which he
> might not. :-).

FWIW, benchmarking the CRC library functions is easy now; just enable
CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.

But, it's just a traditional benchmark that calls the functions in a loop, and
doesn't account for dcache thrashing.  It's exactly the sort of benchmark I
mentioned doesn't tell the whole story about the drawbacks of using a huge
table.  So focusing only on microbenchmarks of slice-by-n generally leads to a
value n > 1 seeming optimal --- potentially as high as n=16 depending on the
CPU, but really old CPUs like m68k should need much less.  So the rationale of
choosing "slice-by-1" in the kernel would be to consider the reduced dcache use
and code size, and the fact that arch-optimized code is usually used instead
these days anyway, to be more important than microbenchmark results.  (And also
the other CRC variants in the kernel like CRC64, CRC-T10DIF, CRC16, etc. already
just have slice-by-1, so this would make CRC32 consistent with that.)

- Eric
Linus Torvalds Jan. 23, 2025, 8:52 p.m. UTC | #11
On Thu, 23 Jan 2025 at 10:18, Eric Biggers <ebiggers@kernel.org> wrote:
>
> FWIW, benchmarking the CRC library functions is easy now; just enable
> CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.
>
> But, it's just a traditional benchmark that calls the functions in a loop, and
> doesn't account for dcache thrashing.

Yeah. I suspect the x86 vector version in particular is just not even
worth it. If you have the crc instruction, the basic arch-optimized
case is presumably already pretty good (and *that* code is tiny).

Honestly, I took a quick look at the "by-4" and "by-8" cases, and
considering that you still have to do per-byte lookups of the words
_anyway_, I would expect that the regular by-1 is presumably not that
much worse.

IOW, maybe we could try to just do the simple by-1 for the generic
case, and cut the x86 version down to the simple "use crc32b" case.
And see if anybody even notices...

              Linus
David Laight Jan. 23, 2025, 8:58 p.m. UTC | #12
On Thu, 23 Jan 2025 09:07:44 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > 
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > only generic CRC32 implementation.  The generic code has become increasingly
> > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > tends to be 10 to 100 times faster on long messages.  
> 
> Yeah, that's my intuition as well; I would think the CPU's that
> don't have a CRC32 optimization instruction(s) would probably be the
> most sensitive to dcache thrashing.
> 
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various crc32 generic implementation to see if
> we is the best for him?  That is, assuming that he cares (which he
> might not. :-).

The difference between the clock speed and main memory speed on an m68k will
be a lot less than on anything more recent.
So I suspect the effect of cache misses is much less (or more likely it is
pretty much always getting a cache miss).
Brain wakes up, does the m68k even have a D-cache?
Checks the m68k user manual section 6 - it only has a I-cache (64 32-bit words).
So the important thing is probably keeping the loop small.
A cpu board might have an external data cache.

For a small memory footprint it might be worth considering 4 bits at a time.
So a 16 word (64 byte) lookup table.
Thinks....
You can xor a data byte onto the crc 'accumulator' and then do two separate
table lookups for each of the high nibbles and xor both onto it before the rotate.
That is probably a reasonable compromise.

	David
Eric Biggers Jan. 23, 2025, 9:13 p.m. UTC | #13
On Thu, Jan 23, 2025 at 12:52:30PM -0800, Linus Torvalds wrote:
> On Thu, 23 Jan 2025 at 10:18, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > FWIW, benchmarking the CRC library functions is easy now; just enable
> > CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.
> >
> > But, it's just a traditional benchmark that calls the functions in a loop, and
> > doesn't account for dcache thrashing.
> 
> Yeah. I suspect the x86 vector version in particular is just not even
> worth it. If you have the crc instruction, the basic arch-optimized
> case is presumably already pretty good (and *that* code is tiny).

x86 unfortunately only has an instruction for crc32c, i.e. the variant of CRC32
that uses the Castagnoli polynomial.  So it works great for crc32c().  But any
other variant of CRC such as the regular crc32() or crc_t10dif_update() need
carryless multiplication (PCLMULQDQ) which uses the vector registers.  It is
super fast on sufficiently long messages, but it does use the vector registers.

FWIW, arm64 has an instruction for both crc32c and crc32.  And RISC-V has
carryless multiplication using scalar registers.  So things are a bit easier
there.

> Honestly, I took a quick look at the "by-4" and "by-8" cases, and
> considering that you still have to do per-byte lookups of the words
> _anyway_, I would expect that the regular by-1 is presumably not that
> much worse.

The difference I'm seeing on x86_64 (Ryzen 9 9950X) is 690 MB/s for slice-by-1
vs. 3091 MB/s for slice-by-8.  It's significant since the latter gives much more
instruction-level parallelism.  But of course, CPUs on which it matters tend to
have *much* faster arch-optimized implementations anyway.  Currently the
x86_64-optimized crc32c_le() is up to 43607 MB/s on the same CPU, and crc32_le()
is up to 22664 MB/s.  (By adding VPCLMULQDQ support we could actually achieve
over 80000 MB/s.)  The caveat is that in the [V]PCLMULQDQ case the data length
has to be long enough for it to be worthwhile, but then again having a 8 KiB
randomly-accessed data table just to micro-optimize short messages seems not too
worthwhile.

> IOW, maybe we could try to just do the simple by-1 for the generic
> case, and cut the x86 version down to the simple "use crc32b" case.
> And see if anybody even notices...
> 
>               Linus

- Eric
Eric Biggers Jan. 23, 2025, 9:16 p.m. UTC | #14
On Thu, Jan 23, 2025 at 08:58:10PM +0000, David Laight wrote:
> On Thu, 23 Jan 2025 09:07:44 -0500
> "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > > 
> > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > > only generic CRC32 implementation.  The generic code has become increasingly
> > > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > > tends to be 10 to 100 times faster on long messages.  
> > 
> > Yeah, that's my intuition as well; I would think the CPU's that
> > don't have a CRC32 optimization instruction(s) would probably be the
> > most sensitive to dcache thrashing.
> > 
> > But given that Geert ran into this on m68k (I assume), maybe we could
> > have him benchmark the various crc32 generic implementation to see if
> > we is the best for him?  That is, assuming that he cares (which he
> > might not. :-).
> 
> The difference between the clock speed and main memory speed on an m68k will
> be a lot less than on anything more recent.
> So I suspect the effect of cache misses is much less (or more likely it is
> pretty much always getting a cache miss).
> Brain wakes up, does the m68k even have a D-cache?
> Checks the m68k user manual section 6 - it only has a I-cache (64 32-bit words).
> So the important thing is probably keeping the loop small.
> A cpu board might have an external data cache.
> 
> For a small memory footprint it might be worth considering 4 bits at a time.
> So a 16 word (64 byte) lookup table.
> Thinks....
> You can xor a data byte onto the crc 'accumulator' and then do two separate
> table lookups for each of the high nibbles and xor both onto it before the rotate.
> That is probably a reasonable compromise.

Yes, you can do less than a byte at a time (currently one of the choices is even
one *bit* at a time!), but I think byte-at-a-time is small enough already.

- Eric
Linus Torvalds Jan. 23, 2025, 9:16 p.m. UTC | #15
On Thu, 23 Jan 2025 at 13:13, Eric Biggers <ebiggers@kernel.org> wrote:
>
> x86 unfortunately only has an instruction for crc32c

Yeah, but isn't that like 90% of the uses?

IOW, if you'd make the "select" statements a bit more specific, and
make ext4 (and others) do "select CRC32C" instead, the other ones
wouldn't even get selected?

              Linus
Eric Biggers Jan. 23, 2025, 9:22 p.m. UTC | #16
On Thu, Jan 23, 2025 at 01:16:57PM -0800, Linus Torvalds wrote:
> On Thu, 23 Jan 2025 at 13:13, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > x86 unfortunately only has an instruction for crc32c
> 
> Yeah, but isn't that like 90% of the uses?
> 
> IOW, if you'd make the "select" statements a bit more specific, and
> make ext4 (and others) do "select CRC32C" instead, the other ones
> wouldn't even get selected?
> 
>               Linus

There's quite a lot unfortunately.  Try this which finds the regular crc32 (not
crc32c):

    git grep -E '\<(crc32|crc32_le|__crc32_le)\(|"crc32"'

- Eric
Kent Overstreet Jan. 23, 2025, 10:36 p.m. UTC | #17
On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote:
> > As you probably noticed, the other problem is that CRC32 has 4 generic
> > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> > 
> > Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> > 
> > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > When benchmarking slice-by-n, a higher n will always be faster in
> > microbenchmarks (up to about n=16), but the required table size also increases
> > accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> > the dcache, and can be really bad for performance in real world scenarios where
> > the system is bottlenecked on memory.
> > 
> > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > and have that be the only generic CRC32 implementation.
> > 
> > But I need to try an interleaved implementation too, since it's possible that
> > could give the best of both worlds.
> 
> Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> only generic CRC32 implementation.  The generic code has become increasingly
> irrelevant due to the arch-optimized code existing.  The arch-optimized code
> tends to be 10 to 100 times faster on long messages.
> 
> The generic CRC32 code is still needed when the CPU features needed by the arch
> code are unavailable.  But that's rare these days.  It's also still needed when
> the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the
> "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too
> short for the overhead of saving and restoring the vector registers to be worth
> it -- typically < 64 bytes or so.  And it's still needed when the CRC is done in
> a context where vector registers can't be used at all.
> 
> But those don't feel like very strong motivations for the huge tables anymore.
> I think the huge tables were really intended for optimizing CRCs of long
> messages back when CPUs didn't have any better way to do it.

Have you by chance been looking at performance of 64 bit crc algorithms,
or have any recommendations there?

I've been starting to consider switching to a 64 bit crc for the
default for bcachefs - and we do want it to be a regular crc so we can
combine/add them together, not one of the newer fancier add/xor based
hashes.

I thought we'd gotten a faster PCLMULQDQ based implementation of a
crc64, but looking again it appears not, hrm.
David Laight Jan. 23, 2025, 11:17 p.m. UTC | #18
On Thu, 23 Jan 2025 13:16:03 -0800
Eric Biggers <ebiggers@kernel.org> wrote:

> On Thu, Jan 23, 2025 at 08:58:10PM +0000, David Laight wrote:
...
> > For a small memory footprint it might be worth considering 4 bits at a time.
> > So a 16 word (64 byte) lookup table.
> > Thinks....
> > You can xor a data byte onto the crc 'accumulator' and then do two separate
> > table lookups for each of the high nibbles and xor both onto it before the rotate.
> > That is probably a reasonable compromise.  
> 
> Yes, you can do less than a byte at a time (currently one of the choices is even
> one *bit* at a time!), but I think byte-at-a-time is small enough already.

I used '1 bit at a time' for a crc64 of a 5MB file.
Actually fast enough during a 'compile' phase (verified by a serial eeprom).

But the paired nibble one is something like:
	crc ^= *data++ << 24;
	crc ^= table[crc >> 28] ^ table1[(crc >> 24) & 15];
	crc = rol(crc, 8);
which isn't going to be significantly slower than the byte one
where the middle line is:	
	crc ^= table[crc >> 24];
especially for a multi-issue cpu,
and the table drops from 1k to 128 bytes.
That is quite a lot of D-cache misses.
(Since you'll probably get them all twice when the program's working
set is reloaded!)

Actually you need to rol() the table[]s.
Then do:
	crc = rol(crc, 8) ^ table[] ...
to reduce the register dependency chain to 5 per byte.

	David
Eric Biggers Jan. 23, 2025, 11:42 p.m. UTC | #19
On Thu, Jan 23, 2025 at 05:36:40PM -0500, Kent Overstreet wrote:
> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote:
> > > As you probably noticed, the other problem is that CRC32 has 4 generic
> > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> > > 
> > > Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> > > 
> > > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > > When benchmarking slice-by-n, a higher n will always be faster in
> > > microbenchmarks (up to about n=16), but the required table size also increases
> > > accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > > uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> > > the dcache, and can be really bad for performance in real world scenarios where
> > > the system is bottlenecked on memory.
> > > 
> > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > > and have that be the only generic CRC32 implementation.
> > > 
> > > But I need to try an interleaved implementation too, since it's possible that
> > > could give the best of both worlds.
> > 
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > only generic CRC32 implementation.  The generic code has become increasingly
> > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > tends to be 10 to 100 times faster on long messages.
> > 
> > The generic CRC32 code is still needed when the CPU features needed by the arch
> > code are unavailable.  But that's rare these days.  It's also still needed when
> > the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the
> > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too
> > short for the overhead of saving and restoring the vector registers to be worth
> > it -- typically < 64 bytes or so.  And it's still needed when the CRC is done in
> > a context where vector registers can't be used at all.
> > 
> > But those don't feel like very strong motivations for the huge tables anymore.
> > I think the huge tables were really intended for optimizing CRCs of long
> > messages back when CPUs didn't have any better way to do it.
> 
> Have you by chance been looking at performance of 64 bit crc algorithms,
> or have any recommendations there?
> 
> I've been starting to consider switching to a 64 bit crc for the
> default for bcachefs - and we do want it to be a regular crc so we can
> combine/add them together, not one of the newer fancier add/xor based
> hashes.
> 
> I thought we'd gotten a faster PCLMULQDQ based implementation of a
> crc64, but looking again it appears not, hrm.

Yep!  I've written an assembly template that expands into a PCLMULQDQ or
VPCLMULQDQ implementation of any variant of CRC-8, CRC-16, CRC-32, or CRC-64.
The latest work-in-progress version can be found at
https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-x86.
We just need to decide which CRC variants to wire it up for.  My tentative plan,
which is implemented in that git branch, is crc32_le, crc_t10dif, crc64_be, and
crc64_nvme.  (crc64_nvme is currently called crc64_rocksoft, but that name makes
no sense.)  For crc32_le and crc_t10dif these would replace the existing
PCLMULQDQ implementations.  crc64* on the other hand would gain acceleration for
the first time, improving performance for those by as much as 100x.  I'm also
fixing the CRC64 lib to be organized the way I did CRC32 and CRC-T10DIF.

That work is targeting 6.15, since there was already enough for 6.14.

Though this approach makes it easy to wire up arbitrary CRC variants for x86, we
do need to make sure that anyone we wire up is actually useful.  The users of
the optimized crc64_be would be drivers/md/bcache/ and fs/bcachefs/.  So if you
could confirm that that would indeed be useful for you (and in particular you
haven't deprecated the CRC64 support in favor of xxHash), that would be helpful.

Note, I haven't worked on arm64 yet, but a similar approach could be used there.

As far as performance goes, the [V]PCLMULQDQ based code performs about the same
across all CRC variants.  The "le" or least-significant-bit first CRCs are
*slightly* faster than the "be" or most-significant-bit first CRCs since they
avoid a byteswap of the data, but it's a small difference.  If you happen to be
changing CRC variant anyway, but still want a 64-bit CRC, crc64_nvme may be a
slightly better choice than crc64_be it is a "le" CRC and is already needed for
NVME data integrity
(https://lore.kernel.org/linux-block/20220303201312.3255347-1-kbusch@kernel.org/).
But otherwise staying with crc64_be is fine.

- Eric
Kent Overstreet Jan. 24, 2025, 12:32 a.m. UTC | #20
On Thu, Jan 23, 2025 at 03:42:09PM -0800, Eric Biggers wrote:
> On Thu, Jan 23, 2025 at 05:36:40PM -0500, Kent Overstreet wrote:
> > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote:
> > > > As you probably noticed, the other problem is that CRC32 has 4 generic
> > > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> > > > 
> > > > Bit-by-bit is useless.  Slice by 4 and slice by 8 are too similar to have both.
> > > > 
> > > > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > > > When benchmarking slice-by-n, a higher n will always be faster in
> > > > microbenchmarks (up to about n=16), but the required table size also increases
> > > > accordingly.  E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > > > uses a 8192-byte table.  This table is accessed randomly, which is really bad on
> > > > the dcache, and can be really bad for performance in real world scenarios where
> > > > the system is bottlenecked on memory.
> > > > 
> > > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > > > and have that be the only generic CRC32 implementation.
> > > > 
> > > > But I need to try an interleaved implementation too, since it's possible that
> > > > could give the best of both worlds.
> > > 
> > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > > only generic CRC32 implementation.  The generic code has become increasingly
> > > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > > tends to be 10 to 100 times faster on long messages.
> > > 
> > > The generic CRC32 code is still needed when the CPU features needed by the arch
> > > code are unavailable.  But that's rare these days.  It's also still needed when
> > > the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the
> > > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too
> > > short for the overhead of saving and restoring the vector registers to be worth
> > > it -- typically < 64 bytes or so.  And it's still needed when the CRC is done in
> > > a context where vector registers can't be used at all.
> > > 
> > > But those don't feel like very strong motivations for the huge tables anymore.
> > > I think the huge tables were really intended for optimizing CRCs of long
> > > messages back when CPUs didn't have any better way to do it.
> > 
> > Have you by chance been looking at performance of 64 bit crc algorithms,
> > or have any recommendations there?
> > 
> > I've been starting to consider switching to a 64 bit crc for the
> > default for bcachefs - and we do want it to be a regular crc so we can
> > combine/add them together, not one of the newer fancier add/xor based
> > hashes.
> > 
> > I thought we'd gotten a faster PCLMULQDQ based implementation of a
> > crc64, but looking again it appears not, hrm.
> 
> Yep!  I've written an assembly template that expands into a PCLMULQDQ or
> VPCLMULQDQ implementation of any variant of CRC-8, CRC-16, CRC-32, or CRC-64.
> The latest work-in-progress version can be found at
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-x86.
> We just need to decide which CRC variants to wire it up for.  My tentative plan,
> which is implemented in that git branch, is crc32_le, crc_t10dif, crc64_be, and
> crc64_nvme.  (crc64_nvme is currently called crc64_rocksoft, but that name makes
> no sense.)  For crc32_le and crc_t10dif these would replace the existing
> PCLMULQDQ implementations.  crc64* on the other hand would gain acceleration for
> the first time, improving performance for those by as much as 100x.  I'm also
> fixing the CRC64 lib to be organized the way I did CRC32 and CRC-T10DIF.
> 
> That work is targeting 6.15, since there was already enough for 6.14.
> 
> Though this approach makes it easy to wire up arbitrary CRC variants for x86, we
> do need to make sure that anyone we wire up is actually useful.  The users of
> the optimized crc64_be would be drivers/md/bcache/ and fs/bcachefs/.  So if you
> could confirm that that would indeed be useful for you (and in particular you
> haven't deprecated the CRC64 support in favor of xxHash), that would be helpful.

That's fantastic!

CRC64 isn't deprecated, and I suspect that will be our preferred option
at some point in the future. xxHash was added only because there is a
real desire for 64 bit checksums and crc64 wasn't a good option at the
time.

(It was also designed as a hash function, not for data protection, and
I'd want to see some actual literature evaluating it as such before I'd
consider making it the default. And like I mentioned, CRCs have other
properties we want).