Message ID: 20250119225118.GA15398@sol.localdomain (mailing list archive)
State: New
Series: [GIT,PULL] CRC updates for 6.14
On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
>
> - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
>   directly accessible via the library API, instead of requiring the
>   crypto API. This is much simpler and more efficient.

I'm not a fan of the crazy crypto interfaces for simple hashes that
only complicate and slow things down, so I'm all in favor of this and
have pulled it.

HOWEVER.

I'm also very much not a fan of asking users pointless questions.

Why does this patch-set ask users idiotic questions like

  CRC-T10DIF implementation
  > 1. Architecture-optimized (CRC_T10DIF_IMPL_ARCH) (NEW)
    2. Generic implementation (CRC_T10DIF_IMPL_GENERIC) (NEW)

and

  CRC32 implementation
  > 1. Arch-optimized, with fallback to slice-by-8 (CRC32_IMPL_ARCH_PLUS_SLICEBY8) (NEW)
    2. Arch-optimized, with fallback to slice-by-1 (CRC32_IMPL_ARCH_PLUS_SLICEBY1) (NEW)
    3. Slice by 8 bytes (CRC32_IMPL_SLICEBY8) (NEW)
    4. Slice by 4 bytes (CRC32_IMPL_SLICEBY4) (NEW)
    5. Slice by 1 byte (Sarwate's algorithm) (CRC32_IMPL_SLICEBY1) (NEW)
    6. Classic Algorithm (one bit at a time) (CRC32_IMPL_BIT) (NEW)

because *nobody* wants to see that completely pointless noise.

Pick the best one. Don't ask the user to pick the best one.

If you have some really strong argument for why users need to be able
to override the sane choice, make the question at *least* depend on
EXPERT.

And honestly, I don't see how there could possibly ever be any point.
If there is an arch-optimized version, just use it.

And if the "optimized" version is crap and worse than some generic
one, it just needs to be removed.

None of this "make the user make the choice because kernel developers
can't deal with the responsibility of just saying what is best".

                Linus
The pull request you sent on Sun, 19 Jan 2025 14:51:18 -0800:
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git tags/crc-for-linus
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/37b33c68b00089a574ebd0a856a5d554eb3001b7
Thank you!
Hi Linus,

On Wed, Jan 22, 2025 at 08:13:07PM -0800, Linus Torvalds wrote:
> On Sun, 19 Jan 2025 at 14:51, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
> >   directly accessible via the library API, instead of requiring the
> >   crypto API. This is much simpler and more efficient.
>
> I'm not a fan of the crazy crypto interfaces for simple hashes that
> only complicate and slow things down, so I'm all in favor of this and
> have pulled it.
>
> [...]
>
> Pick the best one. Don't ask the user to pick the best one.
>
> [...]
>
> None of this "make the user make the choice because kernel developers
> can't deal with the responsibility of just saying what is best".
Yes, I agree, and the kconfig options are already on my list of things to
clean up. Thanks for giving your thoughts on how to do it. To clarify,
this initial set of changes removed the existing arch-specific CRC32 and
CRC-T10DIF options (on x86 that was CRYPTO_CRC32C_INTEL,
CRYPTO_CRC32_PCLMUL, and CRYPTO_CRCT10DIF_PCLMUL) and added the
equivalent functionality to two choices in lib, one of which already
existed. So for now the changes to the options were just meant to
consolidate them, not add to or remove from them per se.

I do think that to support kernel size minimization efforts we should
continue to allow omitting the arch-specific CRC code. One of the CRC
options, usually CONFIG_CRC32, gets built into almost every kernel. Some
options already group together multiple CRC variants (e.g. there are
three different CRC32s), and each can need multiple implementations
targeting different instruction set extensions (e.g. both PCLMULQDQ and
VPCLMULQDQ on x86). So it does add up.

But it makes sense to include the code by default, and make the choice to
omit it conditional on CONFIG_EXPERT.

I'm also thinking of just doing a single option that affects all enabled
CRC variants, e.g. CRC_OPTIMIZATIONS instead of both CRC32_OPTIMIZATIONS
and CRC_T10DIF_OPTIMIZATIONS. Let me know if you think that would be
reasonable.

As you probably noticed, the other problem is that CRC32 has 4 generic
implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.

Bit-by-bit is useless. Slice by 4 and slice by 8 are too similar to have
both.

It's not straightforward to choose between slice by 1 and slice by 4/8,
though. When benchmarking slice-by-n, a higher n will always be faster in
microbenchmarks (up to about n=16), but the required table size also
increases accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table,
while slice-by-8 uses an 8192-byte table. This table is accessed
randomly, which is really bad on the dcache, and can be really bad for
performance in real-world scenarios where the system is bottlenecked on
memory.

I'm tentatively planning to just say that slice-by-4 is a good enough
compromise and have that be the only generic CRC32 implementation. But I
need to try an interleaved implementation too, since it's possible that
could give the best of both worlds.

- Eric
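For reference, the slicing-by-4 technique under discussion can be sketched as follows. This is a self-contained illustration with hypothetical names (crc32_slice4 etc.), not the kernel's lib/crc32.c; it uses the standard little-endian slicing-by-N table construction for the reflected CRC32 polynomial 0xEDB88320:

```c
#include <stdint.h>
#include <stddef.h>

#define CRC32_POLY_LE 0xEDB88320u	/* reflected IEEE CRC32 polynomial */

static uint32_t tab[4][256];		/* 4 KiB total: four 256-entry slices */

static void crc32_slice4_init(void)
{
	/* tab[0] is the classic byte-wise table. */
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t r = i;
		for (int b = 0; b < 8; b++)
			r = (r >> 1) ^ (r & 1 ? CRC32_POLY_LE : 0);
		tab[0][i] = r;
	}
	/* tab[k][i] = CRC of byte i followed by k zero bytes. */
	for (int k = 1; k < 4; k++)
		for (uint32_t i = 0; i < 256; i++)
			tab[k][i] = (tab[k - 1][i] >> 8) ^
				    tab[0][tab[k - 1][i] & 0xff];
}

static uint32_t crc32_slice4(uint32_t crc, const uint8_t *p, size_t len)
{
	crc = ~crc;
	/* Main loop: consume 32 bits per iteration, four table lookups. */
	while (len >= 4) {
		uint32_t q = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
				   (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
		crc = tab[3][q & 0xff] ^ tab[2][(q >> 8) & 0xff] ^
		      tab[1][(q >> 16) & 0xff] ^ tab[0][q >> 24];
		p += 4;
		len -= 4;
	}
	/* Byte-at-a-time tail for the remaining 0-3 bytes. */
	while (len--)
		crc = (crc >> 8) ^ tab[0][(crc ^ *p++) & 0xff];
	return ~crc;
}
```

The four lookups per word hit four separate 1 KiB tables at data-dependent offsets, which is exactly the random-access dcache footprint being discussed above.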
On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote:
> As you probably noticed, the other problem is that CRC32 has 4 generic
> implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
>
> [...]
>
> I'm tentatively planning to just say that slice-by-4 is a good enough
> compromise and have that be the only generic CRC32 implementation.
>
> But I need to try an interleaved implementation too, since it's
> possible that could give the best of both worlds.

Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as
the only generic CRC32 implementation. The generic code has become
increasingly irrelevant due to the arch-optimized code existing. The
arch-optimized code tends to be 10 to 100 times faster on long messages.

The generic CRC32 code is still needed when the CPU features needed by
the arch code are unavailable. But that's rare these days. It's also
still needed when the CPU has no scalar instructions to accelerate the
CRC (e.g. on x86_64, the "regular" CRC32 as opposed to the Castagnoli
CRC32) *and* the message is too short for the overhead of saving and
restoring the vector registers to be worth it -- typically < 64 bytes or
so. And it's still needed when the CRC is done in a context where vector
registers can't be used at all. But those don't feel like very strong
motivations for the huge tables anymore.

I think the huge tables were really intended for optimizing CRCs of long
messages back when CPUs didn't have any better way to do it.

- Eric
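The slice-by-1 (Sarwate) variant referred to above needs only a single 256-entry table, i.e. 1024 bytes instead of 8192 for slice-by-8. A minimal self-contained sketch (illustrative names, not the kernel's actual code):

```c
#include <stdint.h>
#include <stddef.h>

#define CRC32_POLY_LE 0xEDB88320u	/* reflected IEEE CRC32 polynomial */

static uint32_t crc32_table[256];	/* 1024 bytes */

static void crc32_init(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t r = i;
		for (int b = 0; b < 8; b++)
			r = (r >> 1) ^ (r & 1 ? CRC32_POLY_LE : 0);
		crc32_table[i] = r;
	}
}

static uint32_t crc32_sarwate(uint32_t crc, const uint8_t *p, size_t len)
{
	crc = ~crc;
	/* One table lookup per message byte. */
	while (len--)
		crc = (crc >> 8) ^ crc32_table[(crc ^ *p++) & 0xff];
	return ~crc;
}
```

The inner loop is tiny and the table fits comfortably in L1, which is the trade-off against slice-by-n being argued for here.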
Hi Eric,

On Thu, Jan 23, 2025 at 6:16 AM Eric Biggers <ebiggers@kernel.org> wrote:
> [...]
>
> It's not straightforward to choose between slice by 1 and slice by 4/8,
> though. When benchmarking slice-by-n, a higher n will always be faster
> in microbenchmarks (up to about n=16), but the required table size also
> increases accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table,
> while slice-by-8 uses an 8192-byte table. This table is accessed
> randomly, which is really bad on the dcache, and can be really bad for
> performance in real world scenarios where the system is bottlenecked on
> memory.
>
> I'm tentatively planning to just say that slice-by-4 is a good enough
> compromise and have that be the only generic CRC32 implementation.

So I guess I want slice-by-1 on m68k. Or

    default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE

so I don't have to touch all defconfigs? ;-)

BTW, shouldn't all existing defconfigs that enable
CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
as the logic has changed (these symbols are now enabled based on
CRC32_IMPL*)?

Gr{oetje,eeting}s,

                        Geert
Hi Eric,

On Thu, Jan 23, 2025 at 9:16 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Thu, Jan 23, 2025 at 6:16 AM Eric Biggers <ebiggers@kernel.org> wrote:
> > [...]
> >
> > I'm tentatively planning to just say that slice-by-4 is a good enough
> > compromise and have that be the only generic CRC32 implementation.
>
> So I guess I want slice-by-1 on m68k. Or
>
>     default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE
>
> so I don't have to touch all defconfigs? ;-)
>
> BTW, shouldn't all existing defconfigs that enable
> CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
> as the logic has changed (these symbols are now enabled based on
> CRC32_IMPL*)?

Oh, I just realized m68k used CONFIG_CRC32_SLICEBY8=y before, as that
was the default. So all these questions are not new, just churn because
of the changed logic?

Gr{oetje,eeting}s,

                        Geert
On Thu, Jan 23, 2025 at 09:16:21AM +0100, Geert Uytterhoeven wrote:
> [...]
>
> So I guess I want slice-by-1 on m68k. Or
>
>     default CRC32_IMPL_SLICEBY1 if CONFIG_CC_OPTIMIZE_FOR_SIZE
>
> so I don't have to touch all defconfigs? ;-)

As I mentioned in my next reply, I'm actually leaning towards slice-by-1
only now.

> BTW, shouldn't all existing defconfigs that enable
> CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be updated,
> as the logic has changed (these symbols are now enabled based on
> CRC32_IMPL*)?

Yes, though I doubt that anyone who was selecting a specific generic
CRC32 implementation really put much thought into it. And if we
standardize on one implementation then the choice will go away and it
won't matter anyway.

- Eric
On Thu, Jan 23, 2025 at 09:19:53AM +0100, Geert Uytterhoeven wrote:
> [...]
>
> > BTW, shouldn't all existing defconfigs that enable
> > CONFIG_CRC32_SLICEBY[48], CONFIG_CRC32_SARWATE, or CRC32_BIT be
> > updated, as the logic has changed (these symbols are now enabled
> > based on CRC32_IMPL*)?
>
> Oh, I just realized m68k used CONFIG_CRC32_SLICEBY8=y before, as that
> was the default. So all these questions are not new, just churn because
> of the changed logic?

Yes, there was already a choice for CRC32 implementation before, with
SLICEBY8 still being the default. It just came up again because I
changed the options in the choice by adding the arch-specific options
(which previously were in a different location) to it.

- Eric
On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
>
> Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte)
> as the only generic CRC32 implementation. The generic code has become
> increasingly irrelevant due to the arch-optimized code existing. The
> arch-optimized code tends to be 10 to 100 times faster on long messages.

Yeah, that's my intuition as well; I would think the CPUs that don't
have CRC32 optimization instruction(s) would probably be the most
sensitive to dcache thrashing.

But given that Geert ran into this on m68k (I assume), maybe we could
have him benchmark the various generic crc32 implementations to see
which is the best for him? That is, assuming that he cares (which he
might not. :-)

                        - Ted
On Thu, Jan 23, 2025 at 09:07:44AM -0500, Theodore Ts'o wrote:
> [...]
>
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various generic crc32 implementations to see
> which is the best for him? That is, assuming that he cares (which he
> might not. :-)

FWIW, benchmarking the CRC library functions is easy now; just enable
CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.

But it's just a traditional benchmark that calls the functions in a
loop, and doesn't account for dcache thrashing. It's exactly the sort of
benchmark I mentioned doesn't tell the whole story about the drawbacks
of using a huge table. So focusing only on microbenchmarks of slice-by-n
generally leads to a value n > 1 seeming optimal --- potentially as high
as n=16 depending on the CPU, but really old CPUs like m68k should need
much less.

So the rationale for choosing slice-by-1 in the kernel would be to
consider the reduced dcache use and code size, and the fact that
arch-optimized code is usually used instead these days anyway, to be
more important than microbenchmark results. (Also, the other CRC
variants in the kernel like CRC64, CRC-T10DIF, CRC16, etc. already just
have slice-by-1, so this would make CRC32 consistent with them.)

- Eric
On Thu, 23 Jan 2025 at 10:18, Eric Biggers <ebiggers@kernel.org> wrote:
>
> FWIW, benchmarking the CRC library functions is easy now; just enable
> CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.
>
> But, it's just a traditional benchmark that calls the functions in a loop, and
> doesn't account for dcache thrashing.

Yeah.  I suspect the x86 vector version in particular is just not even worth it.  If you have the crc instruction, the basic arch-optimized case is presumably already pretty good (and *that* code is tiny).

Honestly, I took a quick look at the "by-4" and "by-8" cases, and considering that you still have to do per-byte lookups of the words _anyway_, I would expect that the regular by-1 is presumably not that much worse.

IOW, maybe we could try to just do the simple by-1 for the generic case, and cut the x86 version down to the simple "use crc32b" case.

And see if anybody even notices...

Linus
On Thu, 23 Jan 2025 09:07:44 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> >
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > only generic CRC32 implementation.  The generic code has become increasingly
> > irrelevant due to the arch-optimized code existing.  The arch-optimized code
> > tends to be 10 to 100 times faster on long messages.
>
> Yeah, that's my intuition as well; I would think the CPUs that
> don't have CRC32 optimization instruction(s) would probably be the
> most sensitive to dcache thrashing.
>
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various generic crc32 implementations to see
> which is the best for him?  That is, assuming that he cares (which he
> might not. :-).

The difference between the clock speed and main memory speed on an m68k will be a lot less than on anything more recent.  So I suspect the effect of cache misses is much less (or, more likely, it is pretty much always getting a cache miss).

Brain wakes up: does the m68k even have a D-cache?  Checks the m68k user manual, section 6: it only has an I-cache (64 32-bit words).  So the important thing is probably keeping the loop small.  A cpu board might have an external data cache.

For a small memory footprint it might be worth considering 4 bits at a time, so a 16 word (64 byte) lookup table.

Thinks....  You can xor a data byte onto the crc 'accumulator' and then do two separate table lookups for each of the high nibbles and xor both onto it before the rotate.  That is probably a reasonable compromise.

David
On Thu, Jan 23, 2025 at 12:52:30PM -0800, Linus Torvalds wrote: > On Thu, 23 Jan 2025 at 10:18, Eric Biggers <ebiggers@kernel.org> wrote: > > > > FWIW, benchmarking the CRC library functions is easy now; just enable > > CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y. > > > > But, it's just a traditional benchmark that calls the functions in a loop, and > > doesn't account for dcache thrashing. > > Yeah. I suspect the x86 vector version in particular is just not even > worth it. If you have the crc instruction, the basic arch-optimized > case is presumably already pretty good (and *that* code is tiny). x86 unfortunately only has an instruction for crc32c, i.e. the variant of CRC32 that uses the Castagnoli polynomial. So it works great for crc32c(). But any other variant of CRC such as the regular crc32() or crc_t10dif_update() need carryless multiplication (PCLMULQDQ) which uses the vector registers. It is super fast on sufficiently long messages, but it does use the vector registers. FWIW, arm64 has an instruction for both crc32c and crc32. And RISC-V has carryless multiplication using scalar registers. So things are a bit easier there. > Honestly, I took a quick look at the "by-4" and "by-8" cases, and > considering that you still have to do per-byte lookups of the words > _anyway_, I would expect that the regular by-1 is presumably not that > much worse. The difference I'm seeing on x86_64 (Ryzen 9 9950X) is 690 MB/s for slice-by-1 vs. 3091 MB/s for slice-by-8. It's significant since the latter gives much more instruction-level parallelism. But of course, CPUs on which it matters tend to have *much* faster arch-optimized implementations anyway. Currently the x86_64-optimized crc32c_le() is up to 43607 MB/s on the same CPU, and crc32_le() is up to 22664 MB/s. (By adding VPCLMULQDQ support we could actually achieve over 80000 MB/s.) 
The caveat is that in the [V]PCLMULQDQ case the data length has to be long enough for it to be worthwhile, but then again having an 8 KiB randomly-accessed data table just to micro-optimize short messages doesn't seem worthwhile either.

> IOW, maybe we could try to just do the simple by-1 for the generic
> case, and cut the x86 version down to the simple "use crc32b" case.
>
> And see if anybody even notices...
>
> Linus

- Eric
On Thu, Jan 23, 2025 at 08:58:10PM +0000, David Laight wrote: > On Thu, 23 Jan 2025 09:07:44 -0500 > "Theodore Ts'o" <tytso@mit.edu> wrote: > > > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote: > > > > > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the > > > only generic CRC32 implementation. The generic code has become increasingly > > > irrelevant due to the arch-optimized code existing. The arch-optimized code > > > tends to be 10 to 100 times faster on long messages. > > > > Yeah, that's my intuition as well; I would think the CPU's that > > don't have a CRC32 optimization instruction(s) would probably be the > > most sensitive to dcache thrashing. > > > > But given that Geert ran into this on m68k (I assume), maybe we could > > have him benchmark the various crc32 generic implementation to see if > > we is the best for him? That is, assuming that he cares (which he > > might not. :-). > > The difference between the clock speed and main memory speed on an m68k will > be a lot less than on anything more recent. > So I suspect the effect of cache misses is much less (or more likely it is > pretty much always getting a cache miss). > Brain wakes up, does the m68k even have a D-cache? > Checks the m68k user manual section 6 - it only has a I-cache (64 32-bit words). > So the important thing is probably keeping the loop small. > A cpu board might have an external data cache. > > For a small memory footprint it might be worth considering 4 bits at a time. > So a 16 word (64 byte) lookup table. > Thinks.... > You can xor a data byte onto the crc 'accumulator' and then do two separate > table lookups for each of the high nibbles and xor both onto it before the rotate. > That is probably a reasonable compromise. Yes, you can do less than a byte at a time (currently one of the choices is even one *bit* at a time!), but I think byte-at-a-time is small enough already. - Eric
On Thu, 23 Jan 2025 at 13:13, Eric Biggers <ebiggers@kernel.org> wrote:
>
> x86 unfortunately only has an instruction for crc32c

Yeah, but isn't that like 90% of the uses?

IOW, if you'd make the "select" statements a bit more specific, and make ext4 (and others) do "select CRC32C" instead, the other ones wouldn't even get selected?

Linus
On Thu, Jan 23, 2025 at 01:16:57PM -0800, Linus Torvalds wrote:
> On Thu, 23 Jan 2025 at 13:13, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > x86 unfortunately only has an instruction for crc32c
>
> Yeah, but isn't that like 90% of the uses?
>
> IOW, if you'd make the "select" statements a bit more specific, and
> make ext4 (and others) do "select CRC32C" instead, the other ones
> wouldn't even get selected?
>
> Linus

There's quite a lot unfortunately.  Try this, which finds uses of the regular crc32 (not crc32c):

    git grep -E '\<(crc32|crc32_le|__crc32_le)\(|"crc32"'

- Eric
On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote: > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote: > > As you probably noticed, the other problem is that CRC32 has 4 generic > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes. > > > > Bit-by-bit is useless. Slice by 4 and slice by 8 are too similar to have both. > > > > It's not straightforward to choose between slice by 1 and slice by 4/8, though. > > When benchmarking slice-by-n, a higher n will always be faster in > > microbenchmarks (up to about n=16), but the required table size also increases > > accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8 > > uses a 8192-byte table. This table is accessed randomly, which is really bad on > > the dcache, and can be really bad for performance in real world scenarios where > > the system is bottlenecked on memory. > > > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise > > and have that be the only generic CRC32 implementation. > > > > But I need to try an interleaved implementation too, since it's possible that > > could give the best of both worlds. > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the > only generic CRC32 implementation. The generic code has become increasingly > irrelevant due to the arch-optimized code existing. The arch-optimized code > tends to be 10 to 100 times faster on long messages. > > The generic CRC32 code is still needed when the CPU features needed by the arch > code are unavailable. But that's rare these days. It's also still needed when > the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too > short for the overhead of saving and restoring the vector registers to be worth > it -- typically < 64 bytes or so. And it's still needed when the CRC is done in > a context where vector registers can't be used at all. 
> > But those don't feel like very strong motivations for the huge tables anymore. > I think the huge tables were really intended for optimizing CRCs of long > messages back when CPUs didn't have any better way to do it. Have you by chance been looking at performance of 64 bit crc algorithms, or have any recommendations there? I've been starting to consider switching to a 64 bit crc for the default for bcachefs - and we do want it to be a regular crc so we can combine/add them together, not one of the newer fancier add/xor based hashes. I thought we'd gotten a faster PCLMULQDQ based implementation of a crc64, but looking again it appears not, hrm.
On Thu, 23 Jan 2025 13:16:03 -0800 Eric Biggers <ebiggers@kernel.org> wrote:
> On Thu, Jan 23, 2025 at 08:58:10PM +0000, David Laight wrote:
...
> > For a small memory footprint it might be worth considering 4 bits at a time.
> > So a 16 word (64 byte) lookup table.
> > Thinks....
> > You can xor a data byte onto the crc 'accumulator' and then do two separate
> > table lookups for each of the high nibbles and xor both onto it before the rotate.
> > That is probably a reasonable compromise.
>
> Yes, you can do less than a byte at a time (currently one of the choices is even
> one *bit* at a time!), but I think byte-at-a-time is small enough already.

I used '1 bit at a time' for a crc64 of a 5MB file.  Actually fast enough during a 'compile' phase (verified by a serial eeprom).

But the paired nibble one is something like:

	crc ^= *data++ << 24;
	crc ^= table[crc >> 28] ^ table1[(crc >> 24) & 15];
	crc = rol(crc, 8);

which isn't going to be significantly slower than the byte one, where the middle line is:

	crc ^= table[crc >> 24];

especially for a multi-issue cpu, and the table drops from 1k to 128 bytes.  That saves quite a lot of D-cache misses.  (Since you'll probably get them all twice when the program's working set is reloaded!)

Actually you need to rol() the table[]s.  Then do:

	crc = rol(crc, 8) ^ table[] ...

to reduce the register dependency chain to 5 per byte.

David
On Thu, Jan 23, 2025 at 05:36:40PM -0500, Kent Overstreet wrote: > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote: > > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote: > > > As you probably noticed, the other problem is that CRC32 has 4 generic > > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes. > > > > > > Bit-by-bit is useless. Slice by 4 and slice by 8 are too similar to have both. > > > > > > It's not straightforward to choose between slice by 1 and slice by 4/8, though. > > > When benchmarking slice-by-n, a higher n will always be faster in > > > microbenchmarks (up to about n=16), but the required table size also increases > > > accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8 > > > uses a 8192-byte table. This table is accessed randomly, which is really bad on > > > the dcache, and can be really bad for performance in real world scenarios where > > > the system is bottlenecked on memory. > > > > > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise > > > and have that be the only generic CRC32 implementation. > > > > > > But I need to try an interleaved implementation too, since it's possible that > > > could give the best of both worlds. > > > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the > > only generic CRC32 implementation. The generic code has become increasingly > > irrelevant due to the arch-optimized code existing. The arch-optimized code > > tends to be 10 to 100 times faster on long messages. > > > > The generic CRC32 code is still needed when the CPU features needed by the arch > > code are unavailable. But that's rare these days. It's also still needed when > > the CPU has no scalar instructions to accelerate the CRC (e.g. 
on x86_64, the > > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too > > short for the overhead of saving and restoring the vector registers to be worth > > it -- typically < 64 bytes or so. And it's still needed when the CRC is done in > > a context where vector registers can't be used at all. > > > > But those don't feel like very strong motivations for the huge tables anymore. > > I think the huge tables were really intended for optimizing CRCs of long > > messages back when CPUs didn't have any better way to do it. > > Have you by chance been looking at performance of 64 bit crc algorithms, > or have any recommendations there? > > I've been starting to consider switching to a 64 bit crc for the > default for bcachefs - and we do want it to be a regular crc so we can > combine/add them together, not one of the newer fancier add/xor based > hashes. > > I thought we'd gotten a faster PCLMULQDQ based implementation of a > crc64, but looking again it appears not, hrm. Yep! I've written an assembly template that expands into a PCLMULQDQ or VPCLMULQDQ implementation of any variant of CRC-8, CRC-16, CRC-32, or CRC-64. The latest work-in-progress version can be found at https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-x86. We just need to decide which CRC variants to wire it up for. My tentative plan, which is implemented in that git branch, is crc32_le, crc_t10dif, crc64_be, and crc64_nvme. (crc64_nvme is currently called crc64_rocksoft, but that name makes no sense.) For crc32_le and crc_t10dif these would replace the existing PCLMULQDQ implementations. crc64* on the other hand would gain acceleration for the first time, improving performance for those by as much as 100x. I'm also fixing the CRC64 lib to be organized the way I did CRC32 and CRC-T10DIF. That work is targeting 6.15, since there was already enough for 6.14. 
Though this approach makes it easy to wire up arbitrary CRC variants for x86, we do need to make sure that any variant we wire up is actually useful.  The users of the optimized crc64_be would be drivers/md/bcache/ and fs/bcachefs/.  So if you could confirm that it would indeed be useful for you (and in particular that you haven't deprecated the CRC64 support in favor of xxHash), that would be helpful.

Note, I haven't worked on arm64 yet, but a similar approach could be used there.

As far as performance goes, the [V]PCLMULQDQ based code performs about the same across all CRC variants.  The "le" or least-significant-bit-first CRCs are *slightly* faster than the "be" or most-significant-bit-first CRCs since they avoid a byteswap of the data, but it's a small difference.

If you happen to be changing CRC variant anyway, but still want a 64-bit CRC, crc64_nvme may be a slightly better choice than crc64_be since it is a "le" CRC and is already needed for NVMe data integrity (https://lore.kernel.org/linux-block/20220303201312.3255347-1-kbusch@kernel.org/).  But otherwise staying with crc64_be is fine.

- Eric
On Thu, Jan 23, 2025 at 03:42:09PM -0800, Eric Biggers wrote: > On Thu, Jan 23, 2025 at 05:36:40PM -0500, Kent Overstreet wrote: > > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote: > > > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote: > > > > As you probably noticed, the other problem is that CRC32 has 4 generic > > > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes. > > > > > > > > Bit-by-bit is useless. Slice by 4 and slice by 8 are too similar to have both. > > > > > > > > It's not straightforward to choose between slice by 1 and slice by 4/8, though. > > > > When benchmarking slice-by-n, a higher n will always be faster in > > > > microbenchmarks (up to about n=16), but the required table size also increases > > > > accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8 > > > > uses a 8192-byte table. This table is accessed randomly, which is really bad on > > > > the dcache, and can be really bad for performance in real world scenarios where > > > > the system is bottlenecked on memory. > > > > > > > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise > > > > and have that be the only generic CRC32 implementation. > > > > > > > > But I need to try an interleaved implementation too, since it's possible that > > > > could give the best of both worlds. > > > > > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the > > > only generic CRC32 implementation. The generic code has become increasingly > > > irrelevant due to the arch-optimized code existing. The arch-optimized code > > > tends to be 10 to 100 times faster on long messages. > > > > > > The generic CRC32 code is still needed when the CPU features needed by the arch > > > code are unavailable. But that's rare these days. It's also still needed when > > > the CPU has no scalar instructions to accelerate the CRC (e.g. 
on x86_64, the > > > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too > > > short for the overhead of saving and restoring the vector registers to be worth > > > it -- typically < 64 bytes or so. And it's still needed when the CRC is done in > > > a context where vector registers can't be used at all. > > > > > > But those don't feel like very strong motivations for the huge tables anymore. > > > I think the huge tables were really intended for optimizing CRCs of long > > > messages back when CPUs didn't have any better way to do it. > > > > Have you by chance been looking at performance of 64 bit crc algorithms, > > or have any recommendations there? > > > > I've been starting to consider switching to a 64 bit crc for the > > default for bcachefs - and we do want it to be a regular crc so we can > > combine/add them together, not one of the newer fancier add/xor based > > hashes. > > > > I thought we'd gotten a faster PCLMULQDQ based implementation of a > > crc64, but looking again it appears not, hrm. > > Yep! I've written an assembly template that expands into a PCLMULQDQ or > VPCLMULQDQ implementation of any variant of CRC-8, CRC-16, CRC-32, or CRC-64. > The latest work-in-progress version can be found at > https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-x86. > We just need to decide which CRC variants to wire it up for. My tentative plan, > which is implemented in that git branch, is crc32_le, crc_t10dif, crc64_be, and > crc64_nvme. (crc64_nvme is currently called crc64_rocksoft, but that name makes > no sense.) For crc32_le and crc_t10dif these would replace the existing > PCLMULQDQ implementations. crc64* on the other hand would gain acceleration for > the first time, improving performance for those by as much as 100x. I'm also > fixing the CRC64 lib to be organized the way I did CRC32 and CRC-T10DIF. > > That work is targeting 6.15, since there was already enough for 6.14. 
> Though this approach makes it easy to wire up arbitrary CRC variants for x86, we
> do need to make sure that any variant we wire up is actually useful.  The users of
> the optimized crc64_be would be drivers/md/bcache/ and fs/bcachefs/.  So if you
> could confirm that that would indeed be useful for you (and in particular you
> haven't deprecated the CRC64 support in favor of xxHash), that would be helpful.

That's fantastic!

CRC64 isn't deprecated, and I suspect it will be our preferred option at some point in the future.  xxHash was added only because there was a real desire for 64-bit checksums and crc64 wasn't a good option at the time.  (It was also designed as a hash function, not for data protection, and I'd want to see some actual literature evaluating it as such before I'd consider making it the default.  And like I mentioned, CRCs have other properties we want.)