Message ID | 20230704083206.693155-2-davidgow@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86: checksum: Fix unaligned checksums on < i686 | expand |
From: David Gow > Sent: 04 July 2023 09:32 > > The checksum_32 code was originally written to only handle 2-byte > aligned buffers, but was later extended to support arbitrary alignment. > However, the non-PPro variant doesn't apply the carry before jumping to > the 2- or 4-byte aligned versions, which clear CF. .... > I also tested it on a real 486DX2, with the same results. Which cpu does anyone really care about? The unrolled 'adcl' loop is horrid on intel cpu between (about) 'core' and 'haswell' because each u-op can only have two inputs and adc needs 3 - so is 2 u-ops. First fixed by summing to alternate registers. On anything modern (well I've not checked some Atom based servers) misaligned accesses are pretty near zero cost. So it really isn't worth the tests that align data. (I suspect it all got better a long time ago except for transfers that cross cache-line boundaries, with adc taking two cycles even that might be free.) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, 10 Jul 2023 at 23:01, David Laight <David.Laight@aculab.com> wrote: > > From: David Gow > > Sent: 04 July 2023 09:32 > > > > The checksum_32 code was originally written to only handle 2-byte > > aligned buffers, but was later extended to support arbitrary alignment. > > However, the non-PPro variant doesn't apply the carry before jumping to > > the 2- or 4-byte aligned versions, which clear CF. > .... > > I also tested it on a real 486DX2, with the same results. > > Which cpu does anyone really care about? > > The unrolled 'adcl' loop is horrid on intel cpu between > (about) 'core' and 'haswell' because each u-op can only > have two inputs and adc needs 3 - so is 2 u-ops. > First fixed by summing to alternate registers. > > On anything modern (well I've not checked some Atom based > servers) misaligned accesses are pretty near zero cost. > So it really isn't worth the tests that align data. > > (I suspect it all got better a long time ago except > for transfers that cross cache-line boundaries, with > adc taking two cycles even that might be free.) > I agree that the implementation here is suitably ancient far from optimal, but think the alignment issue in this patch is a separate correctness issue. (Even if it's one which is unlikely to result in real-world problems at present.) There's probably a valid argument around whether or not the checksumming code should be alignment-agnostic, or whether a 2- or 4- byte alignment is something callers have to guarantee, but since some effort had gone into making these work with unaligned data, I think it's sensible to make sure those cases actually work, and so for the KUnit test to verify that all these different alignments all give correct results. If it seems worth cleaning up / optimising the code more significantly (maybe some of the people who really care have TCP header checksum offload anyway), that seems like a separate task to me. Personally, while I care quite a bit that 32-bit x86 is still working (even on ancient CPUs), I don't really have the time to spend optimising it. (Worst-case, if maintaining it gets too rough, we could possibly fall back to the C implementations, though I haven't benchmarked them.) -- David
diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S index 23318c338db0..128287cea42d 100644 --- a/arch/x86/lib/checksum_32.S +++ b/arch/x86/lib/checksum_32.S @@ -62,6 +62,7 @@ SYM_FUNC_START(csum_partial) jl 8f movzbl (%esi), %ebx adcl %ebx, %eax + adcl $0, %eax roll $8, %eax inc %esi testl $2, %esi
The checksum_32 code was originally written to only handle 2-byte aligned buffers, but was later extended to support arbitrary alignment. However, the non-PPro variant doesn't apply the carry before jumping to the 2- or 4-byte aligned versions, which clear CF. This causes the new checksum_kunit test to fail, as it runs with a large number of different possible alignments and both with and without carries. For example: ./tools/testing/kunit/kunit.py run --arch i386 --kconfig_add CONFIG_M486=y checksum Gives: KTAP version 1 # Subtest: checksum 1..3 ok 1 test_csum_fixed_random_inputs # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267 Expected result == expec, but result == 65281 (0xff01) expec == 65280 (0xff00) not ok 2 test_csum_all_carry_inputs # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:314 Expected result == expec, but result == 65535 (0xffff) expec == 65534 (0xfffe) not ok 3 test_csum_no_carry_inputs With this patch, it passes. KTAP version 1 # Subtest: checksum 1..3 ok 1 test_csum_fixed_random_inputs ok 2 test_csum_all_carry_inputs ok 3 test_csum_no_carry_inputs I also tested it on a real 486DX2, with the same results. Signed-off-by: David Gow <davidgow@google.com> --- This is a follow-up to the UML patch to use the common 32-bit x86 checksum implementations: https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@google.com/ --- arch/x86/lib/checksum_32.S | 1 + 1 file changed, 1 insertion(+)