Message ID | 20230920222231.686275-1-dhowells@redhat.com (mailing list archive) |
---|---|
Series | iov_iter: Convert the iterator macros into inline funcs |
From: David Howells
> Sent: 20 September 2023 23:22
...
> (8) Move the copy-and-csum code to net/ where it can be in proximity with
>     the code that uses it. This eliminates the code if CONFIG_NET=n and
>     allows for the slim possibility of it being inlined.
>
> (9) Fold memcpy_and_csum() in to its two users.
>
> (10) Move csum_and_copy_from_iter_full() out of line and merge in
>      csum_and_copy_from_iter() since the former is the only caller of the
>      latter.

I thought that the real idea behind these was to do the checksum
at the same time as the copy to avoid loading the data into the L1
data-cache twice - especially for long buffers.
I wonder how often there are multiple iov[] that actually make
it better than just checksumming the linear buffer?

I had a feeling that checksumming of UDP data was done during
copy_to/from_user, but the code can't be the copy-and-csum here
for that because it is missing support for odd-length buffers.

Intel x86 desktop chips can easily checksum at 8 bytes/clock
(but probably not with the current code!).
(I've got ~12 bytes/clock using adox and adcx but that loop
is entirely horrid and it would need run-time patching.
Especially since I think some AMD CPUs execute them very slowly.)

OTOH a 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
destination is 32 byte aligned - it pretty much won't be).

So you'd need a csum-and-copy loop that did 16 bytes every
three clocks to get the same throughput for long buffers.
In principle splitting the 'adc memory' into two instructions
is the same number of u-ops - but I'm sure I've tried to do
that and failed - and the extra memory write can happen in
parallel with everything else.
So I don't think you'll get 16 bytes in two clocks - but you
might get it in three.

OTOH for a CPU where memcpy is a code loop, summing the data in
the copy loop is likely to be a gain.

But I suspect doing the checksum and copy at the same time
got 'all too complicated' to actually implement fully.
With most modern ethernet chips checksumming receive packets,
does it really get used enough for the additional complexity?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

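The idea discussed above, summing the data while it is being copied so that each byte is loaded only once, can be sketched in portable C. This is only an illustration under stated assumptions: `copy_and_csum_sketch()` and `fold16()` are hypothetical names, the sum is a plain 16-bit ones' complement sum in network byte order, and nothing here reflects how the kernel's own routines are written or tuned.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator down to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Hypothetical helper (not the kernel function): copy len bytes from src
 * to dst and return the 16-bit ones' complement sum of the data, not
 * inverted. Each source byte is loaded once and used for both jobs.
 */
uint16_t copy_and_csum_sketch(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint8_t hi = s[i], lo = s[i + 1];

		d[i] = hi;
		d[i + 1] = lo;
		sum += ((uint32_t)hi << 8) | lo;	/* network-order 16-bit word */
	}
	if (i < len) {		/* odd length: pad the last byte with zero */
		d[i] = s[i];
		sum += (uint32_t)s[i] << 8;
	}
	return fold16(sum);
}
```

Taking the figures quoted in the mail at face value, a separate copy at 16 bytes/clock followed by a separate checksum at 8 bytes/clock costs about 1/16 + 1/8 = 3/16 clock per byte, which is where the "16 bytes every three clocks" target for a fused loop comes from.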
David Laight <David.Laight@ACULAB.COM> wrote:

> > (8) Move the copy-and-csum code to net/ where it can be in proximity with
> >     the code that uses it. This eliminates the code if CONFIG_NET=n and
> >     allows for the slim possibility of it being inlined.
> >
> > (9) Fold memcpy_and_csum() in to its two users.
> >
> > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> >      csum_and_copy_from_iter() since the former is the only caller of the
> >      latter.
>
> I thought that the real idea behind these was to do the checksum
> at the same time as the copy to avoid loading the data into the L1
> data-cache twice - especially for long buffers.
> I wonder how often there are multiple iov[] that actually make
> it better than just checksumming the linear buffer?

It also reduces the overhead for finding the data to checksum in the case the
packet gets split since we're doing the checksumming as we copy - but with a
linear buffer, that's negligible.

> I had a feeling that checksumming of UDP data was done during
> copy_to/from_user, but the code can't be the copy-and-csum here
> for that because it is missing support for odd-length buffers.

Is there a bug there?

> Intel x86 desktop chips can easily checksum at 8 bytes/clock
> (but probably not with the current code!).
> (I've got ~12 bytes/clock using adox and adcx but that loop
> is entirely horrid and it would need run-time patching.
> Especially since I think some AMD CPUs execute them very slowly.)
>
> OTOH a 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
> destination is 32 byte aligned - it pretty much won't be).
>
> So you'd need a csum-and-copy loop that did 16 bytes every
> three clocks to get the same throughput for long buffers.
> In principle splitting the 'adc memory' into two instructions
> is the same number of u-ops - but I'm sure I've tried to do
> that and failed - and the extra memory write can happen in
> parallel with everything else.
> So I don't think you'll get 16 bytes in two clocks - but you
> might get it in three.
>
> OTOH for a CPU where memcpy is a code loop, summing the data in
> the copy loop is likely to be a gain.
>
> But I suspect doing the checksum and copy at the same time
> got 'all too complicated' to actually implement fully.
> With most modern ethernet chips checksumming receive packets,
> does it really get used enough for the additional complexity?

You may be right. That's more a question for the networking folks than for
me. It's entirely possible that the checksumming code is just not used on
modern systems these days.

Maybe Willem can comment since he's the UDP maintainer?

David

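The split-packet and odd-length points raised in this exchange can be illustrated with a small userspace sketch. It sums each fragment separately and combines the partial sums; a fragment that begins at an odd byte offset has its partial sum byte-swapped first, which is the same trick the kernel's csum_block_add() relies on. The `struct frag` type and every function name below are hypothetical, and the 16-bit arithmetic is a simplification chosen for readability.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment descriptor, loosely modelled on an iovec. */
struct frag {
	const uint8_t *data;
	size_t len;
};

/* Ones' complement add of two 16-bit partial sums. */
static uint16_t add16(uint16_t a, uint16_t b)
{
	uint32_t sum = (uint32_t)a + b;

	return (uint16_t)((sum & 0xffff) + (sum >> 16));
}

/* 16-bit ones' complement sum of one buffer; an odd tail is zero padded. */
static uint16_t csum_buf16(const uint8_t *p, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)
		sum += (uint32_t)p[i] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum a packet that is split across nfrags fragments of any length. */
uint16_t csum_frags_sketch(const struct frag *frags, size_t nfrags)
{
	uint16_t total = 0;
	size_t offset = 0;	/* byte offset of the fragment in the packet */
	size_t i;

	for (i = 0; i < nfrags; i++) {
		uint16_t part = csum_buf16(frags[i].data, frags[i].len);

		/* At an odd offset the fragment's bytes sit in the opposite
		 * halves of the packet's 16-bit words, so swap the sum. */
		if (offset & 1)
			part = (uint16_t)((part << 8) | (part >> 8));
		total = add16(total, part);
		offset += frags[i].len;
	}
	return total;
}
```

With a single linear buffer the offset bookkeeping disappears, which matches the remark that the overhead is negligible in that case.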
On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@redhat.com> wrote:
>
> David Laight <David.Laight@ACULAB.COM> wrote:
>
> > > (8) Move the copy-and-csum code to net/ where it can be in proximity with
> > >     the code that uses it. This eliminates the code if CONFIG_NET=n and
> > >     allows for the slim possibility of it being inlined.
> > >
> > > (9) Fold memcpy_and_csum() in to its two users.
> > >
> > > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> > >      csum_and_copy_from_iter() since the former is the only caller of the
> > >      latter.
> >
> > I thought that the real idea behind these was to do the checksum
> > at the same time as the copy to avoid loading the data into the L1
> > data-cache twice - especially for long buffers.
> > I wonder how often there are multiple iov[] that actually make
> > it better than just checksumming the linear buffer?
>
> It also reduces the overhead for finding the data to checksum in the case the
> packet gets split since we're doing the checksumming as we copy - but with a
> linear buffer, that's negligible.
>
> > I had a feeling that checksumming of UDP data was done during
> > copy_to/from_user, but the code can't be the copy-and-csum here
> > for that because it is missing support for odd-length buffers.
>
> Is there a bug there?
>
> > Intel x86 desktop chips can easily checksum at 8 bytes/clock
> > (but probably not with the current code!).
> > (I've got ~12 bytes/clock using adox and adcx but that loop
> > is entirely horrid and it would need run-time patching.
> > Especially since I think some AMD CPUs execute them very slowly.)
> >
> > OTOH a 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
> > destination is 32 byte aligned - it pretty much won't be).
> >
> > So you'd need a csum-and-copy loop that did 16 bytes every
> > three clocks to get the same throughput for long buffers.
> > In principle splitting the 'adc memory' into two instructions
> > is the same number of u-ops - but I'm sure I've tried to do
> > that and failed - and the extra memory write can happen in
> > parallel with everything else.
> > So I don't think you'll get 16 bytes in two clocks - but you
> > might get it in three.
> >
> > OTOH for a CPU where memcpy is a code loop, summing the data in
> > the copy loop is likely to be a gain.
> >
> > But I suspect doing the checksum and copy at the same time
> > got 'all too complicated' to actually implement fully.
> > With most modern ethernet chips checksumming receive packets,
> > does it really get used enough for the additional complexity?
>
> You may be right. That's more a question for the networking folks than for
> me. It's entirely possible that the checksumming code is just not used on
> modern systems these days.
>
> Maybe Willem can comment since he's the UDP maintainer?

Perhaps these days it is more relevant to embedded systems than high
end servers.

From: Willem de Bruijn
> Sent: 23 September 2023 07:59
>
> On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@redhat.com> wrote:
> >
> > David Laight <David.Laight@ACULAB.COM> wrote:
> >
> > > > (8) Move the copy-and-csum code to net/ where it can be in proximity with
> > > >     the code that uses it. This eliminates the code if CONFIG_NET=n and
> > > >     allows for the slim possibility of it being inlined.
> > > >
> > > > (9) Fold memcpy_and_csum() in to its two users.
> > > >
> > > > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> > > >      csum_and_copy_from_iter() since the former is the only caller of the
> > > >      latter.
> > >
> > > I thought that the real idea behind these was to do the checksum
> > > at the same time as the copy to avoid loading the data into the L1
> > > data-cache twice - especially for long buffers.
> > > I wonder how often there are multiple iov[] that actually make
> > > it better than just checksumming the linear buffer?
> >
> > It also reduces the overhead for finding the data to checksum in the case the
> > packet gets split since we're doing the checksumming as we copy - but with a
> > linear buffer, that's negligible.
> >
> > > I had a feeling that checksumming of UDP data was done during
> > > copy_to/from_user, but the code can't be the copy-and-csum here
> > > for that because it is missing support for odd-length buffers.
> >
> > Is there a bug there?

No, I misread the code - I shouldn't scan patches when I'd got
a viral head cold...

...
> > You may be right. That's more a question for the networking folks than for
> > me. It's entirely possible that the checksumming code is just not used on
> > modern systems these days.
> >
> > Maybe Willem can comment since he's the UDP maintainer?
>
> Perhaps these days it is more relevant to embedded systems than high
> end servers.

The checksum and copy are done together.
I probably missed it because the function isn't passed the old
checksum (which it can pretty much process for free).
Instead the caller is adding it afterwards - which involves
an extra explicit csum_add().

The x86-64 IP checksum loops are all horrid though.
The unrolling in them is so 1990s.
With the out-of-order pipeline the memory accesses tend to take
care of themselves.
Not to mention that a whole raft of (now oldish) CPUs take two
clocks to execute 'adc'.

	David
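
The remark about the old checksum can be made concrete with a short sketch: seeding the accumulator with the previous sum gives the same result as summing from zero and then doing a separate ones' complement add, so the explicit add is avoidable work. The names `csum_seeded_sketch()` and `csum_add16()` are hypothetical; the latter is a simplified 16-bit stand-in for the kernel's csum_add(), which works on a 32-bit partial sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified 16-bit stand-in for a ones' complement csum_add(). */
uint16_t csum_add16(uint16_t a, uint16_t b)
{
	uint32_t sum = (uint32_t)a + b;

	return (uint16_t)((sum & 0xffff) + (sum >> 16));
}

/*
 * Hypothetical summing loop that takes the old checksum as its starting
 * value. The seed costs nothing extra: it is just the accumulator's
 * initial contents instead of zero.
 */
uint16_t csum_seeded_sketch(const uint8_t *p, size_t len, uint16_t seed)
{
	uint64_t sum = seed;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)		/* odd length: zero pad the last byte */
		sum += (uint32_t)p[i] << 8;
	while (sum >> 16)	/* end-around carry */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Calling `csum_seeded_sketch(buf, len, old)` is equivalent to `csum_add16(csum_seeded_sketch(buf, len, 0), old)`; the second form is the pattern the mail describes the caller as using.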