Message ID | 20230816120741.534415-3-dhowells@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iov_iter: Convert the iterator macros into inline funcs | expand |
From: David Howells > Sent: Wednesday, August 16, 2023 1:08 PM Given: > iter->copy_mc is only used with a bvec iterator and only by > dump_emit_page() in fs/coredump.c so rather than handle this in > memcpy_from_iter_mc() where it is checked repeatedly by _copy_from_iter() and > copy_page_from_iter_atomic(), Instead of adding the extra check to every copy: > +static __always_inline > +size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i) > +{ > + if (unlikely(iov_iter_is_copy_mc(i))) > + return __copy_from_iter_mc(addr, bytes, i); > + return iterate_and_advance(i, bytes, addr, > + copy_from_user_iter, memcpy_from_iter); > } Couldn't the relevant code directly call __copy_frtom_iter_mc() ? Or a version then checked iov_is_copy_mc() and then fell back to the standard function. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
David Laight <David.Laight@ACULAB.COM> wrote: > > Couldn't the relevant code directly call __copy_from_iter_mc() ? > Or a version then checked iov_is_copy_mc() and then fell > back to the standard function. No, because the marked iterator is handed by the coredump code to __kernel_write_iter() and thence on to who-knows-what driver - which will call copy_from_iter() or some such. $DRIVER shouldn't need to know about ->copy_mc. One thing I do wonder about, though, is what should happen if they call, say, csum_and_copy_from_iter()? That doesn't have an _mc variant. Or what if they extract the pages and operate directly on those? David
From: David Howells > Sent: Wednesday, August 16, 2023 2:01 PM > > David Laight <David.Laight@ACULAB.COM> wrote: > > > > > Couldn't the relevant code directly call __copy_from_iter_mc() ? > > Or a version then checked iov_is_copy_mc() and then fell > > back to the standard function. > > No, because the marked iterator is handed by the coredump code to > __kernel_write_iter() and thence on to who-knows-what driver - which will call > copy_from_iter() or some such. $DRIVER shouldn't need to know about > ->copy_mc. What about ITER_BVEC_MC ?? David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Wed, 16 Aug 2023 at 14:19, David Laight <David.Laight@aculab.com> wrote: > > What about ITER_BVEC_MC ?? That probably would be the best option. Just make it a proper ITER_xyz, instead of an odd sub-case for one ITER (but set up in such a way that it looks like it might happen for other ITER_xyz cases). Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > What about ITER_BVEC_MC ?? > > That probably would be the best option. Just make it a proper > ITER_xyz, instead of an odd sub-case for one ITER (but set up in such > a way that it looks like it might happen for other ITER_xyz cases). I'm not sure that buys us anything. It would then require every call to iov_iter_is_bvec()[*] to check for two values instead of one - including in iterate_and_advance() - and *still* we'd have to have the special-casing in _copy_from_iter() and copy_page_from_iter_atomic(). The issue is that ITER_xyz changes the iteration function - but we don't actually want to do that; rather, we need to change the step function. David [*] There's a bunch of them outside of iov_iter.c.
On Wed, 16 Aug 2023 at 22:35, David Howells <dhowells@redhat.com> wrote: > > I'm not sure that buys us anything. It would then require every call to > iov_iter_is_bvec()[*] to check for two values instead of one Well, that part is trivially fixable, and we should do that anyway for other reasons. See the attached patch. > The issue is that ITER_xyz changes the iteration function - but we don't > actually want to do that; rather, we need to change the step function. Yeah, that may be the fundamental issue. But making the ITER_xyz flags be bit masks would help - partly exactly because it makes it so trivial yo say "for this set of ITER_xyz, do this". This patch only does that for the 'user_backed' thing, which was a similar case. Hmm? Linus
> This patch only does that for the 'user_backed' thing, which was a similar case.
Yes, having two fields that have to be set to match is a recipe
for disaster.
Although I'm not sure the bit-fields really help.
There are 8 bytes at the start of the structure, might as well
use them :-)
OTOH the 'nofault' and 'copy_mc' flags could be put into much
higher bits of a 32bit value.
Alternatively put both/all the USER values first.
Then an ordered compare could be used.
If everything is actually inlined could the 'iter' be passed
through to the step() functions?
Although is might be better to pass a cached copy of iter->iter_type
(although that might just end up spilling to stack.)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, 17 Aug 2023 at 10:42, David Laight <David.Laight@aculab.com> wrote: > > Although I'm not sure the bit-fields really help. > There are 8 bytes at the start of the structure, might as well > use them :-) Actuallyç I wrote the patch that way because it seems to improve code generation. The bitfields are generally all set together as just plain one-time constants at initialization time, and gcc sees that it's a full byte write. And the reason 'data_source' is not a bitfield is that it's not a constant at iov_iter init time (it's an argument to all the init functions), so having that one as a separate byte at init time is good for code generation when you don't need to mask bits or anything like that. And once initialized, having things be dense and doing all the compares with a bitwise 'and' instead of doing them as some value compare again tends to generate good code. Then being able to test multiple bits at the same time is just gravy on top of that (ie that whole "remove user_backed, because it's easier to just test the bit combination"). > OTOH the 'nofault' and 'copy_mc' flags could be put into much > higher bits of a 32bit value. Once you start doing that, you often get bigger constants in the code stream. I didn't do any *extensive* testing of the code generation, but the stuff I looked at looked good. Linus
From: Linus Torvalds > Sent: Thursday, August 17, 2023 3:38 PM > > On Thu, 17 Aug 2023 at 10:42, David Laight <David.Laight@aculab.com> wrote: > > > > Although I'm not sure the bit-fields really help. > > There are 8 bytes at the start of the structure, might as well > > use them :-) > > Actuallyç I wrote the patch that way because it seems to improve code > generation. > > The bitfields are generally all set together as just plain one-time > constants at initialization time, and gcc sees that it's a full byte > write. I've just spent too long on godbolt (again) :-) Fiddling with: #define t1 unsigned char struct b { t1 b1:7; t1 b2:1; }; void ff(struct b *,int); void ff1(void) { struct b b = {.b1=3, .b2 = 1}; ff(&b, sizeof b); } gcc for x86-64 make a pigs-breakfast when the bitfields are 'char' and loads the constant from memory using pc-relative access. Otherwise pretty must all variants (with or without the bitfield) get initialised in a single write. (Although gcc seems to insist on loading a 32bit constant into %eax.) I can well imagine that keeping the constant below 32768 will help on architectures that have to construct large constants. > And the reason 'data_source' is not a bitfield is that it's not > a constant at iov_iter init time (it's an argument to all the init > functions), so having that one as a separate byte at init time is good > for code generation when you don't need to mask bits or anything like > that. > > And once initialized, having things be dense and doing all the > compares with a bitwise 'and' instead of doing them as some value > compare again tends to generate good code. > > Then being able to test multiple bits at the same time is just gravy > on top of that (ie that whole "remove user_backed, because it's easier > to just test the bit combination"). Indeed, they used to be bits but never got tested together. > > OTOH the 'nofault' and 'copy_mc' flags could be put into much > > higher bits of a 32bit value. > > Once you start doing that, you often get bigger constants in the code stream. I wasn't thinking of using 'really big' values :-) Even 32768 can be an issue because some cpu sign extend all constants. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, 17 Aug 2023 at 17:16, David Laight <David.Laight@aculab.com> wrote: > > gcc for x86-64 make a pigs-breakfast when the bitfields are 'char' > and loads the constant from memory using pc-relative access. I think your godbolt tests must be with some other model than what the kernel uses. For example, for me, iov_iter_init generates testl %esi, %esi # direction test movb $1, (%rdi) # bitfields movq $0, 8(%rdi) movq %rdx, 16(%rdi) movq %r8, 24(%rdi) movq %rcx, 32(%rdi) setne 1(%rdi) # set the direction byte with my patch for me. Which is pretty much optimal. *Without& the patch, I get movzwl .LC1(%rip), %eax testl %esi, %esi movb $0, (%rdi) movb $1, 4(%rdi) movw %ax, 1(%rdi) movq $0, 8(%rdi) movq %rdx, 16(%rdi) movq %r8, 24(%rdi) movq %rcx, 32(%rdi) setne 3(%rdi) which is that disgusting "move two bytes from memory", and makes absolutely no sense as a way to "write 2 zero bytes": .LC1: .byte 0 .byte 0 I think that's some odd gcc bug, actually. > Otherwise pretty must all variants (with or without the bitfield) > get initialised in a single write. So there may well be some odd gcc code generation issue that is triggered by the fact that we use an initializer to set those things, and we then have two bytes (with my patch) followed by a hole, or three bytes (without it) followed by a hole. But that bitfield thing improves things at least for me. I think the example code you used for godbolt is actually something else than what the kernel does. Linus
From: Linus Torvalds > Sent: Thursday, August 17, 2023 4:31 PM ... > movzwl .LC1(%rip), %eax > testl %esi, %esi > movb $0, (%rdi) > movb $1, 4(%rdi) > movw %ax, 1(%rdi) > movq $0, 8(%rdi) > movq %rdx, 16(%rdi) > movq %r8, 24(%rdi) > movq %rcx, 32(%rdi) > setne 3(%rdi) > > which is that disgusting "move two bytes from memory", and makes > absolutely no sense as a way to "write 2 zero bytes": > > .LC1: > .byte 0 > .byte 0 > > I think that's some odd gcc bug, actually. I get that with some code, but not others. Seems to depend on random other stuff. Happens for: struct { unsigned char x:7, y:1; }; but not if I add anything after if (that gets zeroed). Which seems to be the opposite of what you see. If I use explicit assignments (rather than an initialiser) I still get merged writes (even if not a bitfield) but also lose the memory access. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Would it make sense to always check for MCE in _copy_from_iter() and always return a short transfer if we encounter one? It looks pretty cheap in terms of code size as the exception table stuff handles it, so we don't need to do anything in the normal path. I guess this would change the handling of memory errors and DAX errors. David --- iov_iter: Always handle MCE in _copy_to_iter() (incomplete) --- arch/x86/include/asm/mce.h | 22 ++++++++++++++++++++++ fs/coredump.c | 1 - include/linux/uio.h | 16 ---------------- lib/iov_iter.c | 17 +++++------------ 4 files changed, 27 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index 180b1cbfcc4e..ee3ff090360d 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -353,4 +353,26 @@ static inline void mce_hygon_feature_init(struct cpuinfo_x86 *c) { return mce_am unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len); +static __always_inline __must_check +unsigned long memcpy_mc(void *to, const void *from, unsigned long len) +{ +#ifdef CONFIG_ARCH_HAS_COPY_MC + /* + * If CPU has FSRM feature, use 'rep movs'. + * Otherwise, use rep_movs_alternative. + */ + asm volatile( + "1:\n\t" + ALTERNATIVE("rep movsb", + "call rep_movs_alternative", ALT_NOT(X86_FEATURE_FSRM)) + "2:\n" + _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_DEFAULT_MCE_SAFE) + :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT + : : "memory", "rax", "r8", "r9", "r10", "r11"); +#else + return memcpy(to, from, len); +#endif + return len; +} + #endif /* _ASM_X86_MCE_H */ diff --git a/fs/coredump.c b/fs/coredump.c index 9d235fa14ab9..ad54102a5e14 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -884,7 +884,6 @@ static int dump_emit_page(struct coredump_params *cprm, struct page *page) pos = file->f_pos; bvec_set_page(&bvec, page, PAGE_SIZE, 0); iov_iter_bvec(&iter, ITER_SOURCE, &bvec, 1, PAGE_SIZE); - iov_iter_set_copy_mc(&iter); n = __kernel_write_iter(cprm->file, &iter, &pos); if (n != PAGE_SIZE) return 0; diff --git a/include/linux/uio.h b/include/linux/uio.h index 42bce38a8e87..73078ba297b7 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -40,7 +40,6 @@ struct iov_iter_state { struct iov_iter { u8 iter_type; - bool copy_mc; bool nofault; bool data_source; bool user_backed; @@ -252,22 +251,8 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i); #ifdef CONFIG_ARCH_HAS_COPY_MC size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i); -static inline void iov_iter_set_copy_mc(struct iov_iter *i) -{ - i->copy_mc = true; -} - -static inline bool iov_iter_is_copy_mc(const struct iov_iter *i) -{ - return i->copy_mc; -} #else #define _copy_mc_to_iter _copy_to_iter -static inline void iov_iter_set_copy_mc(struct iov_iter *i) { } -static inline bool iov_iter_is_copy_mc(const struct iov_iter *i) -{ - return false; -} #endif size_t iov_iter_zero(size_t bytes, struct iov_iter *); @@ -382,7 +367,6 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction, WARN_ON(direction & ~(READ | WRITE)); *i = (struct iov_iter) { .iter_type = ITER_UBUF, - .copy_mc = false, .user_backed = true, .data_source = direction, .ubuf = buf, diff --git a/lib/iov_iter.c b/lib/iov_iter.c index d282fd4d348f..887d9cb9be4e 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -14,6 +14,7 @@ #include <linux/scatterlist.h> #include <linux/instrumented.h> #include <linux/iov_iter.h> +#include <asm/mce.h> static __always_inline size_t copy_to_user_iter(void __user *iter_to, size_t progress, @@ -168,7 +169,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction, WARN_ON(direction & ~(READ | WRITE)); *i = (struct iov_iter) { .iter_type = ITER_IOVEC, - .copy_mc = false, .nofault = false, .user_backed = true, .data_source = direction, @@ -254,14 +254,11 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i) EXPORT_SYMBOL_GPL(_copy_mc_to_iter); #endif /* CONFIG_ARCH_HAS_COPY_MC */ -static size_t memcpy_from_iter_mc(void *iter_from, size_t progress, - size_t len, void *to, void *priv2) +static __always_inline +size_t memcpy_from_iter_mc(void *iter_from, size_t progress, + size_t len, void *to, void *priv2) { - struct iov_iter *iter = priv2; - - if (iov_iter_is_copy_mc(iter)) - return copy_mc_to_kernel(to + progress, iter_from, len); - return memcpy_from_iter(iter_from, progress, len, to, priv2); + return memcpy_mc(to + progress, iter_from, len); } size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i) @@ -632,7 +629,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, WARN_ON(direction & ~(READ | WRITE)); *i = (struct iov_iter){ .iter_type = ITER_KVEC, - .copy_mc = false, .data_source = direction, .kvec = kvec, .nr_segs = nr_segs, @@ -649,7 +645,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction, WARN_ON(direction & ~(READ | WRITE)); *i = (struct iov_iter){ .iter_type = ITER_BVEC, - .copy_mc = false, .data_source = direction, .bvec = bvec, .nr_segs = nr_segs, @@ -678,7 +673,6 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, BUG_ON(direction & ~1); *i = (struct iov_iter) { .iter_type = ITER_XARRAY, - .copy_mc = false, .data_source = direction, .xarray = xarray, .xarray_start = start, @@ -702,7 +696,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count) BUG_ON(direction != READ); *i = (struct iov_iter){ .iter_type = ITER_DISCARD, - .copy_mc = false, .data_source = false, .count = count, .iov_offset = 0
Linus Torvalds <torvalds@linux-foundation.org> wrote: > Well, that part is trivially fixable, and we should do that anyway for > other reasons. > .. > enum iter_type { > /* iter types */ > - ITER_IOVEC, > - ITER_KVEC, > - ITER_BVEC, > - ITER_XARRAY, > - ITER_DISCARD, > - ITER_UBUF, > + ITER_IOVEC = 1, > + ITER_UBUF = 2, > + ITER_KVEC = 4, > + ITER_BVEC = 8, > + ITER_XARRAY = 16, > + ITER_DISCARD = 32, > }; It used to be this way, but Al switched it: 8cd54c1c848031a87820e58d772166ffdf8c08c0 iov_iter: separate direction from flavour David
From: David Howells > Sent: Friday, August 18, 2023 12:43 PM > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Well, that part is trivially fixable, and we should do that anyway for > > other reasons. > > .. > > enum iter_type { > > /* iter types */ > > - ITER_IOVEC, > > - ITER_KVEC, > > - ITER_BVEC, > > - ITER_XARRAY, > > - ITER_DISCARD, > > - ITER_UBUF, > > + ITER_IOVEC = 1, > > + ITER_UBUF = 2, > > + ITER_KVEC = 4, > > + ITER_BVEC = 8, > > + ITER_XARRAY = 16, > > + ITER_DISCARD = 32, That could be zero - no bits and default. > > }; > > It used to be this way, but Al switched it: > > 8cd54c1c848031a87820e58d772166ffdf8c08c0 > iov_iter: separate direction from flavour Except it also had the direction flag inside the enum. That caused its own piles of grief. IIRC Linus had type:6 - that doesn't leave any headroom for additional types (even though they shouldn't proliferate). It may be best to avoid bits 15+ (in a bitfield) due to issues with large constants and sign extension. On x86 (I think) 'and immediate' and 'bit test' are the same size for bit 0 to 7, BIT wins for higher bits. gcc generates strange code for some initialisers (see yesterday's thread) and you definitely mustn't leave unused bits in a bitfield. Might be better is the fields are assigned later! (I also saw clang carefully preserving %eax on stabck!) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Aug 18, 2023 at 12:16:23PM +0000, David Laight wrote: > > > + ITER_IOVEC = 1, > > > + ITER_UBUF = 2, > > > + ITER_KVEC = 4, > > > + ITER_BVEC = 8, > > > + ITER_XARRAY = 16, > > > + ITER_DISCARD = 32, > > IIRC Linus had type:6 - that doesn't leave any headroom > for additional types (even though they shouldn't proliferate). I have proposed an ITER_KBUF in the past (it is to KVEC as UBUF is to IOVEC). I didn't care enough to keep pushing it, but it's clearly a common idiom.
From: Matthew Wilcox > Sent: Friday, August 18, 2023 1:27 PM > > On Fri, Aug 18, 2023 at 12:16:23PM +0000, David Laight wrote: > > > > + ITER_IOVEC = 1, > > > > + ITER_UBUF = 2, > > > > + ITER_KVEC = 4, > > > > + ITER_BVEC = 8, > > > > + ITER_XARRAY = 16, > > > > + ITER_DISCARD = 32, > > > > IIRC Linus had type:6 - that doesn't leave any headroom > > for additional types (even though they shouldn't proliferate). > > I have proposed an ITER_KBUF in the past (it is to KVEC as UBUF is > to IOVEC). I didn't care enough to keep pushing it, but it's clearly > a common idiom. Indeed, I didn't spot UBUF going in - I spot a lot of stuff. I did wonder if you could optimise for a vector length of 1 (inside the KVEC conditional). That would also pick up the cases where there only happens to be a single buffer. I also remember writing a patch that simplified import_iovec() by combining the iov_iter with a struct iovec iovec[UIO_FASTIOV]. All got bogged down in io_uring which would need changing first. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Linus Torvalds <torvalds@linux-foundation.org> wrote: > This patch only does that for the 'user_backed' thing, which was a similar > case. It makes some things a bit bigger, makes some a bit smaller: __iov_iter_get_pages_alloc dcr 0x331 -> 0x32a -0x7 _copy_from_iter dcr 0x36e -> 0x36a -0x4 _copy_from_iter_flushcache inc 0x359 -> 0x36b +0x12 _copy_mc_to_iter dcr 0x3a7 -> 0x39b -0xc _copy_to_iter inc 0x358 -> 0x359 +0x1 copy_page_to_iter_nofault.part.0 dcr 0x3f1 -> 0x3ef -0x2 csum_and_copy_from_iter dcr 0x3e8 -> 0x3e4 -0x4 csum_and_copy_to_iter inc 0x46a -> 0x46d +0x3 dup_iter inc 0x34 -> 0x39 +0x5 fault_in_iov_iter_readable inc 0x9b -> 0xa0 +0x5 fault_in_iov_iter_writeable inc 0x9b -> 0xa0 +0x5 first_iovec_segment inc 0x4a -> 0x51 +0x7 import_single_range dcr 0x62 -> 0x40 -0x22 import_ubuf dcr 0x65 -> 0x43 -0x22 iov_iter_advance inc 0xd7 -> 0x103 +0x2c iov_iter_alignment inc 0xe0 -> 0xe2 +0x2 iov_iter_extract_pages dcr 0x418 -> 0x416 -0x2 iov_iter_init dcr 0x31 -> 0x27 -0xa iov_iter_is_aligned inc 0xf3 -> 0x108 +0x15 iov_iter_npages inc 0x119 -> 0x11a +0x1 iov_iter_revert inc 0x88 -> 0x99 +0x11 iov_iter_single_seg_count inc 0x38 -> 0x3e +0x6 iov_iter_ubuf new 0x39 iov_iter_zero inc 0x34f -> 0x353 +0x4 iter_iov new 0x17 Adding an extra patch to get rid of the bitfields and using a u8 for the type and bools for the flags makes very little difference on top of the above: __iov_iter_get_pages_alloc inc 0x32a -> 0x32f +0x5 _copy_from_iter inc 0x36a -> 0x36d +0x3 copy_page_from_iter_atomic.part.0 inc 0x3cf -> 0x3d2 +0x3 csum_and_copy_to_iter dcr 0x46d -> 0x46a -0x3 iov_iter_advance dcr 0x103 -> 0xfd -0x6 iov_iter_extract_pages inc 0x416 -> 0x417 +0x1 iov_iter_init inc 0x27 -> 0x2d +0x6 iov_iter_revert dcr 0x99 -> 0x95 -0x4 For reference, I generated the stats with: nm build3/lib/iov_iter.o | sort >a ... change... nm build3/lib/iov_iter.o | sort >b perl analyse.pl a b where analyse.pl is attached. David --- #!/usr/bin/perl -w use strict; die "$0 <file_a> <file_b>" if ($#ARGV != 1); my ($file_a, $file_b) = @ARGV; die "$file_a: File not found\n" unless -r $file_a; die "$file_b: File not found\n" unless -r $file_b; my %a = (); my %b = (); my %c = (); sub read_one($$$) { my ($file, $list, $all) = @_; my $last = undef; open FD, "<$file" || die $file; while (<FD>) { if (/([0-9a-f][0-9a-f]+) [Tt] ([_a-zA-Z0-9.]*)/) { my $addr = hex $1; my $sym = $2; #print $addr, " ", $sym, "\n"; my %obj = ( sym => $sym, addr => $addr, size => 0 ); $list->{$sym} = \%obj; $all->{$sym} = 1; if ($last) { $last->{size} = $addr - $last->{addr}; } $last = \%obj; } } close(FD); } read_one($file_a, \%a, \%c); read_one($file_b, \%b, \%c); foreach my $sym (sort keys %c) { my $as = -1; my $bs = -1; $as = $a{$sym}->{size} if (exists($a{$sym})); $bs = $b{$sym}->{size} if (exists($b{$sym})); next if ($as == $bs); #next if ($sym =~ /__UNIQUE_ID/); if ($as == -1) { printf "%-40s new 0x%x\n", $sym, $bs; } elsif ($bs == -1) { printf "%-40s del 0x%x\n", $sym, $as; } elsif ($bs > $as) { printf "%-40s inc 0x%x -> 0x%x +0x%x\n", $sym, $as, $bs, $bs - $as; } else { printf "%-40s dcr 0x%x -> 0x%x -0x%x\n", $sym, $as, $bs, $as - $bs; } }
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Although I'm not sure the bit-fields really help. > > There are 8 bytes at the start of the structure, might as well > > use them :-) > > Actuallyç I wrote the patch that way because it seems to improve code > generation. > > The bitfields are generally all set together as just plain one-time > constants at initialization time, and gcc sees that it's a full byte > write. And the reason 'data_source' is not a bitfield is that it's not > a constant at iov_iter init time (it's an argument to all the init > functions), so having that one as a separate byte at init time is good > for code generation when you don't need to mask bits or anything like > that. > > And once initialized, having things be dense and doing all the > compares with a bitwise 'and' instead of doing them as some value > compare again tends to generate good code. Actually... I said that switch(enum) seemed to generate suboptimal code... However, if the enum is renumbered such that the constants are in the same order as in the switch() it generates better code. So we want this order: enum iter_type { ITER_UBUF, ITER_IOVEC, ITER_BVEC, ITER_KVEC, ITER_XARRAY, ITER_DISCARD, }; to match: switch (iov_iter_type(iter)) { case ITER_UBUF: progress = iterate_ubuf(iter, len, priv, priv2, ustep); break; case ITER_IOVEC: progress = iterate_iovec(iter, len, priv, priv2, ustep); break; case ITER_BVEC: progress = iterate_bvec(iter, len, priv, priv2, step); break; case ITER_KVEC: progress = iterate_kvec(iter, len, priv, priv2, step); break; case ITER_XARRAY: progress = iterate_xarray(iter, len, priv, priv2, step); break; case ITER_DISCARD: default: progress = len; break; } then gcc should be able to generate a ternary tree, which it does here: <+77>: mov (%rdx),%al <+79>: cmp $0x2,%al <+81>: je 0xffffffff81779bcc <_copy_from_iter+394> <+87>: ja 0xffffffff81779aa9 <_copy_from_iter+103> though it really split the number space in the wrong place. It can then use one CMP (or TEST) to split again: <+89>: test %al,%al <+91>: mov 0x8(%rdx),%rdx <+95>: jne 0xffffffff81779b48 <_copy_from_iter+262> <+101>: jmp 0xffffffff81779b17 <_copy_from_iter+213> It then should only require one CMP at this point, since AL can only be 4, 5 or 6+: <+103>: cmp $0x3,%al <+105>: je 0xffffffff81779c6e <_copy_from_iter+556> <+111>: cmp $0x4,%al <+113>: jne 0xffffffff81779dd2 <_copy_from_iter+912> The end result being that it should have at most two CMP instructions for any number in the range 0 to 5 - though in this case, it will have three for AL>3. However, it doesn't do this with if-if-if with a number sequence or a bitmask, but rather generates an chain of cmp-cmp-cmp or test-test-test, presumably because it fails to spot the if-conditions are related. I should log a gcc bug on this on the poor switch() behaviour. Also, if we renumber the enum to put UBUF and IOVEC first, we can get rid of iter->user_backed in favour of: static inline bool user_backed_iter(const struct iov_iter *i) { return iter_is_ubuf(i) || iter_is_iovec(i); } which gcc just changes into something like a "CMP $1" and a "JA". Comparing Linus's bit patch (+ is better) to renumbering the switch (- is better): __iov_iter_get_pages_alloc inc 0x32a -> 0x331 +0x7 _copy_from_iter dcr 0x3c7 -> 0x3bf -0x8 _copy_from_iter_flushcache inc 0x364 -> 0x370 +0xc _copy_from_iter_nocache inc 0x33e -> 0x347 +0x9 _copy_mc_to_iter dcr 0x3bc -> 0x3b6 -0x6 _copy_to_iter inc 0x34a -> 0x34b +0x1 copy_page_from_iter_atomic.part.0 dcr 0x424 -> 0x41c -0x8 copy_page_to_iter_nofault.part.0 dcr 0x3a9 -> 0x3a5 -0x4 csum_and_copy_from_iter inc 0x3e5 -> 0x3e8 +0x3 csum_and_copy_to_iter dcr 0x449 -> 0x448 -0x1 dup_iter inc 0x34 -> 0x37 +0x3 fault_in_iov_iter_readable dcr 0x9c -> 0x9a -0x2 fault_in_iov_iter_writeable dcr 0x9c -> 0x9a -0x2 iov_iter_advance dcr 0xde -> 0xd8 -0x6 iov_iter_alignment inc 0xe0 -> 0xe3 +0x3 iov_iter_extract_pages inc 0x416 -> 0x418 +0x2 iov_iter_gap_alignment dcr 0x69 -> 0x67 -0x2 iov_iter_init inc 0x27 -> 0x31 +0xa iov_iter_is_aligned dcr 0x104 -> 0xf5 -0xf iov_iter_npages inc 0x119 -> 0x11d +0x4 iov_iter_revert dcr 0x93 -> 0x86 -0xd iov_iter_single_seg_count dcr 0x41 -> 0x3c -0x5 iov_iter_ubuf inc 0x39 -> 0x3a +0x1 iov_iter_zero dcr 0x338 -> 0x32e -0xa memcpy_from_iter_mc inc 0x2a -> 0x2b +0x1 I think there may be more savings to be made if I go and convert more of the functions to using switch(). David
From: David Howells > Sent: Friday, August 18, 2023 4:20 PM > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > Although I'm not sure the bit-fields really help. > > > There are 8 bytes at the start of the structure, might as well > > > use them :-) > > > > Actuallyç I wrote the patch that way because it seems to improve code > > generation. > > > > The bitfields are generally all set together as just plain one-time > > constants at initialization time, and gcc sees that it's a full byte > > write. And the reason 'data_source' is not a bitfield is that it's not > > a constant at iov_iter init time (it's an argument to all the init > > functions), so having that one as a separate byte at init time is good > > for code generation when you don't need to mask bits or anything like > > that. > > > > And once initialized, having things be dense and doing all the > > compares with a bitwise 'and' instead of doing them as some value > > compare again tends to generate good code. > > Actually... I said that switch(enum) seemed to generate suboptimal code... > However, if the enum is renumbered such that the constants are in the same > order as in the switch() it generates better code. Hmmm.. the order of the switch labels really shouldn't matter. The advantage of the if-chain is that you can optimise for the most common case. > So we want this order: > > enum iter_type { > ITER_UBUF, > ITER_IOVEC, > ITER_BVEC, > ITER_KVEC, > ITER_XARRAY, > ITER_DISCARD, > }; Will gcc actually code this version without pessimising it? if (likely(type <= ITER_IOVEC) { if (likely(type != ITER_IOVEC)) iterate_ubuf(); else iterate_iovec(); } else if (likely(type) <= ITER_KVEC)) { if (type == ITER_KVEC) iterate_kvec(); else iterate_bvec(); } else if (type == ITER_XARRAY) { iterate_xarrar() } else { discard; } But I bet you can't stop it replicating the compares. (especially with the likely(). That has two mis-predicted (are they ever right!) branches in the common user-copy versions and three in the common kernel ones. In some architectures you might get the default 'fall through' to the UBUF code if the branches aren't predictable. But I believe current x86 cpu never do static prediction. So you always lose :-) ... > static inline bool user_backed_iter(const struct iov_iter *i) > { > return iter_is_ubuf(i) || iter_is_iovec(i); > } > > which gcc just changes into something like a "CMP $1" and a "JA". That makes sense... > Comparing Linus's bit patch (+ is better) to renumbering the switch (- is > better): > .... > iov_iter_init inc 0x27 -> 0x31 +0xa Are you hitting the gcc bug that loads the constant from memory? > I think there may be more savings to be made if I go and convert more of the > functions to using switch(). Size isn't everything, the code needs to be optimised for the hot paths. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
David Laight <David.Laight@ACULAB.COM> wrote: > > iov_iter_init inc 0x27 -> 0x31 +0xa > > Are you hitting the gcc bug that loads the constant from memory? I'm not sure what that looks like. For your perusal, here's a disassembly of the use-switch-on-enum variant: 0xffffffff8177726c <+0>: cmp $0x1,%esi 0xffffffff8177726f <+3>: jbe 0xffffffff81777273 <iov_iter_init+7> 0xffffffff81777271 <+5>: ud2 0xffffffff81777273 <+7>: test %esi,%esi 0xffffffff81777275 <+9>: movw $0x1,(%rdi) 0xffffffff8177727a <+14>: setne 0x3(%rdi) 0xffffffff8177727e <+18>: xor %eax,%eax 0xffffffff81777280 <+20>: movb $0x0,0x2(%rdi) 0xffffffff81777284 <+24>: movb $0x1,0x4(%rdi) 0xffffffff81777288 <+28>: mov %rax,0x8(%rdi) 0xffffffff8177728c <+32>: mov %rdx,0x10(%rdi) 0xffffffff81777290 <+36>: mov %r8,0x18(%rdi) 0xffffffff81777294 <+40>: mov %rcx,0x20(%rdi) 0xffffffff81777298 <+44>: jmp 0xffffffff81d728a0 <__x86_return_thunk> versus the use-bitmap variant: 0xffffffff81777311 <+0>: cmp $0x1,%esi 0xffffffff81777314 <+3>: jbe 0xffffffff81777318 <iov_iter_init+7> 0xffffffff81777316 <+5>: ud2 0xffffffff81777318 <+7>: test %esi,%esi 0xffffffff8177731a <+9>: movb $0x2,(%rdi) 0xffffffff8177731d <+12>: setne 0x1(%rdi) 0xffffffff81777321 <+16>: xor %eax,%eax 0xffffffff81777323 <+18>: mov %rdx,0x10(%rdi) 0xffffffff81777327 <+22>: mov %rax,0x8(%rdi) 0xffffffff8177732b <+26>: mov %r8,0x18(%rdi) 0xffffffff8177732f <+30>: mov %rcx,0x20(%rdi) 0xffffffff81777333 <+34>: jmp 0xffffffff81d72960 <__x86_return_thunk> It seems to be that the former is loading byte constants individually, whereas Linus combined all those fields into a single byte and eliminated one of them. David
From: David Howells > Sent: Friday, August 18, 2023 5:49 PM > > David Laight <David.Laight@ACULAB.COM> wrote: > > > > iov_iter_init inc 0x27 -> 0x31 +0xa > > > > Are you hitting the gcc bug that loads the constant from memory? > > I'm not sure what that looks like. For your perusal, here's a disassembly of > the use-switch-on-enum variant: > > 0xffffffff8177726c <+0>: cmp $0x1,%esi > 0xffffffff8177726f <+3>: jbe 0xffffffff81777273 <iov_iter_init+7> > 0xffffffff81777271 <+5>: ud2 > 0xffffffff81777273 <+7>: test %esi,%esi > 0xffffffff81777275 <+9>: movw $0x1,(%rdi) > 0xffffffff8177727a <+14>: setne 0x3(%rdi) > 0xffffffff8177727e <+18>: xor %eax,%eax > 0xffffffff81777280 <+20>: movb $0x0,0x2(%rdi) > 0xffffffff81777284 <+24>: movb $0x1,0x4(%rdi) > 0xffffffff81777288 <+28>: mov %rax,0x8(%rdi) > 0xffffffff8177728c <+32>: mov %rdx,0x10(%rdi) > 0xffffffff81777290 <+36>: mov %r8,0x18(%rdi) > 0xffffffff81777294 <+40>: mov %rcx,0x20(%rdi) > 0xffffffff81777298 <+44>: jmp 0xffffffff81d728a0 <__x86_return_thunk> > > versus the use-bitmap variant: > > 0xffffffff81777311 <+0>: cmp $0x1,%esi > 0xffffffff81777314 <+3>: jbe 0xffffffff81777318 <iov_iter_init+7> > 0xffffffff81777316 <+5>: ud2 > 0xffffffff81777318 <+7>: test %esi,%esi > 0xffffffff8177731a <+9>: movb $0x2,(%rdi) > 0xffffffff8177731d <+12>: setne 0x1(%rdi) > 0xffffffff81777321 <+16>: xor %eax,%eax > 0xffffffff81777323 <+18>: mov %rdx,0x10(%rdi) > 0xffffffff81777327 <+22>: mov %rax,0x8(%rdi) > 0xffffffff8177732b <+26>: mov %r8,0x18(%rdi) > 0xffffffff8177732f <+30>: mov %rcx,0x20(%rdi) > 0xffffffff81777333 <+34>: jmp 0xffffffff81d72960 <__x86_return_thunk> > > It seems to be that the former is loading byte constants individually, whereas > Linus combined all those fields into a single byte and eliminated one of them. I think you need to re-order the structure. The top set writes to bytes 0..4 with: > 0xffffffff81777275 <+9>: movw $0x1,(%rdi) > 0xffffffff8177727a <+14>: setne 0x3(%rdi) > 0xffffffff81777280 <+20>: movb $0x0,0x2(%rdi) > 0xffffffff81777284 <+24>: movb $0x1,0x4(%rdi) Note that the 'setne' writes into the middle of the constants. The lower writes bytes 0..1 with: > 0xffffffff8177731a <+9>: movb $0x2,(%rdi) > 0xffffffff8177731d <+12>: setne 0x1(%rdi) I think that if you move the 'conditional' value to offset 4 you'll get fewer writes. Probably a 32bit load into %eax and then a write. I don't think gcc likes generating 16bit immediates. In some tests I did it loaded a 32bit value into %eax and then wrote the low bits. So the code is much the same (on x86) for 2 or 4 bytes of constants. I'm sure you can use the 'data-16' prefix with an immediate. I'm not sure why you have two non-zero values when Linus only had one though. OTOH you don't want to be writing 3 bytes of constants. Also gcc won't generate: movl $0xaabbccdd,%eax setne %al // overwriting the dd movl %eax,(%rdi) and I suspect the partial write (to %al) will be a stall. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 378da0cb3069..33febccadd9d 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -450,14 +450,33 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i) EXPORT_SYMBOL_GPL(_copy_mc_to_iter); #endif /* CONFIG_ARCH_HAS_COPY_MC */ -static size_t memcpy_from_iter_mc(void *iter_from, size_t progress, - size_t len, void *to, void *priv2) +static __always_inline +size_t memcpy_from_iter_mc(void *iter_from, size_t progress, + size_t len, void *to, void *priv2) { - struct iov_iter *iter = priv2; + return copy_mc_to_kernel(to + progress, iter_from, len); +} - if (iov_iter_is_copy_mc(iter)) - return copy_mc_to_kernel(to + progress, iter_from, len); - return memcpy_from_iter(iter_from, progress, len, to, priv2); +static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i) +{ + size_t progress; + + if (unlikely(i->count < bytes)) + bytes = i->count; + if (unlikely(!bytes)) + return 0; + progress = iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc); + i->count -= progress; + return progress; +} + +static __always_inline +size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i) +{ + if (unlikely(iov_iter_is_copy_mc(i))) + return __copy_from_iter_mc(addr, bytes, i); + return iterate_and_advance(i, bytes, addr, + copy_from_user_iter, memcpy_from_iter); } size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i) @@ -467,9 +486,7 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i) if (user_backed_iter(i)) might_fault(); - return iterate_and_advance2(i, bytes, addr, i, - copy_from_user_iter, - memcpy_from_iter_mc); + return __copy_from_iter(addr, bytes, i); } EXPORT_SYMBOL(_copy_from_iter); @@ -690,9 +707,7 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset, } p = kmap_atomic(page) + offset; - n = iterate_and_advance2(i, n, p, i, - copy_from_user_iter, - memcpy_from_iter_mc); + __copy_from_iter(p, n, i); kunmap_atomic(p); copied += n; offset += n;