diff mbox series

[v3,2/2] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()

Message ID 20230816120741.534415-3-dhowells@redhat.com (mailing list archive)
State New, archived
Headers show
Series iov_iter: Convert the iterator macros into inline funcs | expand

Commit Message

David Howells Aug. 16, 2023, 12:07 p.m. UTC
iter->copy_mc is only used with a bvec iterator and only by
dump_emit_page() in fs/coredump.c so rather than handle this in
memcpy_from_iter_mc() where it is checked repeatedly by _copy_from_iter() and
copy_page_from_iter_atomic(),
---
 lib/iov_iter.c | 39 +++++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 12 deletions(-)

Comments

David Laight Aug. 16, 2023, 12:28 p.m. UTC | #1
From: David Howells
> Sent: Wednesday, August 16, 2023 1:08 PM

Given:

> iter->copy_mc is only used with a bvec iterator and only by
> dump_emit_page() in fs/coredump.c so rather than handle this in
> memcpy_from_iter_mc() where it is checked repeatedly by _copy_from_iter() and
> copy_page_from_iter_atomic(),

Instead of adding the extra check to every copy:

> +static __always_inline
> +size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
> +{
> +	if (unlikely(iov_iter_is_copy_mc(i)))
> +		return __copy_from_iter_mc(addr, bytes, i);
> +	return iterate_and_advance(i, bytes, addr,
> +				   copy_from_user_iter, memcpy_from_iter);
>  }

Couldn't the relevant code directly call __copy_frtom_iter_mc() ?
Or a version then checked iov_is_copy_mc() and then fell
back to the standard function.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Howells Aug. 16, 2023, 1 p.m. UTC | #2
David Laight <David.Laight@ACULAB.COM> wrote:

> 
> Couldn't the relevant code directly call __copy_from_iter_mc() ?
> Or a version then checked iov_is_copy_mc() and then fell
> back to the standard function.

No, because the marked iterator is handed by the coredump code to
__kernel_write_iter() and thence on to who-knows-what driver - which will call
copy_from_iter() or some such.  $DRIVER shouldn't need to know about
->copy_mc.

One thing I do wonder about, though, is what should happen if they call, say,
csum_and_copy_from_iter()?  That doesn't have an _mc variant.  Or what if they
extract the pages and operate directly on those?

David
David Laight Aug. 16, 2023, 2:19 p.m. UTC | #3
From: David Howells
> Sent: Wednesday, August 16, 2023 2:01 PM
> 
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> >
> > Couldn't the relevant code directly call __copy_from_iter_mc() ?
> > Or a version then checked iov_is_copy_mc() and then fell
> > back to the standard function.
> 
> No, because the marked iterator is handed by the coredump code to
> __kernel_write_iter() and thence on to who-knows-what driver - which will call
> copy_from_iter() or some such.  $DRIVER shouldn't need to know about
> ->copy_mc.

What about ITER_BVEC_MC ??

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Linus Torvalds Aug. 16, 2023, 6:50 p.m. UTC | #4
On Wed, 16 Aug 2023 at 14:19, David Laight <David.Laight@aculab.com> wrote:
>
> What about ITER_BVEC_MC ??

That probably would be the best option. Just make it a proper
ITER_xyz, instead of an odd sub-case for one ITER (but set up in such
a way that it looks like it might happen for other ITER_xyz cases).

                Linus
David Howells Aug. 16, 2023, 8:35 p.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > What about ITER_BVEC_MC ??
> 
> That probably would be the best option. Just make it a proper
> ITER_xyz, instead of an odd sub-case for one ITER (but set up in such
> a way that it looks like it might happen for other ITER_xyz cases).

I'm not sure that buys us anything.  It would then require every call to
iov_iter_is_bvec()[*] to check for two values instead of one - including in
iterate_and_advance() - and *still* we'd have to have the special-casing in
_copy_from_iter() and copy_page_from_iter_atomic().

The issue is that ITER_xyz changes the iteration function - but we don't
actually want to do that; rather, we need to change the step function.

David

[*] There's a bunch of them outside of iov_iter.c.
Linus Torvalds Aug. 17, 2023, 4:18 a.m. UTC | #6
On Wed, 16 Aug 2023 at 22:35, David Howells <dhowells@redhat.com> wrote:
>
> I'm not sure that buys us anything.  It would then require every call to
> iov_iter_is_bvec()[*] to check for two values instead of one

Well, that part is trivially fixable, and we should do that anyway for
other reasons.

See the attached patch.

> The issue is that ITER_xyz changes the iteration function - but we don't
> actually want to do that; rather, we need to change the step function.

Yeah, that may be the fundamental issue. But making the ITER_xyz flags
be bit masks would help - partly exactly because it makes it so
trivial yo say "for this set of ITER_xyz, do this".

This patch only does that for the 'user_backed' thing, which was a similar case.

Hmm?

               Linus
David Laight Aug. 17, 2023, 8:41 a.m. UTC | #7
> This patch only does that for the 'user_backed' thing, which was a similar case.

Yes, having two fields that have to be set to match is a recipe
for disaster.

Although I'm not sure the bit-fields really help.
There are 8 bytes at the start of the structure, might as well
use them :-)
OTOH the 'nofault' and 'copy_mc' flags could be put into much
higher bits of a 32bit value.

Alternatively put both/all the USER values first.
Then an ordered compare could be used.

If everything is actually inlined could the 'iter' be passed
through to the step() functions?
Although is might be better to pass a cached copy of iter->iter_type
(although that might just end up spilling to stack.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Linus Torvalds Aug. 17, 2023, 2:38 p.m. UTC | #8
On Thu, 17 Aug 2023 at 10:42, David Laight <David.Laight@aculab.com> wrote:
>
> Although I'm not sure the bit-fields really help.
> There are 8 bytes at the start of the structure, might as well
> use them :-)

Actuallyç I wrote the patch that way because it seems to improve code
generation.

The bitfields are generally all set together as just plain one-time
constants at initialization time, and gcc sees that it's a full byte
write. And the reason 'data_source' is not a bitfield is that it's not
a constant at iov_iter init time (it's an argument to all the init
functions), so having that one as a separate byte at init time is good
for code generation when you don't need to mask bits or anything like
that.

And once initialized, having things be dense and doing all the
compares with a bitwise 'and' instead of doing them as some value
compare again tends to generate good code.

Then being able to test multiple bits at the same time is just gravy
on top of that (ie that whole "remove user_backed, because it's easier
to just test the bit combination").

> OTOH the 'nofault' and 'copy_mc' flags could be put into much
> higher bits of a 32bit value.

Once you start doing that, you often get bigger constants in the code stream.

I didn't do any *extensive* testing of the code generation, but the
stuff I looked at looked good.

               Linus
David Laight Aug. 17, 2023, 3:16 p.m. UTC | #9
From: Linus Torvalds
> Sent: Thursday, August 17, 2023 3:38 PM
> 
> On Thu, 17 Aug 2023 at 10:42, David Laight <David.Laight@aculab.com> wrote:
> >
> > Although I'm not sure the bit-fields really help.
> > There are 8 bytes at the start of the structure, might as well
> > use them :-)
> 
> Actuallyç I wrote the patch that way because it seems to improve code
> generation.
> 
> The bitfields are generally all set together as just plain one-time
> constants at initialization time, and gcc sees that it's a full byte
> write.

I've just spent too long on godbolt (again) :-)
Fiddling with:

#define t1 unsigned char
struct b {
    t1 b1:7;
    t1 b2:1;
};

void ff(struct b *,int);

void ff1(void)
{
    struct b b = {.b1=3, .b2 = 1};
    ff(&b, sizeof b);
}

gcc for x86-64 make a pigs-breakfast when the bitfields are 'char'
and loads the constant from memory using pc-relative access.
Otherwise pretty must all variants (with or without the bitfield)
get initialised in a single write.
(Although gcc seems to insist on loading a 32bit constant into %eax.)

I can well imagine that keeping the constant below 32768 will help
on architectures that have to construct large constants.

> And the reason 'data_source' is not a bitfield is that it's not
> a constant at iov_iter init time (it's an argument to all the init
> functions), so having that one as a separate byte at init time is good
> for code generation when you don't need to mask bits or anything like
> that.
> 
> And once initialized, having things be dense and doing all the
> compares with a bitwise 'and' instead of doing them as some value
> compare again tends to generate good code.
> 
> Then being able to test multiple bits at the same time is just gravy
> on top of that (ie that whole "remove user_backed, because it's easier
> to just test the bit combination").

Indeed, they used to be bits but never got tested together.

> > OTOH the 'nofault' and 'copy_mc' flags could be put into much
> > higher bits of a 32bit value.
> 
> Once you start doing that, you often get bigger constants in the code stream.

I wasn't thinking of using 'really big' values :-)
Even 32768 can be an issue because some cpu sign extend all constants.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Linus Torvalds Aug. 17, 2023, 3:31 p.m. UTC | #10
On Thu, 17 Aug 2023 at 17:16, David Laight <David.Laight@aculab.com> wrote:
>
> gcc for x86-64 make a pigs-breakfast when the bitfields are 'char'
> and loads the constant from memory using pc-relative access.

I think your godbolt tests must be with some other model than what the
kernel uses.

For example, for me, iov_iter_init generates

        testl   %esi, %esi      # direction test
        movb    $1, (%rdi)      # bitfields
        movq    $0, 8(%rdi)
        movq    %rdx, 16(%rdi)
        movq    %r8, 24(%rdi)
        movq    %rcx, 32(%rdi)
        setne   1(%rdi)            # set the direction byte

with my patch for me. Which is pretty much optimal.

*Without& the patch, I get

        movzwl  .LC1(%rip), %eax
        testl   %esi, %esi
        movb    $0, (%rdi)
        movb    $1, 4(%rdi)
        movw    %ax, 1(%rdi)
        movq    $0, 8(%rdi)
        movq    %rdx, 16(%rdi)
        movq    %r8, 24(%rdi)
        movq    %rcx, 32(%rdi)
        setne   3(%rdi)

which is that disgusting "move two bytes from memory", and makes
absolutely no sense as a way to "write 2 zero bytes":

.LC1:
        .byte   0
        .byte   0

I think that's some odd gcc bug, actually.


> Otherwise pretty must all variants (with or without the bitfield)
> get initialised in a single write.

So there may well be some odd gcc code generation issue that is
triggered by the fact that we use an initializer to set those things,
and we then have two bytes (with my patch) followed by a hole, or
three bytes (without it) followed by a hole.

But that bitfield thing improves things at least for me. I think the
example code you used for godbolt is actually something else than what
the kernel does.

               Linus
David Laight Aug. 17, 2023, 4:06 p.m. UTC | #11
From: Linus Torvalds
> Sent: Thursday, August 17, 2023 4:31 PM
...
>         movzwl  .LC1(%rip), %eax
>         testl   %esi, %esi
>         movb    $0, (%rdi)
>         movb    $1, 4(%rdi)
>         movw    %ax, 1(%rdi)
>         movq    $0, 8(%rdi)
>         movq    %rdx, 16(%rdi)
>         movq    %r8, 24(%rdi)
>         movq    %rcx, 32(%rdi)
>         setne   3(%rdi)
> 
> which is that disgusting "move two bytes from memory", and makes
> absolutely no sense as a way to "write 2 zero bytes":
> 
> .LC1:
>         .byte   0
>         .byte   0
> 
> I think that's some odd gcc bug, actually.

I get that with some code, but not others.
Seems to depend on random other stuff.
Happens for:
	struct { unsigned char x:7, y:1; };
but not if I add anything after if (that gets zeroed).
Which seems to be the opposite of what you see.

If I use explicit assignments (rather than an initialiser)
I still get merged writes (even if not a bitfield) but also
lose the memory access.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Howells Aug. 18, 2023, 11:39 a.m. UTC | #12
Would it make sense to always check for MCE in _copy_from_iter() and always
return a short transfer if we encounter one?  It looks pretty cheap in terms
of code size as the exception table stuff handles it, so we don't need to do
anything in the normal path.

I guess this would change the handling of memory errors and DAX errors.

David
---
iov_iter: Always handle MCE in _copy_to_iter()

(incomplete)

---
 arch/x86/include/asm/mce.h |   22 ++++++++++++++++++++++
 fs/coredump.c              |    1 -
 include/linux/uio.h        |   16 ----------------
 lib/iov_iter.c             |   17 +++++------------
 4 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 180b1cbfcc4e..ee3ff090360d 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -353,4 +353,26 @@ static inline void mce_hygon_feature_init(struct cpuinfo_x86 *c)	{ return mce_am
 
 unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len);
 
+static __always_inline __must_check
+unsigned long memcpy_mc(void *to, const void *from, unsigned long len)
+{
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+	/*
+	 * If CPU has FSRM feature, use 'rep movs'.
+	 * Otherwise, use rep_movs_alternative.
+	 */
+	asm volatile(
+		"1:\n\t"
+		ALTERNATIVE("rep movsb",
+			    "call rep_movs_alternative", ALT_NOT(X86_FEATURE_FSRM))
+		"2:\n"
+		_ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_DEFAULT_MCE_SAFE)
+		:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
+		: : "memory", "rax", "r8", "r9", "r10", "r11");
+#else
+	return memcpy(to, from, len);
+#endif
+	return len;
+}
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/fs/coredump.c b/fs/coredump.c
index 9d235fa14ab9..ad54102a5e14 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -884,7 +884,6 @@ static int dump_emit_page(struct coredump_params *cprm, struct page *page)
 	pos = file->f_pos;
 	bvec_set_page(&bvec, page, PAGE_SIZE, 0);
 	iov_iter_bvec(&iter, ITER_SOURCE, &bvec, 1, PAGE_SIZE);
-	iov_iter_set_copy_mc(&iter);
 	n = __kernel_write_iter(cprm->file, &iter, &pos);
 	if (n != PAGE_SIZE)
 		return 0;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 42bce38a8e87..73078ba297b7 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -40,7 +40,6 @@ struct iov_iter_state {
 
 struct iov_iter {
 	u8 iter_type;
-	bool copy_mc;
 	bool nofault;
 	bool data_source;
 	bool user_backed;
@@ -252,22 +251,8 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i);
 
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
-static inline void iov_iter_set_copy_mc(struct iov_iter *i)
-{
-	i->copy_mc = true;
-}
-
-static inline bool iov_iter_is_copy_mc(const struct iov_iter *i)
-{
-	return i->copy_mc;
-}
 #else
 #define _copy_mc_to_iter _copy_to_iter
-static inline void iov_iter_set_copy_mc(struct iov_iter *i) { }
-static inline bool iov_iter_is_copy_mc(const struct iov_iter *i)
-{
-	return false;
-}
 #endif
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
@@ -382,7 +367,6 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_UBUF,
-		.copy_mc = false,
 		.user_backed = true,
 		.data_source = direction,
 		.ubuf = buf,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d282fd4d348f..887d9cb9be4e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -14,6 +14,7 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 #include <linux/iov_iter.h>
+#include <asm/mce.h>
 
 static __always_inline
 size_t copy_to_user_iter(void __user *iter_to, size_t progress,
@@ -168,7 +169,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
-		.copy_mc = false,
 		.nofault = false,
 		.user_backed = true,
 		.data_source = direction,
@@ -254,14 +254,11 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
 #endif /* CONFIG_ARCH_HAS_COPY_MC */
 
-static size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
-				  size_t len, void *to, void *priv2)
+static __always_inline
+size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
+			   size_t len, void *to, void *priv2)
 {
-	struct iov_iter *iter = priv2;
-
-	if (iov_iter_is_copy_mc(iter))
-		return copy_mc_to_kernel(to + progress, iter_from, len);
-	return memcpy_from_iter(iter_from, progress, len, to, priv2);
+	return memcpy_mc(to + progress, iter_from, len);
 }
 
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
@@ -632,7 +629,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter){
 		.iter_type = ITER_KVEC,
-		.copy_mc = false,
 		.data_source = direction,
 		.kvec = kvec,
 		.nr_segs = nr_segs,
@@ -649,7 +645,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter){
 		.iter_type = ITER_BVEC,
-		.copy_mc = false,
 		.data_source = direction,
 		.bvec = bvec,
 		.nr_segs = nr_segs,
@@ -678,7 +673,6 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
 	BUG_ON(direction & ~1);
 	*i = (struct iov_iter) {
 		.iter_type = ITER_XARRAY,
-		.copy_mc = false,
 		.data_source = direction,
 		.xarray = xarray,
 		.xarray_start = start,
@@ -702,7 +696,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 	BUG_ON(direction != READ);
 	*i = (struct iov_iter){
 		.iter_type = ITER_DISCARD,
-		.copy_mc = false,
 		.data_source = false,
 		.count = count,
 		.iov_offset = 0
David Howells Aug. 18, 2023, 11:42 a.m. UTC | #13
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Well, that part is trivially fixable, and we should do that anyway for
> other reasons.
> ..
>  enum iter_type {
>  	/* iter types */
> -	ITER_IOVEC,
> -	ITER_KVEC,
> -	ITER_BVEC,
> -	ITER_XARRAY,
> -	ITER_DISCARD,
> -	ITER_UBUF,
> +	ITER_IOVEC = 1,
> +	ITER_UBUF = 2,
> +	ITER_KVEC = 4,
> +	ITER_BVEC = 8,
> +	ITER_XARRAY = 16,
> +	ITER_DISCARD = 32,
>  };

It used to be this way, but Al switched it:

	8cd54c1c848031a87820e58d772166ffdf8c08c0
	iov_iter: separate direction from flavour

David
David Laight Aug. 18, 2023, 12:16 p.m. UTC | #14
From: David Howells
> Sent: Friday, August 18, 2023 12:43 PM
> 
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Well, that part is trivially fixable, and we should do that anyway for
> > other reasons.
> > ..
> >  enum iter_type {
> >  	/* iter types */
> > -	ITER_IOVEC,
> > -	ITER_KVEC,
> > -	ITER_BVEC,
> > -	ITER_XARRAY,
> > -	ITER_DISCARD,
> > -	ITER_UBUF,
> > +	ITER_IOVEC = 1,
> > +	ITER_UBUF = 2,
> > +	ITER_KVEC = 4,
> > +	ITER_BVEC = 8,
> > +	ITER_XARRAY = 16,
> > +	ITER_DISCARD = 32,

That could be zero - no bits and default.

> >  };
> 
> It used to be this way, but Al switched it:
> 
> 	8cd54c1c848031a87820e58d772166ffdf8c08c0
> 	iov_iter: separate direction from flavour

Except it also had the direction flag inside the enum.
That caused its own piles of grief.

IIRC Linus had type:6 - that doesn't leave any headroom
for additional types (even though they shouldn't proliferate).
It may be best to avoid bits 15+ (in a bitfield) due to
issues with large constants and sign extension.
On x86 (I think) 'and immediate' and 'bit test' are the same
size for bit 0 to 7, BIT wins for higher bits.

gcc generates strange code for some initialisers (see yesterday's
thread) and you definitely mustn't leave unused bits in a bitfield.
Might be better is the fields are assigned later!

(I also saw clang carefully preserving %eax on stabck!)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox Aug. 18, 2023, 12:26 p.m. UTC | #15
On Fri, Aug 18, 2023 at 12:16:23PM +0000, David Laight wrote:
> > > +	ITER_IOVEC = 1,
> > > +	ITER_UBUF = 2,
> > > +	ITER_KVEC = 4,
> > > +	ITER_BVEC = 8,
> > > +	ITER_XARRAY = 16,
> > > +	ITER_DISCARD = 32,
> 
> IIRC Linus had type:6 - that doesn't leave any headroom
> for additional types (even though they shouldn't proliferate).

I have proposed an ITER_KBUF in the past (it is to KVEC as UBUF is
to IOVEC).  I didn't care enough to keep pushing it, but it's clearly
a common idiom.
David Laight Aug. 18, 2023, 12:41 p.m. UTC | #16
From: Matthew Wilcox
> Sent: Friday, August 18, 2023 1:27 PM
> 
> On Fri, Aug 18, 2023 at 12:16:23PM +0000, David Laight wrote:
> > > > +	ITER_IOVEC = 1,
> > > > +	ITER_UBUF = 2,
> > > > +	ITER_KVEC = 4,
> > > > +	ITER_BVEC = 8,
> > > > +	ITER_XARRAY = 16,
> > > > +	ITER_DISCARD = 32,
> >
> > IIRC Linus had type:6 - that doesn't leave any headroom
> > for additional types (even though they shouldn't proliferate).
> 
> I have proposed an ITER_KBUF in the past (it is to KVEC as UBUF is
> to IOVEC).  I didn't care enough to keep pushing it, but it's clearly
> a common idiom.

Indeed, I didn't spot UBUF going in - I spot a lot of stuff.

I did wonder if you could optimise for a vector length of 1
(inside the KVEC conditional).
That would also pick up the cases where there only happens
to be a single buffer.

I also remember writing a patch that simplified import_iovec()
by combining the iov_iter with a struct iovec iovec[UIO_FASTIOV].
All got bogged down in io_uring which would need changing first.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Howells Aug. 18, 2023, 1:33 p.m. UTC | #17
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> This patch only does that for the 'user_backed' thing, which was a similar
> case.

It makes some things a bit bigger, makes some a bit smaller:

__iov_iter_get_pages_alloc               dcr 0x331 -> 0x32a -0x7
_copy_from_iter                          dcr 0x36e -> 0x36a -0x4
_copy_from_iter_flushcache               inc 0x359 -> 0x36b +0x12
_copy_mc_to_iter                         dcr 0x3a7 -> 0x39b -0xc
_copy_to_iter                            inc 0x358 -> 0x359 +0x1
copy_page_to_iter_nofault.part.0         dcr 0x3f1 -> 0x3ef -0x2
csum_and_copy_from_iter                  dcr 0x3e8 -> 0x3e4 -0x4
csum_and_copy_to_iter                    inc 0x46a -> 0x46d +0x3
dup_iter                                 inc 0x34 -> 0x39 +0x5
fault_in_iov_iter_readable               inc 0x9b -> 0xa0 +0x5
fault_in_iov_iter_writeable              inc 0x9b -> 0xa0 +0x5
first_iovec_segment                      inc 0x4a -> 0x51 +0x7
import_single_range                      dcr 0x62 -> 0x40 -0x22
import_ubuf                              dcr 0x65 -> 0x43 -0x22
iov_iter_advance                         inc 0xd7 -> 0x103 +0x2c
iov_iter_alignment                       inc 0xe0 -> 0xe2 +0x2
iov_iter_extract_pages                   dcr 0x418 -> 0x416 -0x2
iov_iter_init                            dcr 0x31 -> 0x27 -0xa
iov_iter_is_aligned                      inc 0xf3 -> 0x108 +0x15
iov_iter_npages                          inc 0x119 -> 0x11a +0x1
iov_iter_revert                          inc 0x88 -> 0x99 +0x11
iov_iter_single_seg_count                inc 0x38 -> 0x3e +0x6
iov_iter_ubuf                            new 0x39
iov_iter_zero                            inc 0x34f -> 0x353 +0x4
iter_iov                                 new 0x17

Adding an extra patch to get rid of the bitfields and using a u8 for the type
and bools for the flags makes very little difference on top of the above:

__iov_iter_get_pages_alloc               inc 0x32a -> 0x32f +0x5
_copy_from_iter                          inc 0x36a -> 0x36d +0x3
copy_page_from_iter_atomic.part.0        inc 0x3cf -> 0x3d2 +0x3
csum_and_copy_to_iter                    dcr 0x46d -> 0x46a -0x3
iov_iter_advance                         dcr 0x103 -> 0xfd -0x6
iov_iter_extract_pages                   inc 0x416 -> 0x417 +0x1
iov_iter_init                            inc 0x27 -> 0x2d +0x6
iov_iter_revert                          dcr 0x99 -> 0x95 -0x4

For reference, I generated the stats with:

	nm build3/lib/iov_iter.o  | sort >a
	... change...
	nm build3/lib/iov_iter.o  | sort >b
	perl analyse.pl a b

where analyse.pl is attached.

David
---
#!/usr/bin/perl -w
use strict;

die "$0 <file_a> <file_b>" if ($#ARGV != 1);
my ($file_a, $file_b) = @ARGV;
die "$file_a: File not found\n" unless -r $file_a;
die "$file_b: File not found\n" unless -r $file_b;

my %a = ();
my %b = ();
my %c = ();

sub read_one($$$)
{
    my ($file, $list, $all) = @_;
    my $last = undef;

    open FD, "<$file" || die $file;
    while (<FD>) {
	if (/([0-9a-f][0-9a-f]+) [Tt] ([_a-zA-Z0-9.]*)/) {
	    my $addr = hex $1;
	    my $sym = $2;
	    #print $addr, " ", $sym, "\n";

	    my %obj = (
		sym	=> $sym,
		addr	=> $addr,
		size	=> 0
		);

	    $list->{$sym} = \%obj;
	    $all->{$sym} = 1;

	    if ($last) {
		$last->{size} = $addr - $last->{addr};
	    }

	    $last = \%obj;
	}
    }
    close(FD);
}

read_one($file_a, \%a, \%c);
read_one($file_b, \%b, \%c);

foreach my $sym (sort keys %c) {
    my $as = -1;
    my $bs = -1;

    $as = $a{$sym}->{size} if (exists($a{$sym}));
    $bs = $b{$sym}->{size} if (exists($b{$sym}));

    next if ($as == $bs);
    #next if ($sym =~ /__UNIQUE_ID/);

    if ($as == -1) {
	printf "%-40s new 0x%x\n", $sym, $bs;
    } elsif ($bs == -1) {
	printf "%-40s del 0x%x\n", $sym, $as;
    } elsif ($bs > $as) {
	printf "%-40s inc 0x%x -> 0x%x +0x%x\n", $sym, $as, $bs, $bs - $as;
    } else {
	printf "%-40s dcr 0x%x -> 0x%x -0x%x\n", $sym, $as, $bs, $as - $bs;
    }
}
David Howells Aug. 18, 2023, 3:19 p.m. UTC | #18
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Although I'm not sure the bit-fields really help.
> > There are 8 bytes at the start of the structure, might as well
> > use them :-)
> 
> Actuallyç I wrote the patch that way because it seems to improve code
> generation.
> 
> The bitfields are generally all set together as just plain one-time
> constants at initialization time, and gcc sees that it's a full byte
> write. And the reason 'data_source' is not a bitfield is that it's not
> a constant at iov_iter init time (it's an argument to all the init
> functions), so having that one as a separate byte at init time is good
> for code generation when you don't need to mask bits or anything like
> that.
> 
> And once initialized, having things be dense and doing all the
> compares with a bitwise 'and' instead of doing them as some value
> compare again tends to generate good code.

Actually...  I said that switch(enum) seemed to generate suboptimal code...
However, if the enum is renumbered such that the constants are in the same
order as in the switch() it generates better code.

So we want this order:

	enum iter_type {
		ITER_UBUF,
		ITER_IOVEC,
		ITER_BVEC,
		ITER_KVEC,
		ITER_XARRAY,
		ITER_DISCARD,
	};

to match:

	switch (iov_iter_type(iter)) {
	case ITER_UBUF:
		progress = iterate_ubuf(iter, len, priv, priv2, ustep);
		break;
	case ITER_IOVEC:
		progress = iterate_iovec(iter, len, priv, priv2, ustep);
		break;
	case ITER_BVEC:
		progress = iterate_bvec(iter, len, priv, priv2, step);
		break;
	case ITER_KVEC:
		progress = iterate_kvec(iter, len, priv, priv2, step);
		break;
	case ITER_XARRAY:
		progress = iterate_xarray(iter, len, priv, priv2, step);
		break;
	case ITER_DISCARD:
	default:
		progress = len;
		break;
	}

then gcc should be able to generate a ternary tree, which it does here:

	<+77>:	mov    (%rdx),%al
	<+79>:	cmp    $0x2,%al
	<+81>:	je     0xffffffff81779bcc <_copy_from_iter+394>
	<+87>:	ja     0xffffffff81779aa9 <_copy_from_iter+103>

though it really split the number space in the wrong place.  It can then use
one CMP (or TEST) to split again:

	<+89>:	test   %al,%al
	<+91>:	mov    0x8(%rdx),%rdx
	<+95>:	jne    0xffffffff81779b48 <_copy_from_iter+262>
	<+101>:	jmp    0xffffffff81779b17 <_copy_from_iter+213>

It then should only require one CMP at this point, since AL can only be 4, 5
or 6+:

	<+103>:	cmp    $0x3,%al
	<+105>:	je     0xffffffff81779c6e <_copy_from_iter+556>
	<+111>:	cmp    $0x4,%al
	<+113>:	jne    0xffffffff81779dd2 <_copy_from_iter+912>

The end result being that it should have at most two CMP instructions for any
number in the range 0 to 5 - though in this case, it will have three for AL>3.

However, it doesn't do this with if-if-if with a number sequence or a bitmask,
but rather generates an chain of cmp-cmp-cmp or test-test-test, presumably
because it fails to spot the if-conditions are related.

I should log a gcc bug on this on the poor switch() behaviour.

Also, if we renumber the enum to put UBUF and IOVEC first, we can get rid of
iter->user_backed in favour of:

	static inline bool user_backed_iter(const struct iov_iter *i)
	{
		return iter_is_ubuf(i) || iter_is_iovec(i);
	}

which gcc just changes into something like a "CMP $1" and a "JA".


Comparing Linus's bit patch (+ is better) to renumbering the switch (- is
better):

__iov_iter_get_pages_alloc               inc 0x32a -> 0x331 +0x7
_copy_from_iter                          dcr 0x3c7 -> 0x3bf -0x8
_copy_from_iter_flushcache               inc 0x364 -> 0x370 +0xc
_copy_from_iter_nocache                  inc 0x33e -> 0x347 +0x9
_copy_mc_to_iter                         dcr 0x3bc -> 0x3b6 -0x6
_copy_to_iter                            inc 0x34a -> 0x34b +0x1
copy_page_from_iter_atomic.part.0        dcr 0x424 -> 0x41c -0x8
copy_page_to_iter_nofault.part.0         dcr 0x3a9 -> 0x3a5 -0x4
csum_and_copy_from_iter                  inc 0x3e5 -> 0x3e8 +0x3
csum_and_copy_to_iter                    dcr 0x449 -> 0x448 -0x1
dup_iter                                 inc 0x34 -> 0x37 +0x3
fault_in_iov_iter_readable               dcr 0x9c -> 0x9a -0x2
fault_in_iov_iter_writeable              dcr 0x9c -> 0x9a -0x2
iov_iter_advance                         dcr 0xde -> 0xd8 -0x6
iov_iter_alignment                       inc 0xe0 -> 0xe3 +0x3
iov_iter_extract_pages                   inc 0x416 -> 0x418 +0x2
iov_iter_gap_alignment                   dcr 0x69 -> 0x67 -0x2
iov_iter_init                            inc 0x27 -> 0x31 +0xa
iov_iter_is_aligned                      dcr 0x104 -> 0xf5 -0xf
iov_iter_npages                          inc 0x119 -> 0x11d +0x4
iov_iter_revert                          dcr 0x93 -> 0x86 -0xd
iov_iter_single_seg_count                dcr 0x41 -> 0x3c -0x5
iov_iter_ubuf                            inc 0x39 -> 0x3a +0x1
iov_iter_zero                            dcr 0x338 -> 0x32e -0xa
memcpy_from_iter_mc                      inc 0x2a -> 0x2b +0x1

I think there may be more savings to be made if I go and convert more of the
functions to using switch().

David
David Laight Aug. 18, 2023, 3:42 p.m. UTC | #19
From: David Howells
> Sent: Friday, August 18, 2023 4:20 PM
> 
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > > Although I'm not sure the bit-fields really help.
> > > There are 8 bytes at the start of the structure, might as well
> > > use them :-)
> >
> > Actuallyç I wrote the patch that way because it seems to improve code
> > generation.
> >
> > The bitfields are generally all set together as just plain one-time
> > constants at initialization time, and gcc sees that it's a full byte
> > write. And the reason 'data_source' is not a bitfield is that it's not
> > a constant at iov_iter init time (it's an argument to all the init
> > functions), so having that one as a separate byte at init time is good
> > for code generation when you don't need to mask bits or anything like
> > that.
> >
> > And once initialized, having things be dense and doing all the
> > compares with a bitwise 'and' instead of doing them as some value
> > compare again tends to generate good code.
> 
> Actually...  I said that switch(enum) seemed to generate suboptimal code...
> However, if the enum is renumbered such that the constants are in the same
> order as in the switch() it generates better code.

Hmmm.. the order of the switch labels really shouldn't matter.

The advantage of the if-chain is that you can optimise for
the most common case.

> So we want this order:
> 
> 	enum iter_type {
> 		ITER_UBUF,
> 		ITER_IOVEC,
> 		ITER_BVEC,
> 		ITER_KVEC,
> 		ITER_XARRAY,
> 		ITER_DISCARD,
> 	};

Will gcc actually code this version without pessimising it?

	if (likely(type <= ITER_IOVEC) {
		if (likely(type != ITER_IOVEC))
			iterate_ubuf();
		else
			iterate_iovec();
	} else if (likely(type) <= ITER_KVEC)) {
		if (type == ITER_KVEC)
			iterate_kvec();
		else
			iterate_bvec();
	} else if (type == ITER_XARRAY) {
		iterate_xarrar()
	} else {
		discard;
	}

But I bet you can't stop it replicating the compares.
(especially with the likely().

That has two mis-predicted (are they ever right!) branches in the
common user-copy versions and three in the common kernel ones.

In some architectures you might get the default 'fall through'
to the UBUF code if the branches aren't predictable.
But I believe current x86 cpu never do static prediction.
So you always lose :-)

...
> 	static inline bool user_backed_iter(const struct iov_iter *i)
> 	{
> 		return iter_is_ubuf(i) || iter_is_iovec(i);
> 	}
> 
> which gcc just changes into something like a "CMP $1" and a "JA".

That makes sense...

> Comparing Linus's bit patch (+ is better) to renumbering the switch (- is
> better):
> 
....
> iov_iter_init                            inc 0x27 -> 0x31 +0xa

Are you hitting the gcc bug that loads the constant from memory?

> I think there may be more savings to be made if I go and convert more of the
> functions to using switch().

Size isn't everything, the code needs to be optimised for the hot paths.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Howells Aug. 18, 2023, 4:48 p.m. UTC | #20
David Laight <David.Laight@ACULAB.COM> wrote:

> > iov_iter_init                            inc 0x27 -> 0x31 +0xa
> 
> Are you hitting the gcc bug that loads the constant from memory?

I'm not sure what that looks like.  For your perusal, here's a disassembly of
the use-switch-on-enum variant:

   0xffffffff8177726c <+0>:     cmp    $0x1,%esi
   0xffffffff8177726f <+3>:     jbe    0xffffffff81777273 <iov_iter_init+7>
   0xffffffff81777271 <+5>:     ud2
   0xffffffff81777273 <+7>:     test   %esi,%esi
   0xffffffff81777275 <+9>:     movw   $0x1,(%rdi)
   0xffffffff8177727a <+14>:    setne  0x3(%rdi)
   0xffffffff8177727e <+18>:    xor    %eax,%eax
   0xffffffff81777280 <+20>:    movb   $0x0,0x2(%rdi)
   0xffffffff81777284 <+24>:    movb   $0x1,0x4(%rdi)
   0xffffffff81777288 <+28>:    mov    %rax,0x8(%rdi)
   0xffffffff8177728c <+32>:    mov    %rdx,0x10(%rdi)
   0xffffffff81777290 <+36>:    mov    %r8,0x18(%rdi)
   0xffffffff81777294 <+40>:    mov    %rcx,0x20(%rdi)
   0xffffffff81777298 <+44>:    jmp    0xffffffff81d728a0 <__x86_return_thunk>

versus the use-bitmap variant:

   0xffffffff81777311 <+0>:     cmp    $0x1,%esi
   0xffffffff81777314 <+3>:     jbe    0xffffffff81777318 <iov_iter_init+7>
   0xffffffff81777316 <+5>:     ud2
   0xffffffff81777318 <+7>:     test   %esi,%esi
   0xffffffff8177731a <+9>:     movb   $0x2,(%rdi)
   0xffffffff8177731d <+12>:    setne  0x1(%rdi)
   0xffffffff81777321 <+16>:    xor    %eax,%eax
   0xffffffff81777323 <+18>:    mov    %rdx,0x10(%rdi)
   0xffffffff81777327 <+22>:    mov    %rax,0x8(%rdi)
   0xffffffff8177732b <+26>:    mov    %r8,0x18(%rdi)
   0xffffffff8177732f <+30>:    mov    %rcx,0x20(%rdi)
   0xffffffff81777333 <+34>:    jmp    0xffffffff81d72960 <__x86_return_thunk>

It seems to be that the former is loading byte constants individually, whereas
Linus combined all those fields into a single byte and eliminated one of them.

David
David Laight Aug. 18, 2023, 9:39 p.m. UTC | #21
From: David Howells
> Sent: Friday, August 18, 2023 5:49 PM
> 
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > > iov_iter_init                            inc 0x27 -> 0x31 +0xa
> >
> > Are you hitting the gcc bug that loads the constant from memory?
> 
> I'm not sure what that looks like.  For your perusal, here's a disassembly of
> the use-switch-on-enum variant:
> 
>    0xffffffff8177726c <+0>:     cmp    $0x1,%esi
>    0xffffffff8177726f <+3>:     jbe    0xffffffff81777273 <iov_iter_init+7>
>    0xffffffff81777271 <+5>:     ud2
>    0xffffffff81777273 <+7>:     test   %esi,%esi
>    0xffffffff81777275 <+9>:     movw   $0x1,(%rdi)
>    0xffffffff8177727a <+14>:    setne  0x3(%rdi)
>    0xffffffff8177727e <+18>:    xor    %eax,%eax
>    0xffffffff81777280 <+20>:    movb   $0x0,0x2(%rdi)
>    0xffffffff81777284 <+24>:    movb   $0x1,0x4(%rdi)
>    0xffffffff81777288 <+28>:    mov    %rax,0x8(%rdi)
>    0xffffffff8177728c <+32>:    mov    %rdx,0x10(%rdi)
>    0xffffffff81777290 <+36>:    mov    %r8,0x18(%rdi)
>    0xffffffff81777294 <+40>:    mov    %rcx,0x20(%rdi)
>    0xffffffff81777298 <+44>:    jmp    0xffffffff81d728a0 <__x86_return_thunk>
> 
> versus the use-bitmap variant:
> 
>    0xffffffff81777311 <+0>:     cmp    $0x1,%esi
>    0xffffffff81777314 <+3>:     jbe    0xffffffff81777318 <iov_iter_init+7>
>    0xffffffff81777316 <+5>:     ud2
>    0xffffffff81777318 <+7>:     test   %esi,%esi
>    0xffffffff8177731a <+9>:     movb   $0x2,(%rdi)
>    0xffffffff8177731d <+12>:    setne  0x1(%rdi)
>    0xffffffff81777321 <+16>:    xor    %eax,%eax
>    0xffffffff81777323 <+18>:    mov    %rdx,0x10(%rdi)
>    0xffffffff81777327 <+22>:    mov    %rax,0x8(%rdi)
>    0xffffffff8177732b <+26>:    mov    %r8,0x18(%rdi)
>    0xffffffff8177732f <+30>:    mov    %rcx,0x20(%rdi)
>    0xffffffff81777333 <+34>:    jmp    0xffffffff81d72960 <__x86_return_thunk>
> 
> It seems to be that the former is loading byte constants individually, whereas
> Linus combined all those fields into a single byte and eliminated one of them.

I think you need to re-order the structure.
The top set writes to bytes 0..4 with:
>    0xffffffff81777275 <+9>:     movw   $0x1,(%rdi)
>    0xffffffff8177727a <+14>:    setne  0x3(%rdi)
>    0xffffffff81777280 <+20>:    movb   $0x0,0x2(%rdi)
>    0xffffffff81777284 <+24>:    movb   $0x1,0x4(%rdi)
Note that the 'setne' writes into the middle of the constants.

The lower writes bytes 0..1 with:
>    0xffffffff8177731a <+9>:     movb   $0x2,(%rdi)
>    0xffffffff8177731d <+12>:    setne  0x1(%rdi)

I think that if you move the 'conditional' value to offset 4
you'll get fewer writes.
Probably a 32bit load into %eax and then a write.

I don't think gcc likes generating 16bit immediates.
In some tests I did it loaded a 32bit value into %eax
and then wrote the low bits.
So the code is much the same (on x86) for 2 or 4 bytes
of constants.
I'm sure you can use the 'data-16' prefix with an immediate.

I'm not sure why you have two non-zero values when Linus
only had one though.

OTOH you don't want to be writing 3 bytes of constants.
Also gcc won't generate:
	movl $0xaabbccdd,%eax
	setne %al   // overwriting the dd
	movl %eax,(%rdi)
and I suspect the partial write (to %al) will be a stall.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
diff mbox series

Patch

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 378da0cb3069..33febccadd9d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -450,14 +450,33 @@  size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
 #endif /* CONFIG_ARCH_HAS_COPY_MC */
 
-static size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
-				  size_t len, void *to, void *priv2)
+static __always_inline
+size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
+			   size_t len, void *to, void *priv2)
 {
-	struct iov_iter *iter = priv2;
+	return copy_mc_to_kernel(to + progress, iter_from, len);
+}
 
-	if (iov_iter_is_copy_mc(iter))
-		return copy_mc_to_kernel(to + progress, iter_from, len);
-	return memcpy_from_iter(iter_from, progress, len, to, priv2);
+static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t progress;
+
+	if (unlikely(i->count < bytes))
+		bytes = i->count;
+	if (unlikely(!bytes))
+		return 0;
+	progress = iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc);
+	i->count -= progress;
+	return progress;
+}
+
+static __always_inline
+size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+	if (unlikely(iov_iter_is_copy_mc(i)))
+		return __copy_from_iter_mc(addr, bytes, i);
+	return iterate_and_advance(i, bytes, addr,
+				   copy_from_user_iter, memcpy_from_iter);
 }
 
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
@@ -467,9 +486,7 @@  size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 
 	if (user_backed_iter(i))
 		might_fault();
-	return iterate_and_advance2(i, bytes, addr, i,
-				    copy_from_user_iter,
-				    memcpy_from_iter_mc);
+	return __copy_from_iter(addr, bytes, i);
 }
 EXPORT_SYMBOL(_copy_from_iter);
 
@@ -690,9 +707,7 @@  size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
 		}
 
 		p = kmap_atomic(page) + offset;
-		n = iterate_and_advance2(i, n, p, i,
-					 copy_from_user_iter,
-					 memcpy_from_iter_mc);
+		__copy_from_iter(p, n, i);
 		kunmap_atomic(p);
 		copied += n;
 		offset += n;