Message ID | 20170530125520.14030-1-chris@chris-wilson.co.uk (mailing list archive)
State      | New, archived
On Tue, 2017-05-30 at 13:55 +0100, Chris Wilson wrote:
> When looking at simple ioctls coupled with conveniently small user
> parameters, the overhead of the syscall and drm_ioctl() presents
> large, low-hanging fruit. Profiling trivial microbenchmarks around
> i915_gem_busy_ioctl, the low-hanging fruit comprises the call to
> copy_user(). Those calls are only inlined by the macro where the
> constant is known at compile time, but the ioctl argument size
> depends on the ioctl number. To help the compiler, explicitly add
> switches for the small sizes that expand to simple moves to/from
> user. Doing the multiple inlines does add significant code bloat, so
> its value is very debatable. Back to the trivial, but frequently
> used, example of i915_gem_busy_ioctl() on a Broadwell: avoiding the
> call gives us a 15-25% improvement:
>
>                before       after
> single         100.173ns    84.496ns
> parallel (x4)  204.275ns    152.957ns
>
> On a baby Broxton NUC:
>
>                before       after
> single         245.355ns    199.477ns
> parallel (x2)  280.892ns    232.726ns
>
> Looking at the cost distribution by moving an equivalent switch into
> arch/x86/lib/usercopy, the overhead of the user copy is split almost
> equally between the function call and the actual copy itself. It
> seems copy_user_enhanced_fast_string simply is not that good at
> small (single register) copies. Something as simple as
>
> @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>
> +	if (len <= 16)
> +		return copy_user_generic_unrolled(to, from, len);
>
> is enough to speed up i915_gem_busy_ioctl() by 10% :|
>
> Note that this overhead may entirely be x86 specific.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

I think this should be integrated into __copy_{to,from}_user directly,
but in the meantime the code is:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas
On Tue, May 30, 2017 at 01:55:20PM +0100, Chris Wilson wrote:
> When looking at simple ioctls coupled with conveniently small user
> parameters, the overhead of the syscall and drm_ioctl() presents
> large, low-hanging fruit. Profiling trivial microbenchmarks around
> i915_gem_busy_ioctl, the low-hanging fruit comprises the call to
> copy_user(). Those calls are only inlined by the macro where the
> constant is known at compile time, but the ioctl argument size
> depends on the ioctl number. To help the compiler, explicitly add
> switches for the small sizes that expand to simple moves to/from
> user. Doing the multiple inlines does add significant code bloat, so
> its value is very debatable. Back to the trivial, but frequently
> used, example of i915_gem_busy_ioctl() on a Broadwell: avoiding the
> call gives us a 15-25% improvement:
>
>                before       after
> single         100.173ns    84.496ns
> parallel (x4)  204.275ns    152.957ns
>
> On a baby Broxton NUC:
>
>                before       after
> single         245.355ns    199.477ns
> parallel (x2)  280.892ns    232.726ns
>
> Looking at the cost distribution by moving an equivalent switch into
> arch/x86/lib/usercopy, the overhead of the user copy is split almost
> equally between the function call and the actual copy itself. It
> seems copy_user_enhanced_fast_string simply is not that good at
> small (single register) copies. Something as simple as
>
> @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>
> +	if (len <= 16)
> +		return copy_user_generic_unrolled(to, from, len);
>
> is enough to speed up i915_gem_busy_ioctl() by 10% :|
>
> Note that this overhead may entirely be x86 specific.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
> +	if (in_size) {
> +		if (unlikely(!access_ok(VERIFY_READ, arg, in_size)))
> +			goto err_invalid_user;
> +
> +		switch (in_size) {
> +		case 4:
> +			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +						      4)))
> +				goto err_invalid_user;
> +			break;
> +		case 8:
> +			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +						      8)))
> +				goto err_invalid_user;
> +			break;
> +		case 16:
> +			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +						      16)))
> +				goto err_invalid_user;
> +			break;

For example, currently x86-32 only converts case 4 above. It could
trivially do case 8 as well, but by case 16 it may as well use the
function call to its loop in assembly. And currently x86-32 has no
optimisations for fixed-size puts.
-Chris
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index 865e3ee4d743..93ba59a30a85 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -715,11 +715,11 @@ long drm_ioctl(struct file *filp,
 	const struct drm_ioctl_desc *ioctl = NULL;
 	drm_ioctl_t *func;
 	unsigned int nr = DRM_IOCTL_NR(cmd);
-	int retcode = -EINVAL;
 	char stack_kdata[128];
-	char *kdata = NULL;
+	char *kdata = stack_kdata;
 	unsigned int in_size, out_size, drv_size, ksize;
 	bool is_driver_ioctl;
+	int retcode;
 
 	dev = file_priv->minor->dev;
 
@@ -731,12 +731,12 @@ long drm_ioctl(struct file *filp,
 	if (is_driver_ioctl) {
 		/* driver ioctl */
 		if (nr - DRM_COMMAND_BASE >= dev->driver->num_ioctls)
-			goto err_i1;
+			goto err_invalid_ioctl;
 		ioctl = &dev->driver->ioctls[nr - DRM_COMMAND_BASE];
 	} else {
 		/* core ioctl */
 		if (nr >= DRM_CORE_IOCTL_COUNT)
-			goto err_i1;
+			goto err_invalid_ioctl;
 		ioctl = &drm_ioctls[nr];
 	}
 
@@ -758,29 +758,50 @@ long drm_ioctl(struct file *filp,
 
 	if (unlikely(!func)) {
 		DRM_DEBUG("no function\n");
-		retcode = -EINVAL;
-		goto err_i1;
+		goto err_invalid;
 	}
 
 	retcode = drm_ioctl_permit(ioctl->flags, file_priv);
 	if (unlikely(retcode))
-		goto err_i1;
-
-	if (ksize <= sizeof(stack_kdata)) {
-		kdata = stack_kdata;
-	} else {
-		kdata = kmalloc(ksize, GFP_KERNEL);
-		if (!kdata) {
-			retcode = -ENOMEM;
-			goto err_i1;
+		goto out;
+
+	if (in_size) {
+		if (unlikely(!access_ok(VERIFY_READ, arg, in_size)))
+			goto err_invalid_user;
+
+		switch (in_size) {
+		case 4:
+			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
+						      4)))
+				goto err_invalid_user;
+			break;
+		case 8:
+			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
+						      8)))
+				goto err_invalid_user;
+			break;
+		case 16:
+			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
+						      16)))
+				goto err_invalid_user;
+			break;
+
+		default:
+			if (ksize > sizeof(stack_kdata)) {
+				kdata = kmalloc(ksize, GFP_KERNEL);
+				if (unlikely(!kdata)) {
+					retcode = -ENOMEM;
+					goto out;
+				}
+			}
+
+			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
+						      in_size)))
+				goto err_invalid_user;
+			break;
 		}
 	}
 
-	if (copy_from_user(kdata, (void __user *)arg, in_size) != 0) {
-		retcode = -EFAULT;
-		goto err_i1;
-	}
-
 	if (ksize > in_size)
 		memset(kdata + in_size, 0, ksize - in_size);
 
@@ -794,21 +815,53 @@ long drm_ioctl(struct file *filp,
 		mutex_unlock(&drm_global_mutex);
 	}
 
-	if (copy_to_user((void __user *)arg, kdata, out_size) != 0)
-		retcode = -EFAULT;
-
-      err_i1:
-	if (!ioctl)
-		DRM_DEBUG("invalid ioctl: pid=%d, dev=0x%lx, auth=%d, cmd=0x%02x, nr=0x%02x\n",
-			  task_pid_nr(current),
-			  (long)old_encode_dev(file_priv->minor->kdev->devt),
-			  file_priv->authenticated, cmd, nr);
+	if (out_size) {
+		if (unlikely(!access_ok(VERIFY_WRITE, arg, out_size)))
+			goto err_invalid_user;
+
+		switch (out_size) {
+		case 4:
+			if (unlikely(__copy_to_user((void __user *)arg,
+						    kdata, 4)))
+				goto err_invalid_user;
+			break;
+		case 8:
+			if (unlikely(__copy_to_user((void __user *)arg,
+						    kdata, 8)))
+				goto err_invalid_user;
+			break;
+		case 16:
+			if (unlikely(__copy_to_user((void __user *)arg,
+						    kdata, 16)))
+				goto err_invalid_user;
+			break;
+		default:
+			if (unlikely(__copy_to_user((void __user *)arg,
+						    kdata, out_size)))
+				goto err_invalid_user;
+			break;
+		}
+	}
 
+out:
 	if (kdata != stack_kdata)
 		kfree(kdata);
 	if (retcode)
 		DRM_DEBUG("ret = %d\n", retcode);
 	return retcode;
+
+err_invalid_ioctl:
+	DRM_DEBUG("invalid ioctl: pid=%d, dev=0x%lx, auth=%d, cmd=0x%02x, nr=0x%02x\n",
+		  task_pid_nr(current),
+		  (long)old_encode_dev(file_priv->minor->kdev->devt),
+		  file_priv->authenticated, cmd, nr);
+err_invalid:
+	retcode = -EINVAL;
+	goto out;
+
+err_invalid_user:
+	retcode = -EFAULT;
+	goto out;
 }
 EXPORT_SYMBOL(drm_ioctl);
When looking at simple ioctls coupled with conveniently small user parameters, the overhead of the syscall and drm_ioctl() presents large, low-hanging fruit. Profiling trivial microbenchmarks around i915_gem_busy_ioctl, the low-hanging fruit comprises the call to copy_user(). Those calls are only inlined by the macro where the constant is known at compile time, but the ioctl argument size depends on the ioctl number. To help the compiler, explicitly add switches for the small sizes that expand to simple moves to/from user. Doing the multiple inlines does add significant code bloat, so its value is very debatable. Back to the trivial, but frequently used, example of i915_gem_busy_ioctl() on a Broadwell: avoiding the call gives us a 15-25% improvement:

                   before       after
    single         100.173ns    84.496ns
    parallel (x4)  204.275ns    152.957ns

On a baby Broxton NUC:

                   before       after
    single         245.355ns    199.477ns
    parallel (x2)  280.892ns    232.726ns

Looking at the cost distribution by moving an equivalent switch into arch/x86/lib/usercopy, the overhead of the user copy is split almost equally between the function call and the actual copy itself. It seems copy_user_enhanced_fast_string simply is not that good at small (single register) copies. Something as simple as

    @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
     {
     	unsigned ret;
    
    +	if (len <= 16)
    +		return copy_user_generic_unrolled(to, from, len);

is enough to speed up i915_gem_busy_ioctl() by 10% :|

Note that this overhead may entirely be x86 specific.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/drm_ioctl.c | 111 ++++++++++++++++++++++++++++++++------------
 1 file changed, 82 insertions(+), 29 deletions(-)