Message ID | 1308094989-3153-1-git-send-email-eric@anholt.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, 14 Jun 2011 16:43:09 -0700, Eric Anholt <eric@anholt.net> wrote: > This reverts commit 4a684a4117abd756291969336af454e8a958802f. > Userland has always been required to set the object's domain to GTT > before using it through a GTT mapping, it's not something that the > kernel is supposed to enforce. (The pagefault support is so that we > can handle multiple mappings without userland having to pin across > them, not so that userland can use GTT after GPU domains without > telling the kernel). The only place that can do accurate per-page tracking of domains is the kernel. (And avoid clflushes for GTT eviction. Ok, so we're not quite there yet.) Relying on userspace to get it right, and for co-operating processes to coordinate their access to buffers is a misfeature. > Fixes 19.2% +/- 0.8% (n=6) performance regression in cairo-gl > firefox-talos-gfx on my T420 latop. Since GTT mappings are fundamentally slow, don't you think this suggests that mesa is doing something wrong? Or a bug in cairo-gl to make mesa act this way? For the record, on my SugarBay, this patch on top of drm-intel-fixes and mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a system that can do under 2s for cairo-xlib using vmap and the perf patches I posted before.] -Chris
On Wed, 15 Jun 2011 10:39:06 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Tue, 14 Jun 2011 16:43:09 -0700, Eric Anholt <eric@anholt.net> wrote: > > This reverts commit 4a684a4117abd756291969336af454e8a958802f. > > Userland has always been required to set the object's domain to GTT > > before using it through a GTT mapping, it's not something that the > > kernel is supposed to enforce. (The pagefault support is so that we > > can handle multiple mappings without userland having to pin across > > them, not so that userland can use GTT after GPU domains without > > telling the kernel). > > The only place that can do accurate per-page tracking of domains is the > kernel. (And avoid clflushes for GTT eviction. Ok, so we're not quite > there yet.) Relying on userspace to get it right, and for co-operating > processes to coordinate their access to buffers is a misfeature. We rely on userspace to get it right all the time, which was a design decision made early on for performance reasons. For example, userspace can submit the wrong domains with an exec call, and non-GTT mappings also behave like I'm changing GTT back to. Unless you quietly changed that, too? > > Fixes 19.2% +/- 0.8% (n=6) performance regression in cairo-gl > > firefox-talos-gfx on my T420 latop. > > Since GTT mappings are fundamentally slow, don't you think this suggests > that mesa is doing something wrong? Or a bug in cairo-gl to make mesa act > this way? Page table changes are expensive, yes. That's why we cache BOs in userspace and hang onto the mappings of them. We only moved to GTT mappings in Mesa in places where it was faster than the alternatives (or for tiled buffers, where due to the a17 swizzling there is no other option). > For the record, on my SugarBay, this patch on top of drm-intel-fixes and > mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s > (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s > (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a > system that can do under 2s for cairo-xlib using vmap and the perf > patches I posted before.] > -Chris I was testing with semaphores enabled on the LLC series. I'm curious how this revert could possibly regress cairo-xlib, though -- that seems very strange.
On Wed, 15 Jun 2011 08:08:59 -0700, Eric Anholt <eric@anholt.net> wrote: > On Wed, 15 Jun 2011 10:39:06 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > For the record, on my SugarBay, this patch on top of drm-intel-fixes and > > mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s > > (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s > > (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a > > system that can do under 2s for cairo-xlib using vmap and the perf > > patches I posted before.] > > -Chris > > I was testing with semaphores enabled on the LLC series. I'm curious > how this revert could possibly regress cairo-xlib, though -- that seems > very strange. Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx: xlib: 4.473 gl: 20.753 applying the patch: xlib: 4.472 gl: 20.824 I'm just not reproducing the same issue you are seeing. Are you using a standard distro Kconfig, or if not, can you send me yours? -Chris
On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Wed, 15 Jun 2011 08:08:59 -0700, Eric Anholt <eric@anholt.net> wrote: > > On Wed, 15 Jun 2011 10:39:06 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > For the record, on my SugarBay, this patch on top of drm-intel-fixes and > > > mesa.git + your instruction streaming, regresses cairo-xlib 4.5s to 4.7s > > > (4%, with a claim of .1% stddev) and improves cairo-gl from 21.5s to 20.1s > > > (6.5%, with a claim of .5% stddev) for the firefox-talos-gfx. [This is a > > > system that can do under 2s for cairo-xlib using vmap and the perf > > > patches I posted before.] > > > -Chris > > > > I was testing with semaphores enabled on the LLC series. I'm curious > > how this revert could possibly regress cairo-xlib, though -- that seems > > very strange. > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx: > xlib: 4.473 > gl: 20.753 > > applying the patch: > xlib: 4.472 > gl: 20.824 > > I'm just not reproducing the same issue you are seeing. Are you using a > standard distro Kconfig, or if not, can you send me yours? > -Chris http://people.freedesktop.org/~anholt/dotconfig Pushed my kernel tree to "gtt-revert" branch. Fresh set of numbers on two fresh boots, just about as identical testing environments as I could produce: (without revert) anholt@eliezer:src/cairo/perf% cat /home/anholt/cairogl-before 21791099815 21677090234 22060471069 22049244399 21721362892 21966143347 (with revert) anholt@eliezer:src/cairo/perf% cat /home/anholt/cairogl-after 19162606198 19127697366 19462253708 19172535715 19339999910 19221539215 anholt@eliezer:src/cairo/perf% ministat ~/cairogl-before ~/cairogl-after x /home/anholt/cairogl-before + /home/anholt/cairogl-after +------------------------------------------------------------------------------+ | + x| |++ + + + xx x x x| ||__A___| |___A__M_|| +------------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 6 2.167709e+10 2.2060471e+10 2.1966143e+10 2.1877569e+10 1.6902073e+08 + 6 1.9127697e+10 1.9462254e+10 1.9221539e+10 1.9247772e+10 1.2847426e+08 Difference at 95.0% confidence -2.6298e+09 +/- 1.93108e+08 -12.0205% +/- 0.882677% (Student's t, pooled s = 1.50123e+08)
On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote: > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx: > > xlib: 4.473 > > gl: 20.753 > > > > applying the patch: > > xlib: 4.472 > > gl: 20.824 > > > > I'm just not reproducing the same issue you are seeing. Are you using a > > standard distro Kconfig, or if not, can you send me yours? > > -Chris > > http://people.freedesktop.org/~anholt/dotconfig > > Pushed my kernel tree to "gtt-revert" branch. Thanks, I'm testing with those on my SNB desktop, still only see around a 5% hit for firefox-talos-gfx. I still think this is optimising for bad behaviour of the client, and papering over two slow linear lookups in the kernel. [One for find_vma which is exacerbated by the cached vma on the bo and the other checking whether the GTT aperture is RAM.] Looking at what cairo-gl is doing, the root cause of why we are repeatedly rewriting textures is: commit 3b1c0a4bd66660780095e6016e3db451f34503a3 Author: Benjamin Otte <otte@redhat.com> Date: Fri May 14 15:56:17 2010 +0200 fallback: Remove span renderer paths Those paths were broken, as they didn't properly translate the polygon to the destination size. And rather than adding lots of code that allows translation, it's easier to just delete this code. Note that the only user of the code was the GL backend anyway. i.e. cairo-gl is no longer using spans for firefox-talos-gfx where unaligned clipping dominates. -Chris
On Sat, 18 Jun 2011 12:43:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote: > > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx: > > > xlib: 4.473 > > > gl: 20.753 > > > > > > applying the patch: > > > xlib: 4.472 > > > gl: 20.824 > > > > > > I'm just not reproducing the same issue you are seeing. Are you using a > > > standard distro Kconfig, or if not, can you send me yours? > > > -Chris > > > > http://people.freedesktop.org/~anholt/dotconfig > > > > Pushed my kernel tree to "gtt-revert" branch. > > Thanks, I'm testing with those on my SNB desktop, still only see around > a 5% hit for firefox-talos-gfx. GTT mappings are important. They're how textures are uploaded in GL in general. One of the longstanding things I've wanted to do for GL applications that repeatedly allocate new texture images is to userland BO cache those objects, which we don't do because of tiling. With the patch I'm reverting in place, there's basically no reason to do so because remapping the BO is much of the cost of recreating the BO from scratch. Remapping is the reason we do userland caching instead of kernel-side caching.
On Sat, 18 Jun 2011 13:20:05 -0700, Eric Anholt <eric@anholt.net> wrote: > On Sat, 18 Jun 2011 12:43:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote: > > > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx: > > > > xlib: 4.473 > > > > gl: 20.753 > > > > > > > > applying the patch: > > > > xlib: 4.472 > > > > gl: 20.824 > > > > > > > > I'm just not reproducing the same issue you are seeing. Are you using a > > > > standard distro Kconfig, or if not, can you send me yours? > > > > -Chris > > > > > > http://people.freedesktop.org/~anholt/dotconfig > > > > > > Pushed my kernel tree to "gtt-revert" branch. > > > > Thanks, I'm testing with those on my SNB desktop, still only see around > > a 5% hit for firefox-talos-gfx. > > GTT mappings are important. They're how textures are uploaded in GL in > general. For lack of a better mechanism. Even using anholt/gtt-revert, I question the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and pts benchmarks I've run, the efficacy of the cached vma is very small and there is a very slight improvement by unmapping the vma after use. (The difference is so small, that it will take a lot more runs to determine if it is statistically significant.) > One of the longstanding things I've wanted to do for GL > applications that repeatedly allocate new texture images is to userland > BO cache those objects, which we don't do because of tiling. I'm failing to see the difference between this cache and a slightly smarter (i.e. one that preferentially returns an object with appropriate tiling and busy status) libdrm_intel cache. > With the > patch I'm reverting in place, there's basically no reason to do so > because remapping the BO is much of the cost of recreating the BO from > scratch. Remapping is the reason we do userland caching instead of > kernel-side caching. Conversely, it is the lack of such a kernel bo and page cache that is determinental to the ddx (at least on pre-SNB). And for any system running more than one renderer. [Except for the thorny issue of whether having to clear the pages returned from such a cache will nullify its benefits. I have yet to measure the impact of doing so.] -Chris
On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Sat, 18 Jun 2011 13:20:05 -0700, Eric Anholt <eric@anholt.net> wrote: > > On Sat, 18 Jun 2011 12:43:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > On Fri, 17 Jun 2011 12:06:54 -0700, Eric Anholt <eric@anholt.net> wrote: > > > > On Wed, 15 Jun 2011 17:03:58 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > > > Moving back to LLC+semaphores (2.6.39-rc2+), firefox-talos-gfx: > > > > > xlib: 4.473 > > > > > gl: 20.753 > > > > > > > > > > applying the patch: > > > > > xlib: 4.472 > > > > > gl: 20.824 > > > > > > > > > > I'm just not reproducing the same issue you are seeing. Are you using a > > > > > standard distro Kconfig, or if not, can you send me yours? > > > > > -Chris > > > > > > > > http://people.freedesktop.org/~anholt/dotconfig > > > > > > > > Pushed my kernel tree to "gtt-revert" branch. > > > > > > Thanks, I'm testing with those on my SNB desktop, still only see around > > > a 5% hit for firefox-talos-gfx. > > > > GTT mappings are important. They're how textures are uploaded in GL in > > general. > > For lack of a better mechanism. Even using anholt/gtt-revert, I question > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and > pts benchmarks I've run, the efficacy of the cached vma is very small and > there is a very slight improvement by unmapping the vma after use. (The > difference is so small, that it will take a lot more runs to determine if > it is statistically significant.) I'm confused. You've measured a 5% impact from removing this part of the caching, and I've measured 12-19%, so what are you planning that doesn't involve caching the mapping that's faster than caching the mapping? > > One of the longstanding things I've wanted to do for GL > > applications that repeatedly allocate new texture images is to userland > > BO cache those objects, which we don't do because of tiling. > > I'm failing to see the difference between this cache and a slightly > smarter (i.e. one that preferentially returns an object with appropriate > tiling and busy status) libdrm_intel cache. I was talking about it being in libdrm_intel, sorry if that was unclear.
On Sun, 19 Jun 2011 10:01:23 -0700, Eric Anholt <eric@anholt.net> wrote: > On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > For lack of a better mechanism. Even using anholt/gtt-revert, I question > > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and > > pts benchmarks I've run, the efficacy of the cached vma is very small and > > there is a very slight improvement by unmapping the vma after use. (The > > difference is so small, that it will take a lot more runs to determine if > > it is statistically significant.) > > I'm confused. You've measured a 5% impact from removing this part of > the caching, and I've measured 12-19%, so what are you planning that > doesn't involve caching the mapping that's faster than caching the > mapping? As I pointed out, cairo-gl is using fallback code and not spans. Once that is corrected, cairo-gl no longer continually recreating textures and so unaffected by the patch. Removing the cached vma is then arguably beneficial, though the difference looks to be in the noise. -Chris
On Sun, 19 Jun 2011 18:14:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Sun, 19 Jun 2011 10:01:23 -0700, Eric Anholt <eric@anholt.net> wrote: > > On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > For lack of a better mechanism. Even using anholt/gtt-revert, I question > > > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and > > > pts benchmarks I've run, the efficacy of the cached vma is very small and > > > there is a very slight improvement by unmapping the vma after use. (The > > > difference is so small, that it will take a lot more runs to determine if > > > it is statistically significant.) > > > > I'm confused. You've measured a 5% impact from removing this part of > > the caching, and I've measured 12-19%, so what are you planning that > > doesn't involve caching the mapping that's faster than caching the > > mapping? > > As I pointed out, cairo-gl is using fallback code and not spans. Once that > is corrected, cairo-gl no longer continually recreating textures and so > unaffected by the patch. Removing the cached vma is then arguably > beneficial, though the difference looks to be in the noise. So you don't actually want us to fix the performance regression in texture uploads, and instead want to just tell applications not to upload textures? What about applications where we're not the authors, and where they don't have a choice but to upload textures (media players)?
On Sun, 19 Jun 2011 14:20:14 -0700, Eric Anholt <eric@anholt.net> wrote: > On Sun, 19 Jun 2011 18:14:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > On Sun, 19 Jun 2011 10:01:23 -0700, Eric Anholt <eric@anholt.net> wrote: > > > On Sun, 19 Jun 2011 17:28:11 +0100, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > > For lack of a better mechanism. Even using anholt/gtt-revert, I question > > > > the value of caching the GTT mapping in drm_intel_bo. For the cairo-gl and > > > > pts benchmarks I've run, the efficacy of the cached vma is very small and > > > > there is a very slight improvement by unmapping the vma after use. (The > > > > difference is so small, that it will take a lot more runs to determine if > > > > it is statistically significant.) > > > > > > I'm confused. You've measured a 5% impact from removing this part of > > > the caching, and I've measured 12-19%, so what are you planning that > > > doesn't involve caching the mapping that's faster than caching the > > > mapping? > > > > As I pointed out, cairo-gl is using fallback code and not spans. Once that > > is corrected, cairo-gl no longer continually recreating textures and so > > unaffected by the patch. Removing the cached vma is then arguably > > beneficial, though the difference looks to be in the noise. > > So you don't actually want us to fix the performance regression in > texture uploads, and instead want to just tell applications not to > upload textures? Nowhere, did I say that. I just said you were optimising for broken code and papering over bugs and lack of functionality elsewhere. The patch closes a race condition. The essence of your complaint is that the kernel is not as fast as we need it to be, and that the initial upload to any object is slower than expected. I presume you also have a plan for fixing the underwhelming performance of a glibc memcpy through an established GTT mapping? > What about applications where we're not the authors, and where they > don't have a choice but to upload textures (media players)? No, I'm not necessarily saying application code is broken, just the library. -Chris
Poking around in x86,pat I found one way of avoiding the linear walk for pat_pagerange_is_ram() which appears to nullify the regression. I would appreciate confirmation and some review before I go poking dragons. -Chris
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index d231729..7bbd355 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -1218,11 +1218,11 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) ret = i915_gem_object_bind_to_gtt(obj, 0, true); if (ret) goto unlock; - } - ret = i915_gem_object_set_to_gtt_domain(obj, write); - if (ret) - goto unlock; + ret = i915_gem_object_set_to_gtt_domain(obj, write); + if (ret) + goto unlock; + } if (obj->tiling_mode == I915_TILING_NONE) ret = i915_gem_object_put_fence(obj); @@ -2953,8 +2953,6 @@ i915_gem_object_flush_gtt_write_domain(struct drm_i915_gem_object *obj) */ wmb(); - i915_gem_release_mmap(obj); - old_write_domain = obj->base.write_domain; obj->base.write_domain = 0; diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c index 20a4cc5..4934cf8 100644 --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c @@ -187,10 +187,6 @@ i915_gem_object_set_to_gpu_domain(struct drm_i915_gem_object *obj, if ((flush_domains | invalidate_domains) & I915_GEM_DOMAIN_CPU) i915_gem_clflush_object(obj); - /* blow away mappings if mapped through GTT */ - if ((flush_domains | invalidate_domains) & I915_GEM_DOMAIN_GTT) - i915_gem_release_mmap(obj); - if (obj->base.pending_write_domain) cd->flips |= atomic_read(&obj->pending_flip);