Message ID | 1470238607-34415-2-git-send-email-david.s.gordon@intel.com (mailing list archive)
---|---
State | New, archived
On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
> The parallel execution test in gem_exec_nop chooses a pessimal
> distribution of work to multiple engines; specifically, it
> round-robins one batch to each engine in turn. As the workloads
> are trivial (NOPs), this results in each engine becoming idle
> between batches. Hence parallel submission is seen to take LONGER
> than the same number of batches executed sequentially.
>
> If on the other hand we send enough work to each engine to keep
> it busy until the next time we add to its queue (i.e. round-robin
> some larger number of batches to each engine in turn), then we can
> get true parallel execution and should find that it is FASTER than
> sequential execution.
>
> By experiment, burst sizes of between 8 and 256 are sufficient to
> keep multiple engines loaded, with the optimum (for this trivial
> workload) being around 64. This is expected to be lower (possibly
> as low as one) for more realistic (heavier) workloads.

Quite funny. The driver submission overhead of A...A vs ABAB... engines
is nearly identical, at least as far as the analysis presented here.
-Chris
On 03/08/16 16:45, Chris Wilson wrote:
> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>> The parallel execution test in gem_exec_nop chooses a pessimal
>> distribution of work to multiple engines; specifically, it
>> round-robins one batch to each engine in turn. As the workloads
>> are trivial (NOPs), this results in each engine becoming idle
>> between batches. Hence parallel submission is seen to take LONGER
>> than the same number of batches executed sequentially.
>>
>> If on the other hand we send enough work to each engine to keep
>> it busy until the next time we add to its queue (i.e. round-robin
>> some larger number of batches to each engine in turn), then we can
>> get true parallel execution and should find that it is FASTER than
>> sequential execution.
>>
>> By experiment, burst sizes of between 8 and 256 are sufficient to
>> keep multiple engines loaded, with the optimum (for this trivial
>> workload) being around 64. This is expected to be lower (possibly
>> as low as one) for more realistic (heavier) workloads.
>
> Quite funny. The driver submission overhead of A...A vs ABAB... engines
> is nearly identical, at least as far as the analysis presented here.
> -Chris

Correct; but because the workloads are so trivial, if we hand out jobs
one at a time to each engine, the first will have finished the one
batch it's been given before we get round to giving it a second one
(even in execlist mode). If there are N engines, submitting a single
batch takes S seconds, and the workload takes W seconds to execute,
then if W < N*S the engine will be idle between batches. For example,
if N is 4, W is 2us, and S is 1us, then the engine will be idle some
50% of the time.

This wouldn't be an issue for more realistic workloads, where W >> S.
It only looks problematic because of the trivial nature of the work.

.Dave.
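[Editorial note: to make the arithmetic in the reply above concrete, here is a
minimal standalone sketch, plain C with no IGT or driver dependencies; the
values of N, S and W are just the illustrative numbers quoted above, not
measurements.]

#include <stdio.h>

int main(void)
{
        double N = 4.0;  /* number of engines being round-robined */
        double S = 1.0;  /* time to submit one batch, in microseconds */
        double W = 2.0;  /* time the GPU needs to execute one batch, in us */

        /* An engine receives a new batch only once every N*S microseconds,
         * but has just W microseconds of work to do; if W < N*S it sits
         * idle for the rest of each round-robin pass.
         */
        double period = N * S;
        double idle = (W < period) ? (period - W) / period : 0.0;

        printf("idle fraction: %.0f%%\n", idle * 100.0);  /* prints 50% */
        return 0;
}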
On 03/08/2016 17:05, Dave Gordon wrote:
> On 03/08/16 16:45, Chris Wilson wrote:
>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>> distribution of work to multiple engines; specifically, it
>>> round-robins one batch to each engine in turn. As the workloads
>>> are trivial (NOPs), this results in each engine becoming idle
>>> between batches. Hence parallel submission is seen to take LONGER
>>> than the same number of batches executed sequentially.
>>>
>>> If on the other hand we send enough work to each engine to keep
>>> it busy until the next time we add to its queue (i.e. round-robin
>>> some larger number of batches to each engine in turn), then we can
>>> get true parallel execution and should find that it is FASTER than
>>> sequential execution.
>>>
>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>> keep multiple engines loaded, with the optimum (for this trivial
>>> workload) being around 64. This is expected to be lower (possibly
>>> as low as one) for more realistic (heavier) workloads.
>>
>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>> is nearly identical, at least as far as the analysis presented here.
>> -Chris
>
> Correct; but because the workloads are so trivial, if we hand out jobs
> one at a time to each engine, the first will have finished the one
> batch it's been given before we get round to giving it a second one
> (even in execlist mode). If there are N engines, submitting a single
> batch takes S seconds, and the workload takes W seconds to execute,
> then if W < N*S the engine will be idle between batches. For example,
> if N is 4, W is 2us, and S is 1us, then the engine will be idle some
> 50% of the time.
>
> This wouldn't be an issue for more realistic workloads, where W >> S.
> It only looks problematic because of the trivial nature of the work.

Can you post the numbers that you get?

I seem to get massive variability on my BDW. The render ring always
gives me around 2.9us/batch, but the other rings sometimes give me
somewhere in the region of 1.2us and sometimes 7-8us.

> .Dave.
On 18/08/16 13:01, John Harrison wrote:
> On 03/08/2016 17:05, Dave Gordon wrote:
>> On 03/08/16 16:45, Chris Wilson wrote:
>>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>>> distribution of work to multiple engines; specifically, it
>>>> round-robins one batch to each engine in turn. As the workloads
>>>> are trivial (NOPs), this results in each engine becoming idle
>>>> between batches. Hence parallel submission is seen to take LONGER
>>>> than the same number of batches executed sequentially.
>>>>
>>>> If on the other hand we send enough work to each engine to keep
>>>> it busy until the next time we add to its queue (i.e. round-robin
>>>> some larger number of batches to each engine in turn), then we can
>>>> get true parallel execution and should find that it is FASTER than
>>>> sequential execution.
>>>>
>>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>>> keep multiple engines loaded, with the optimum (for this trivial
>>>> workload) being around 64. This is expected to be lower (possibly
>>>> as low as one) for more realistic (heavier) workloads.
>>>
>>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>>> is nearly identical, at least as far as the analysis presented here.
>>> -Chris
>>
>> Correct; but because the workloads are so trivial, if we hand out jobs
>> one at a time to each engine, the first will have finished the one
>> batch it's been given before we get round to giving it a second one
>> (even in execlist mode). If there are N engines, submitting a single
>> batch takes S seconds, and the workload takes W seconds to execute,
>> then if W < N*S the engine will be idle between batches. For example,
>> if N is 4, W is 2us, and S is 1us, then the engine will be idle some
>> 50% of the time.
>>
>> This wouldn't be an issue for more realistic workloads, where W >> S.
>> It only looks problematic because of the trivial nature of the work.
>
> Can you post the numbers that you get?
>
> I seem to get massive variability on my BDW. The render ring always
> gives me around 2.9us/batch, but the other rings sometimes give me
> somewhere in the region of 1.2us and sometimes 7-8us.

skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
Using GuC submission
render: 594,944 cycles: 3.366us/batch
bsd: 737,280 cycles: 2.715us/batch
blt: 833,536 cycles: 2.400us/batch
vebox: 710,656 cycles: 2.818us/batch
Slowest engine was render, 3.366us/batch
Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch, overlap 90.1%
Subtest basic: SUCCESS (18.013s)

These are the results of running the modified test on SKL with GuC
submission.

If the GPU could execute a trivial batch in less time than it takes the
CPU to submit one, then CPU/driver/GuC performance would become the
determining factor -- every batch would be completed before the next one
was submitted to the GPU, even when they're going to the same engine.

If the GPU takes longer to execute a batch than N times the time taken
for the driver to submit it (where N is the number of engines), then the
GPU performance would become the limiting factor; the CPU would be able
to hand out one batch to each engine, and by the time it returned to the
first, that engine would still not be idle.

But in crossover territory, where the batch takes longer to execute than
the time to submit it, but less than N times as long, the round-robin
burst size (the number of batches sent to each engine before moving to
the next) can make a big difference, primarily because the submission
mechanism gets the opportunity to use dual submission and/or lite
restore, effectively reducing the number of separate writes to the ELSP
and hence the s/w overhead per batch.

Note that SKL GuC firmware 6.1 didn't support dual submission or lite
restore, whereas the next version (8.11) does. Therefore, with the newer
firmware we don't see the same slowdown when going to 1-at-a-time
round-robin. I have a different (new) test that shows this more clearly.

.Dave.
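[Editorial note: the effect of the burst size on the submission pattern itself
can be visualised with the trivial standalone program below. It is illustrative
only; it models nothing about the driver, the GuC or the ELSP, just the order
in which batches are handed out to engines A-D for two burst sizes.]

#include <stdio.h>

/* Print which engine each of the first 'total' batches is handed to,
 * for a given round-robin burst size.  With burst == 1 consecutive
 * submissions always hit different engines, so a trivial batch is done
 * before its engine is revisited; with a larger burst the same engine
 * receives several batches back to back, giving the submission path a
 * chance to coalesce them.
 */
static void show_order(int nengine, int burst, int total)
{
        printf("burst=%d:", burst);
        for (int i = 0; i < total; i++)
                printf(" %c", 'A' + (i / burst) % nengine);
        printf("\n");
}

int main(void)
{
        show_order(4, 1, 16);   /* A B C D A B C D ... */
        show_order(4, 4, 16);   /* A A A A B B B B ... */
        return 0;
}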
On 18/08/16 16:27, Dave Gordon wrote:

[snip]

> Note that SKL GuC firmware 6.1 didn't support dual submission or lite
> restore, whereas the next version (8.11) does. Therefore, with the newer
> firmware we don't see the same slowdown when going to 1-at-a-time
> round-robin. I have a different (new) test that shows this more clearly.

This is with GuC version 6.1:

skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch: 3.428µs (ring=render)
Time to exec 8-byte batch: 2.444µs (ring=bsd)
Time to exec 8-byte batch: 2.394µs (ring=blt)
Time to exec 8-byte batch: 2.615µs (ring=vebox)
Time to exec 8-byte batch: 2.625µs (ring=all, sequential)
Time to exec 8-byte batch: 12.701µs (ring=all, parallel/1) ***
Time to exec 8-byte batch: 7.259µs (ring=all, parallel/2)
Time to exec 8-byte batch: 4.336µs (ring=all, parallel/4)
Time to exec 8-byte batch: 2.937µs (ring=all, parallel/8)
Time to exec 8-byte batch: 2.661µs (ring=all, parallel/16)
Time to exec 8-byte batch: 2.245µs (ring=all, parallel/32)
Time to exec 8-byte batch: 1.626µs (ring=all, parallel/64)
Time to exec 8-byte batch: 2.170µs (ring=all, parallel/128)
Time to exec 8-byte batch: 1.804µs (ring=all, parallel/256)
Time to exec 8-byte batch: 2.602µs (ring=all, parallel/512)
Time to exec 8-byte batch: 2.602µs (ring=all, parallel/1024)
Time to exec 8-byte batch: 2.607µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch: 14.835µs (ring=render)
Time to exec 4Kbyte batch: 11.787µs (ring=bsd)
Time to exec 4Kbyte batch: 11.533µs (ring=blt)
Time to exec 4Kbyte batch: 11.991µs (ring=vebox)
Time to exec 4Kbyte batch: 12.444µs (ring=all, sequential)
Time to exec 4Kbyte batch: 16.211µs (ring=all, parallel/1)
Time to exec 4Kbyte batch: 13.943µs (ring=all, parallel/2)
Time to exec 4Kbyte batch: 13.878µs (ring=all, parallel/4)
Time to exec 4Kbyte batch: 13.841µs (ring=all, parallel/8)
Time to exec 4Kbyte batch: 14.188µs (ring=all, parallel/16)
Time to exec 4Kbyte batch: 13.747µs (ring=all, parallel/32)
Time to exec 4Kbyte batch: 13.734µs (ring=all, parallel/64)
Time to exec 4Kbyte batch: 13.727µs (ring=all, parallel/128)
Time to exec 4Kbyte batch: 13.947µs (ring=all, parallel/256)
Time to exec 4Kbyte batch: 12.230µs (ring=all, parallel/512)
Time to exec 4Kbyte batch: 12.147µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch: 12.617µs (ring=all, parallel/2048)

What this shows is that the submission overhead is ~3µs, which is
comparable with the execution time of a trivial (8-byte) batch but
insignificant compared with the time to execute the 4Kbyte batch. The
burst size therefore makes very little difference to the larger batches.

.Dave.
On 18/08/16 16:36, Dave Gordon wrote:
> On 18/08/16 16:27, Dave Gordon wrote:
>
> [snip]
>
>> Note that SKL GuC firmware 6.1 didn't support dual submission or lite
>> restore, whereas the next version (8.11) does. Therefore, with the newer
>> firmware we don't see the same slowdown when going to 1-at-a-time
>> round-robin. I have a different (new) test that shows this more clearly.
>
> This is with GuC version 6.1:
>
> skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS
>
> Time to exec 8-byte batch: 3.428µs (ring=render)
> Time to exec 8-byte batch: 2.444µs (ring=bsd)
> Time to exec 8-byte batch: 2.394µs (ring=blt)
> Time to exec 8-byte batch: 2.615µs (ring=vebox)
> Time to exec 8-byte batch: 2.625µs (ring=all, sequential)
> Time to exec 8-byte batch: 12.701µs (ring=all, parallel/1) ***
> Time to exec 8-byte batch: 7.259µs (ring=all, parallel/2)
> Time to exec 8-byte batch: 4.336µs (ring=all, parallel/4)
> Time to exec 8-byte batch: 2.937µs (ring=all, parallel/8)
> Time to exec 8-byte batch: 2.661µs (ring=all, parallel/16)
> Time to exec 8-byte batch: 2.245µs (ring=all, parallel/32)
> Time to exec 8-byte batch: 1.626µs (ring=all, parallel/64)
> Time to exec 8-byte batch: 2.170µs (ring=all, parallel/128)
> Time to exec 8-byte batch: 1.804µs (ring=all, parallel/256)
> Time to exec 8-byte batch: 2.602µs (ring=all, parallel/512)
> Time to exec 8-byte batch: 2.602µs (ring=all, parallel/1024)
> Time to exec 8-byte batch: 2.607µs (ring=all, parallel/2048)

And for comparison, here are the figures with v8.11:

# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch: 3.458µs (ring=render)
Time to exec 8-byte batch: 2.154µs (ring=bsd)
Time to exec 8-byte batch: 2.156µs (ring=blt)
Time to exec 8-byte batch: 2.156µs (ring=vebox)
Time to exec 8-byte batch: 2.388µs (ring=all, sequential)
Time to exec 8-byte batch: 5.897µs (ring=all, parallel/1)
Time to exec 8-byte batch: 4.669µs (ring=all, parallel/2)
Time to exec 8-byte batch: 4.278µs (ring=all, parallel/4)
Time to exec 8-byte batch: 2.410µs (ring=all, parallel/8)
Time to exec 8-byte batch: 2.165µs (ring=all, parallel/16)
Time to exec 8-byte batch: 2.158µs (ring=all, parallel/32)
Time to exec 8-byte batch: 1.594µs (ring=all, parallel/64)
Time to exec 8-byte batch: 1.583µs (ring=all, parallel/128)
Time to exec 8-byte batch: 2.473µs (ring=all, parallel/256)
Time to exec 8-byte batch: 2.264µs (ring=all, parallel/512)
Time to exec 8-byte batch: 2.357µs (ring=all, parallel/1024)
Time to exec 8-byte batch: 2.382µs (ring=all, parallel/2048)

All generally slightly faster, but parallel/1 is approximately twice as
fast, while parallel/64 is virtually unchanged, as are all the timings
for large batches.

.Dave.
On 18/08/16 16:27, Dave Gordon wrote:
> On 18/08/16 13:01, John Harrison wrote:

[snip]

>> Can you post the numbers that you get?
>>
>> I seem to get massive variability on my BDW. The render ring always
>> gives me around 2.9us/batch, but the other rings sometimes give me
>> somewhere in the region of 1.2us and sometimes 7-8us.
>
> skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
> IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
> Using GuC submission
> render: 594,944 cycles: 3.366us/batch
> bsd: 737,280 cycles: 2.715us/batch
> blt: 833,536 cycles: 2.400us/batch
> vebox: 710,656 cycles: 2.818us/batch
> Slowest engine was render, 3.366us/batch
> Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
> All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch, overlap 90.1%
> Subtest basic: SUCCESS (18.013s)

That was GuC f/w 6.1; here are the results from 8.11:

skylake# sudo ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc2-dsg-11313-g7430e5f-dsg-work-101 x86_64)
Using GuC submission
render: 585,728 cycles: 3.418us/batch
bsd: 930,816 cycles: 2.151us/batch
blt: 930,816 cycles: 2.150us/batch
vebox: 930,816 cycles: 2.150us/batch
Slowest engine was render, 3.418us/batch
Total for all 4 engines is 9.869us per cycle, average 2.467us/batch
All 4 engines (parallel/64): 5,668,864 cycles, average 1.765us/batch, overlap 89.9%
Subtest basic: SUCCESS (18.016s)

... showing minor improvements generally, especially on the non-render
engines.

.Dave.
On 03/08/2016 16:36, Dave Gordon wrote:
> The parallel execution test in gem_exec_nop chooses a pessimal
> distribution of work to multiple engines; specifically, it
> round-robins one batch to each engine in turn. As the workloads
> are trivial (NOPs), this results in each engine becoming idle
> between batches. Hence parallel submission is seen to take LONGER
> than the same number of batches executed sequentially.
>
> If on the other hand we send enough work to each engine to keep
> it busy until the next time we add to its queue (i.e. round-robin
> some larger number of batches to each engine in turn), then we can
> get true parallel execution and should find that it is FASTER than
> sequential execution.
>
> By experiment, burst sizes of between 8 and 256 are sufficient to
> keep multiple engines loaded, with the optimum (for this trivial
> workload) being around 64. This is expected to be lower (possibly
> as low as one) for more realistic (heavier) workloads.
>
> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
> ---
>  tests/gem_exec_nop.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
> index 9b89260..c2bd472 100644
> --- a/tests/gem_exec_nop.c
> +++ b/tests/gem_exec_nop.c
> @@ -166,14 +166,17 @@ static void all(int fd, uint32_t handle, int timeout)
>          gem_sync(fd, handle);
>          intel_detect_and_clear_missed_interrupts(fd);
>
> +#define BURST 64
> +
>          count = 0;
>          clock_gettime(CLOCK_MONOTONIC, &start);
>          do {
> -                for (int loop = 0; loop < 1024; loop++) {
> +                for (int loop = 0; loop < 1024/BURST; loop++) {
>                          for (int n = 0; n < nengine; n++) {
>                                  execbuf.flags &= ~ENGINE_FLAGS;
>                                  execbuf.flags |= engines[n];
> -                                gem_execbuf(fd, &execbuf);
> +                                for (int b = 0; b < BURST; ++b)
> +                                        gem_execbuf(fd, &execbuf);
>                          }
>                  }
>                  count += nengine * 1024;

Would be nice to have the burst size configurable, but either way...

Reviewed-by: John Harrison <john.c.harrison@intel.com>
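[Editorial note: one minimal way to make the burst size configurable, without
adding any command-line plumbing, would be to read it from an environment
variable. The sketch below is untested; the variable name IGT_NOP_BURST, the
default of 64 and the validation are illustrative assumptions, not part of the
posted patch or of the IGT framework.]

#include <stdlib.h>

/* Return the round-robin burst size to use, defaulting to 64. */
static int burst_size(void)
{
        const char *env = getenv("IGT_NOP_BURST");
        int burst = env ? atoi(env) : 64;

        /* Fall back to the default for nonsense values, and keep 1024
         * divisible by the burst so the per-cycle batch count stays exact.
         */
        if (burst < 1 || burst > 1024 || (1024 % burst) != 0)
                burst = 64;

        return burst;
}

[The loop in all() would then use "int burst = burst_size();" and
"for (int loop = 0; loop < 1024/burst; loop++)" in place of the BURST macro.]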
diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
index 9b89260..c2bd472 100644
--- a/tests/gem_exec_nop.c
+++ b/tests/gem_exec_nop.c
@@ -166,14 +166,17 @@ static void all(int fd, uint32_t handle, int timeout)
         gem_sync(fd, handle);
         intel_detect_and_clear_missed_interrupts(fd);
 
+#define BURST 64
+
         count = 0;
         clock_gettime(CLOCK_MONOTONIC, &start);
         do {
-                for (int loop = 0; loop < 1024; loop++) {
+                for (int loop = 0; loop < 1024/BURST; loop++) {
                         for (int n = 0; n < nengine; n++) {
                                 execbuf.flags &= ~ENGINE_FLAGS;
                                 execbuf.flags |= engines[n];
-                                gem_execbuf(fd, &execbuf);
+                                for (int b = 0; b < BURST; ++b)
+                                        gem_execbuf(fd, &execbuf);
                         }
                 }
                 count += nengine * 1024;
The parallel execution test in gem_exec_nop chooses a pessimal
distribution of work to multiple engines; specifically, it
round-robins one batch to each engine in turn. As the workloads
are trivial (NOPs), this results in each engine becoming idle
between batches. Hence parallel submission is seen to take LONGER
than the same number of batches executed sequentially.

If on the other hand we send enough work to each engine to keep
it busy until the next time we add to its queue (i.e. round-robin
some larger number of batches to each engine in turn), then we can
get true parallel execution and should find that it is FASTER than
sequential execution.

By experiment, burst sizes of between 8 and 256 are sufficient to
keep multiple engines loaded, with the optimum (for this trivial
workload) being around 64. This is expected to be lower (possibly
as low as one) for more realistic (heavier) workloads.

Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
---
 tests/gem_exec_nop.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)