[1/2] igt/gem_exec_nop: add burst submission to parallel execution test

Message ID 1470238607-34415-2-git-send-email-david.s.gordon@intel.com (mailing list archive)
State New, archived

Commit Message

Dave Gordon Aug. 3, 2016, 3:36 p.m. UTC
The parallel execution test in gem_exec_nop chooses a pessimal
distribution of work to multiple engines; specifically, it
round-robins one batch to each engine in turn. As the workloads
are trivial (NOPs), this results in each engine becoming idle
between batches. Hence parallel submission is seen to take LONGER
than the same number of batches executed sequentially.

If, on the other hand, we send enough work to each engine to keep
it busy until the next time we add to its queue (i.e. round-robin
some larger number of batches to each engine in turn), then we can
get true parallel execution and should find that it is FASTER than
sequential execution.

By experiment, burst sizes of between 8 and 256 are sufficient to
keep multiple engines loaded, with the optimum (for this trivial
workload) being around 64. This is expected to be lower (possibly
as low as one) for more realistic (heavier) workloads.

Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
---
 tests/gem_exec_nop.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Chris Wilson Aug. 3, 2016, 3:45 p.m. UTC | #1
On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
> The parallel execution test in gem_exec_nop chooses a pessimal
> distribution of work to multiple engines; specifically, it
> round-robins one batch to each engine in turn. As the workloads
> are trivial (NOPs), this results in each engine becoming idle
> between batches. Hence parallel submission is seen to take LONGER
> than the same number of batches executed sequentially.
> 
> If on the other hand we send enough work to each engine to keep
> it busy until the next time we add to its queue, (i.e. round-robin
> some larger number of batches to each engine in turn) then we can
> get true parallel execution and should find that it is FASTER than
> sequential execution.
> 
> By experiment, burst sizes of between 8 and 256 are sufficient to
> keep multiple engines loaded, with the optimum (for this trivial
> workload) being around 64. This is expected to be lower (possibly
> as low as one) for more realistic (heavier) workloads.

Quite funny. The driver submission overhead of A...A vs ABAB... engines
is nearly identical, at least as far as the analysis presented here.
-Chris
Dave Gordon Aug. 3, 2016, 4:05 p.m. UTC | #2
On 03/08/16 16:45, Chris Wilson wrote:
> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>> The parallel execution test in gem_exec_nop chooses a pessimal
>> distribution of work to multiple engines; specifically, it
>> round-robins one batch to each engine in turn. As the workloads
>> are trivial (NOPs), this results in each engine becoming idle
>> between batches. Hence parallel submission is seen to take LONGER
>> than the same number of batches executed sequentially.
>>
>> If on the other hand we send enough work to each engine to keep
>> it busy until the next time we add to its queue, (i.e. round-robin
>> some larger number of batches to each engine in turn) then we can
>> get true parallel execution and should find that it is FASTER than
>> sequential execution.
>>
>> By experiment, burst sizes of between 8 and 256 are sufficient to
>> keep multiple engines loaded, with the optimum (for this trivial
>> workload) being around 64. This is expected to be lower (possibly
>> as low as one) for more realistic (heavier) workloads.
>
> Quite funny. The driver submission overhead of A...A vs ABAB... engines
> is nearly identical, at least as far as the analysis presented here.
> -Chris

Correct; but because the workloads are so trivial, if we hand out jobs 
one at a time to each engine, the first will have finished the one batch 
it's been given before we get round to giving it a second one (even in 
execlist mode). If there are N engines, submitting a single batch takes 
S seconds, and the workload takes W seconds to execute, then if W < N*S 
the engine will be idle between batches. For example, if N is 4, W is 
2us, and S is 1us, then the engine will be idle some 50% of the time.
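
As a rough sketch of that arithmetic (purely illustrative, not code from 
the test):

	/* Idle fraction of one engine when N engines are each handed one
	 * batch per round-robin pass, a single submission costs S seconds
	 * and a batch executes in W seconds.  With W < N*S the engine is
	 * starved for (N*S - W) out of every N*S seconds.
	 */
	static double engine_idle_fraction(int nengines, double submit_s,
					   double work_s)
	{
		double refill = nengines * submit_s; /* time until we return */

		if (work_s >= refill)
			return 0.0;	/* engine never goes idle */

		return (refill - work_s) / refill;
	}

	/* engine_idle_fraction(4, 1e-6, 2e-6) gives 0.5, i.e. the 50% above */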

This wouldn't be an issue for more realistic workloads, where W >> S.
It only looks problematic because of the trivial nature of the work.

.Dave.
John Harrison Aug. 18, 2016, 12:01 p.m. UTC | #3
On 03/08/2016 17:05, Dave Gordon wrote:
> On 03/08/16 16:45, Chris Wilson wrote:
>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>> distribution of work to multiple engines; specifically, it
>>> round-robins one batch to each engine in turn. As the workloads
>>> are trivial (NOPs), this results in each engine becoming idle
>>> between batches. Hence parallel submission is seen to take LONGER
>>> than the same number of batches executed sequentially.
>>>
>>> If on the other hand we send enough work to each engine to keep
>>> it busy until the next time we add to its queue, (i.e. round-robin
>>> some larger number of batches to each engine in turn) then we can
>>> get true parallel execution and should find that it is FASTER than
>>> sequential execution.
>>>
>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>> keep multiple engines loaded, with the optimum (for this trivial
>>> workload) being around 64. This is expected to be lower (possibly
>>> as low as one) for more realistic (heavier) workloads.
>>
>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>> is nearly identical, at least as far as the analysis presented here.
>> -Chris
>
> Correct; but because the workloads are so trivial, if we hand out jobs 
> one at a time to each engine, the first will have finished the one 
> batch it's been given before we get round to giving it a second one 
> (even in execlist mode). If there are N engines, submitting a single 
> batch takes S seconds, and the workload takes W seconds to execute, 
> then if W < N*S the engine will be idle between batches. For example, 
> if N is 4, W is 2us, and S is 1us, then the engine will be idle some 
> 50% of the time.
>
> This wouldn't be an issue for more realistic workloads, where W >> S.
> It only looks problematic because of the trivial nature of the work.

Can you post the numbers that you get?

I seem to get massive variability on my BDW. The render ring always 
gives me around 2.9us/batch, but the other rings sometimes give me in the 
region of 1.2us and sometimes 7-8us.


>
> .Dave.
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Dave Gordon Aug. 18, 2016, 3:27 p.m. UTC | #4
On 18/08/16 13:01, John Harrison wrote:
> On 03/08/2016 17:05, Dave Gordon wrote:
>> On 03/08/16 16:45, Chris Wilson wrote:
>>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>>> distribution of work to multiple engines; specifically, it
>>>> round-robins one batch to each engine in turn. As the workloads
>>>> are trivial (NOPs), this results in each engine becoming idle
>>>> between batches. Hence parallel submission is seen to take LONGER
>>>> than the same number of batches executed sequentially.
>>>>
>>>> If on the other hand we send enough work to each engine to keep
>>>> it busy until the next time we add to its queue, (i.e. round-robin
>>>> some larger number of batches to each engine in turn) then we can
>>>> get true parallel execution and should find that it is FASTER than
>>>> sequential execution.
>>>>
>>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>>> keep multiple engines loaded, with the optimum (for this trivial
>>>> workload) being around 64. This is expected to be lower (possibly
>>>> as low as one) for more realistic (heavier) workloads.
>>>
>>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>>> is nearly identical, at least as far as the analysis presented here.
>>> -Chris
>>
>> Correct; but because the workloads are so trivial, if we hand out jobs
>> one at a time to each engine, the first will have finished the one
>> batch it's been given before we get round to giving it a second one
>> (even in execlist mode). If there are N engines, submitting a single
>> batch takes S seconds, and the workload takes W seconds to execute,
>> then if W < N*S the engine will be idle between batches. For example,
>> if N is 4, W is 2us, and S is 1us, then the engine will be idle some
>> 50% of the time.
>>
>> This wouldn't be an issue for more realistic workloads, where W >> S.
>> It only looks problematic because of the trivial nature of the work.
>
> Can you post the numbers that you get?
>
> I seem to get massive variability on my BDW. The render ring always
> gives me around 2.9us/batch, but the other rings sometimes give me in the
> region of 1.2us and sometimes 7-8us.

skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 
4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
Using GuC submission
render: 594,944 cycles: 3.366us/batch
bsd: 737,280 cycles: 2.715us/batch
blt: 833,536 cycles: 2.400us/batch
vebox: 710,656 cycles: 2.818us/batch
Slowest engine was render, 3.366us/batch
Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch, 
overlap 90.1%
Subtest basic: SUCCESS (18.013s)

These are the results of running the modified test on SKL with GuC 
submission.

If the GPU could execute a trivial batch in less time than it takes the 
CPU to submit one, then CPU/driver/GuC performance would become the 
determining factor -- every batch would be completed before the next one 
was submitted to the GPU even when they're going to the same engine.

If the GPU takes longer to execute a batch than N times the time taken 
for the driver to submit it (where N is the number of engines), then the 
GPU performance would become the limiting factor; the CPU would be able 
to hand out one batch to each engine, and by the time it returned to the 
first, that engine would still not be idle.

But in crossover territory, where the batch takes longer to execute than 
the time to submit it, but less than N times as long, the round-robin 
burst size (number of batches sent to each engine before moving to the 
next) can make a big difference, primarily because the submission 
mechanism gets the opportunity to use dual submission and/or lite 
restore, effectively reducing the number of separate writes to the ELSP 
and hence the s/w overhead per batch.
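
A sketch of how those three regimes fall out of S, W and N (again just 
illustrative, not anything in the driver or the test):

	/* S = per-batch submission cost, W = per-batch execution time,
	 * N = number of engines.  Sketch of the regimes described above.
	 */
	enum regime { CPU_LIMITED, CROSSOVER, GPU_LIMITED };

	static enum regime classify(double S, double W, int N)
	{
		if (W < S)
			return CPU_LIMITED;  /* batch done before next submit */
		if (W > N * S)
			return GPU_LIMITED;  /* engines stay busy even at burst 1 */
		return CROSSOVER;            /* burst size / lite restore matter */
	}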

Note that SKL GuC firmware 6.1 didn't support dual submission or lite 
restore, whereas the next version (8.11) does. Therefore, with that 
firmware we don't see the same slowdown when going to 1-at-a-time 
round-robin. I have a different (new) test that shows this more clearly.

.Dave.
Dave Gordon Aug. 18, 2016, 3:36 p.m. UTC | #5
On 18/08/16 16:27, Dave Gordon wrote:

[snip]

> Note that SKL GuC firmware 6.1 didn't support dual submission or lite
> restore, whereas the next version (8.11) does. Therefore, with that
> firmware we don't see the same slowdown when going to 1-at-a-time
> round-robin. I have a different (new) test that shows this more clearly.

This is with GuC version 6.1:

skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch:	  3.428µs (ring=render)
Time to exec 8-byte batch:	  2.444µs (ring=bsd)
Time to exec 8-byte batch:	  2.394µs (ring=blt)
Time to exec 8-byte batch:	  2.615µs (ring=vebox)
Time to exec 8-byte batch:	  2.625µs (ring=all, sequential)
Time to exec 8-byte batch:	 12.701µs (ring=all, parallel/1) ***
Time to exec 8-byte batch:	  7.259µs (ring=all, parallel/2)
Time to exec 8-byte batch:	  4.336µs (ring=all, parallel/4)
Time to exec 8-byte batch:	  2.937µs (ring=all, parallel/8)
Time to exec 8-byte batch:	  2.661µs (ring=all, parallel/16)
Time to exec 8-byte batch:	  2.245µs (ring=all, parallel/32)
Time to exec 8-byte batch:	  1.626µs (ring=all, parallel/64)
Time to exec 8-byte batch:	  2.170µs (ring=all, parallel/128)
Time to exec 8-byte batch:	  1.804µs (ring=all, parallel/256)
Time to exec 8-byte batch:	  2.602µs (ring=all, parallel/512)
Time to exec 8-byte batch:	  2.602µs (ring=all, parallel/1024)
Time to exec 8-byte batch:	  2.607µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch:	 14.835µs (ring=render)
Time to exec 4Kbyte batch:	 11.787µs (ring=bsd)
Time to exec 4Kbyte batch:	 11.533µs (ring=blt)
Time to exec 4Kbyte batch:	 11.991µs (ring=vebox)
Time to exec 4Kbyte batch:	 12.444µs (ring=all, sequential)
Time to exec 4Kbyte batch:	 16.211µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:	 13.943µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:	 13.878µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:	 13.841µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:	 14.188µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:	 13.747µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:	 13.734µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:	 13.727µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:	 13.947µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:	 12.230µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:	 12.147µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:	 12.617µs (ring=all, parallel/2048)

What this shows is that the submission overhead is ~3us, which is 
comparable with the execution time of a trivial (8-byte) batch, but 
insignificant compared with the time to execute the 4Kbyte batch. The 
burst size therefore makes very little difference to the larger batches.

.Dave.
Dave Gordon Aug. 18, 2016, 3:54 p.m. UTC | #6
On 18/08/16 16:36, Dave Gordon wrote:
> On 18/08/16 16:27, Dave Gordon wrote:
>
> [snip]
>
>> Note that SKL GuC firmware 6.1 didn't support dual submission or lite
>> restore, whereas the next version (8.11) does. Therefore, with that
>> firmware we don't see the same slowdown when going to 1-at-a-time
>> round-robin. I have a different (new) test that shows this more clearly.
>
> This is with GuC version 6.1:
>
> skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS
>
> Time to exec 8-byte batch:      3.428µs (ring=render)
> Time to exec 8-byte batch:      2.444µs (ring=bsd)
> Time to exec 8-byte batch:      2.394µs (ring=blt)
> Time to exec 8-byte batch:      2.615µs (ring=vebox)
> Time to exec 8-byte batch:      2.625µs (ring=all, sequential)
> Time to exec 8-byte batch:     12.701µs (ring=all, parallel/1) ***
> Time to exec 8-byte batch:      7.259µs (ring=all, parallel/2)
> Time to exec 8-byte batch:      4.336µs (ring=all, parallel/4)
> Time to exec 8-byte batch:      2.937µs (ring=all, parallel/8)
> Time to exec 8-byte batch:      2.661µs (ring=all, parallel/16)
> Time to exec 8-byte batch:      2.245µs (ring=all, parallel/32)
> Time to exec 8-byte batch:      1.626µs (ring=all, parallel/64)
> Time to exec 8-byte batch:      2.170µs (ring=all, parallel/128)
> Time to exec 8-byte batch:      1.804µs (ring=all, parallel/256)
> Time to exec 8-byte batch:      2.602µs (ring=all, parallel/512)
> Time to exec 8-byte batch:      2.602µs (ring=all, parallel/1024)
> Time to exec 8-byte batch:      2.607µs (ring=all, parallel/2048)

And for comparison, here are the figures with v8.11:

# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch:	  3.458µs (ring=render)
Time to exec 8-byte batch:	  2.154µs (ring=bsd)
Time to exec 8-byte batch:	  2.156µs (ring=blt)
Time to exec 8-byte batch:	  2.156µs (ring=vebox)
Time to exec 8-byte batch:	  2.388µs (ring=all, sequential)
Time to exec 8-byte batch:	  5.897µs (ring=all, parallel/1)
Time to exec 8-byte batch:	  4.669µs (ring=all, parallel/2)
Time to exec 8-byte batch:	  4.278µs (ring=all, parallel/4)
Time to exec 8-byte batch:	  2.410µs (ring=all, parallel/8)
Time to exec 8-byte batch:	  2.165µs (ring=all, parallel/16)
Time to exec 8-byte batch:	  2.158µs (ring=all, parallel/32)
Time to exec 8-byte batch:	  1.594µs (ring=all, parallel/64)
Time to exec 8-byte batch:	  1.583µs (ring=all, parallel/128)
Time to exec 8-byte batch:	  2.473µs (ring=all, parallel/256)
Time to exec 8-byte batch:	  2.264µs (ring=all, parallel/512)
Time to exec 8-byte batch:	  2.357µs (ring=all, parallel/1024)
Time to exec 8-byte batch:	  2.382µs (ring=all, parallel/2048)

All generally slightly faster, but parallel/1 is approximately twice as 
fast, while parallel/64 is virtually unchanged, as are all the timings 
for large batches.

.Dave.
Dave Gordon Aug. 18, 2016, 3:59 p.m. UTC | #7
On 18/08/16 16:27, Dave Gordon wrote:
> On 18/08/16 13:01, John Harrison wrote:

[snip]

>> Can you post the numbers that you get?
>>
>> I seem to get massive variability on my BDW. The render ring always
>> gives me around 2.9us/batch, but the other rings sometimes give me in the
>> region of 1.2us and sometimes 7-8us.
>
> skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
> IGT-Version: 1.15-gd09ad86 (x86_64) (Linux:
> 4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
> Using GuC submission
> render: 594,944 cycles: 3.366us/batch
> bsd: 737,280 cycles: 2.715us/batch
> blt: 833,536 cycles: 2.400us/batch
> vebox: 710,656 cycles: 2.818us/batch
> Slowest engine was render, 3.366us/batch
> Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
> All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch,
> overlap 90.1%
> Subtest basic: SUCCESS (18.013s)

That was GuC f/w 6.1, here's the results from 8.11:

skylake# sudo ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 
4.8.0-rc2-dsg-11313-g7430e5f-dsg-work-101 x86_64)
Using GuC submission
render: 585,728 cycles: 3.418us/batch
bsd: 930,816 cycles: 2.151us/batch
blt: 930,816 cycles: 2.150us/batch
vebox: 930,816 cycles: 2.150us/batch
Slowest engine was render, 3.418us/batch
Total for all 4 engines is 9.869us per cycle, average 2.467us/batch
All 4 engines (parallel/64): 5,668,864 cycles, average 1.765us/batch, 
overlap 89.9%
Subtest basic: SUCCESS (18.016s)

... showing minor improvements generally, especially on the non-render engines.

.Dave.
John Harrison Aug. 22, 2016, 2:28 p.m. UTC | #8
On 03/08/2016 16:36, Dave Gordon wrote:
> The parallel execution test in gem_exec_nop chooses a pessimal
> distribution of work to multiple engines; specifically, it
> round-robins one batch to each engine in turn. As the workloads
> are trivial (NOPs), this results in each engine becoming idle
> between batches. Hence parallel submission is seen to take LONGER
> than the same number of batches executed sequentially.
>
> If on the other hand we send enough work to each engine to keep
> it busy until the next time we add to its queue, (i.e. round-robin
> some larger number of batches to each engine in turn) then we can
> get true parallel execution and should find that it is FASTER than
> sequential execuion.
>
> By experiment, burst sizes of between 8 and 256 are sufficient to
> keep multiple engines loaded, with the optimum (for this trivial
> workload) being around 64. This is expected to be lower (possibly
> as low as one) for more realistic (heavier) workloads.
>
> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
> ---
>   tests/gem_exec_nop.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
> index 9b89260..c2bd472 100644
> --- a/tests/gem_exec_nop.c
> +++ b/tests/gem_exec_nop.c
> @@ -166,14 +166,17 @@ static void all(int fd, uint32_t handle, int timeout)
>   	gem_sync(fd, handle);
>   	intel_detect_and_clear_missed_interrupts(fd);
>   
> +#define	BURST	64
> +
>   	count = 0;
>   	clock_gettime(CLOCK_MONOTONIC, &start);
>   	do {
> -		for (int loop = 0; loop < 1024; loop++) {
> +		for (int loop = 0; loop < 1024/BURST; loop++) {
>   			for (int n = 0; n < nengine; n++) {
>   				execbuf.flags &= ~ENGINE_FLAGS;
>   				execbuf.flags |= engines[n];
> -				gem_execbuf(fd, &execbuf);
> +				for (int b = 0; b < BURST; ++b)
> +					gem_execbuf(fd, &execbuf);
>   			}
>   		}
>   		count += nengine * 1024;

Would be nice to have the burst size configurable but either way...

Reviewed-by: John Harrison <john.c.harrison@intel.com>
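
One possible shape for the configurable burst size suggested above (a 
sketch only, using a hypothetical GEM_EXEC_NOP_BURST environment variable; 
not part of the reviewed patch):

	#include <stdlib.h>

	/* Hypothetical run-time override for the BURST constant. */
	static int burst_size(void)
	{
		const char *env = getenv("GEM_EXEC_NOP_BURST");
		int burst = env ? atoi(env) : 64;	/* default matches BURST */

		return burst > 0 ? burst : 64;		/* reject nonsense values */
	}

The loop accounting (1024/BURST passes, count += nengine * 1024) would also 
need adjusting for burst sizes that don't divide 1024 evenly.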

Patch

diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
index 9b89260..c2bd472 100644
--- a/tests/gem_exec_nop.c
+++ b/tests/gem_exec_nop.c
@@ -166,14 +166,17 @@  static void all(int fd, uint32_t handle, int timeout)
 	gem_sync(fd, handle);
 	intel_detect_and_clear_missed_interrupts(fd);
 
+#define	BURST	64
+
 	count = 0;
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	do {
-		for (int loop = 0; loop < 1024; loop++) {
+		for (int loop = 0; loop < 1024/BURST; loop++) {
 			for (int n = 0; n < nengine; n++) {
 				execbuf.flags &= ~ENGINE_FLAGS;
 				execbuf.flags |= engines[n];
-				gem_execbuf(fd, &execbuf);
+				for (int b = 0; b < BURST; ++b)
+					gem_execbuf(fd, &execbuf);
 			}
 		}
 		count += nengine * 1024;