mbox series

[v10,0/6] DRM scheduler kunit tests

Message ID 20250324092633.49746-1-tvrtko.ursulin@igalia.com (mailing list archive)
Headers show
Series DRM scheduler kunit tests | expand

Message

Tvrtko Ursulin March 24, 2025, 9:26 a.m. UTC
There has repeatedly been quite a bit of apprehension when any change to the DRM
scheduler is proposed, with two main reasons being code base is considered
fragile, not well understood and not very well documented, and secondly the lack
of systematic testing outside the vendor specific tests suites and/or test
farms.

This series is an attempt to dislodge this status quo by adding some unit tests
using the kunit framework.

General approach is that there is a mock "hardware" backend which can be
controlled from tests, which in turn allows exercising various scheduler code
paths.

Only some simple basic tests get added in the series and hopefully it is easy to
understand what tests are doing.

An obligatory "screenshot" for reference:

[14:09:05] ============ drm_sched_basic_tests (3 subtests) ============
[14:09:06] [PASSED] drm_sched_basic_submit
[14:09:06] ================== drm_sched_basic_test  ===================
[14:09:06] [PASSED] A queue of jobs in a single entity
[14:09:06] [PASSED] A chain of dependent jobs across multiple entities
[14:09:06] [PASSED] Multiple independent job queues
[14:09:06] [PASSED] Multiple inter-dependent job queues
[14:09:07] ============== [PASSED] drm_sched_basic_test ===============
[14:09:07] [PASSED] drm_sched_basic_entity_cleanup
[14:09:07] ============== [PASSED] drm_sched_basic_tests ==============
[14:09:07] ======== drm_sched_basic_timeout_tests (1 subtest) =========
[14:09:08] [PASSED] drm_sched_basic_timeout
[14:09:08] ========== [PASSED] drm_sched_basic_timeout_tests ==========
[14:09:08] ======= drm_sched_basic_priority_tests (2 subtests) ========
[14:09:10] [PASSED] drm_sched_priorities
[14:09:10] [PASSED] drm_sched_change_priority
[14:09:10] ========= [PASSED] drm_sched_basic_priority_tests ==========
[14:09:10] ====== drm_sched_basic_modify_sched_tests (1 subtest) ======
[14:09:11] [PASSED] drm_sched_test_modify_sched
[14:09:11] ======= [PASSED] drm_sched_basic_modify_sched_tests ========
[14:09:11] ======== drm_sched_basic_credits_tests (1 subtest) =========
[14:09:12] [PASSED] drm_sched_test_credits
[14:09:12] ========== [PASSED] drm_sched_basic_credits_tests ==========
[14:09:12] ============================================================
[14:09:12] Testing complete. Ran 11 tests: passed: 11
[14:09:13] Elapsed time: 13.539s total, 0.001s configuring, 3.004s building, 10.462s running

v2:
 * Parameterize a bunch of similar tests.
 * Improve test commentary.
 * Rename TDR test to timeout. (Christian)
 * Improve quality and consistency of naming. (Philipp)

RFC v2 -> series v1:
 * Rebased for drm_sched_init changes.
 * Fixed modular build.
 * Added some comments.
 * Filename renames. (Philipp)

v2:
 * Dealt with a bunch of checkpatch warnings.

v3:
 * Some mock API renames, kerneldoc grammar fixes and indentation fixes.

v4:
 * Fix use after free caused by relying on scheduler fence for querying status.
 * Kerneldoc fixes.

v5:
 * Cleanup in-flight jobs on scheduler shutdown.
 * Change hang_limit to 1.

v6:
 * Use KUNIT_ASSERT_TRUE/FALSE.
 * Fixed patch titles.
 * Added credit_limit test.
 * Added CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL.

v7:
 * v6 omitted to send the first patch by mistake.

v8:
 * Removed CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL for now.
 * Added Christian's acks.

v9:
 * Fixed a potential memory leak caused by a race condition on mock scheduler
   shutdown. In order to reliably clean up everything, we have keep track of
   jobs even past the signalling stage, all until either DRM sched core managed
   to run the ->free_job() callback, or until mock scheduler teardown from the
   test.

v10:
 * Rebase for a merge conflict in Kconfig.

Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>

Tvrtko Ursulin (6):
  drm: Move some options to separate new Kconfig
  drm/sched: Add scheduler unit testing infrastructure and some basic
    tests
  drm/sched: Add a simple timeout test
  drm/sched: Add basic priority tests
  drm/sched: Add a basic test for modifying entities scheduler list
  drm/sched: Add a basic test for checking credit limit

 drivers/gpu/drm/Kconfig                       | 110 +---
 drivers/gpu/drm/Kconfig.debug                 | 116 +++++
 drivers/gpu/drm/scheduler/.kunitconfig        |  12 +
 drivers/gpu/drm/scheduler/Makefile            |   2 +
 drivers/gpu/drm/scheduler/tests/Makefile      |   7 +
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 359 +++++++++++++
 drivers/gpu/drm/scheduler/tests/sched_tests.h | 226 +++++++++
 drivers/gpu/drm/scheduler/tests/tests_basic.c | 476 ++++++++++++++++++
 8 files changed, 1203 insertions(+), 105 deletions(-)
 create mode 100644 drivers/gpu/drm/Kconfig.debug
 create mode 100644 drivers/gpu/drm/scheduler/.kunitconfig
 create mode 100644 drivers/gpu/drm/scheduler/tests/Makefile
 create mode 100644 drivers/gpu/drm/scheduler/tests/mock_scheduler.c
 create mode 100644 drivers/gpu/drm/scheduler/tests/sched_tests.h
 create mode 100644 drivers/gpu/drm/scheduler/tests/tests_basic.c

Comments

Philipp Stanner March 24, 2025, 9:44 a.m. UTC | #1
On Mon, 2025-03-24 at 09:26 +0000, Tvrtko Ursulin wrote:
> There has repeatedly been quite a bit of apprehension when any change
> to the DRM
> scheduler is proposed, with two main reasons being code base is
> considered
> fragile, not well understood and not very well documented, and
> secondly the lack
> of systematic testing outside the vendor specific tests suites and/or
> test
> farms.
> 
> This series is an attempt to dislodge this status quo by adding some
> unit tests
> using the kunit framework.
> 
> General approach is that there is a mock "hardware" backend which can
> be
> controlled from tests, which in turn allows exercising various
> scheduler code
> paths.
> 
> Only some simple basic tests get added in the series and hopefully it
> is easy to
> understand what tests are doing.
> 
> An obligatory "screenshot" for reference:
> 
> [14:09:05] ============ drm_sched_basic_tests (3 subtests)
> ============
> [14:09:06] [PASSED] drm_sched_basic_submit
> [14:09:06] ================== drm_sched_basic_test 
> ===================
> [14:09:06] [PASSED] A queue of jobs in a single entity
> [14:09:06] [PASSED] A chain of dependent jobs across multiple
> entities
> [14:09:06] [PASSED] Multiple independent job queues
> [14:09:06] [PASSED] Multiple inter-dependent job queues
> [14:09:07] ============== [PASSED] drm_sched_basic_test
> ===============
> [14:09:07] [PASSED] drm_sched_basic_entity_cleanup
> [14:09:07] ============== [PASSED] drm_sched_basic_tests
> ==============
> [14:09:07] ======== drm_sched_basic_timeout_tests (1 subtest)
> =========
> [14:09:08] [PASSED] drm_sched_basic_timeout
> [14:09:08] ========== [PASSED] drm_sched_basic_timeout_tests
> ==========
> [14:09:08] ======= drm_sched_basic_priority_tests (2 subtests)
> ========
> [14:09:10] [PASSED] drm_sched_priorities
> [14:09:10] [PASSED] drm_sched_change_priority
> [14:09:10] ========= [PASSED] drm_sched_basic_priority_tests
> ==========
> [14:09:10] ====== drm_sched_basic_modify_sched_tests (1 subtest)
> ======
> [14:09:11] [PASSED] drm_sched_test_modify_sched
> [14:09:11] ======= [PASSED] drm_sched_basic_modify_sched_tests
> ========
> [14:09:11] ======== drm_sched_basic_credits_tests (1 subtest)
> =========
> [14:09:12] [PASSED] drm_sched_test_credits
> [14:09:12] ========== [PASSED] drm_sched_basic_credits_tests
> ==========
> [14:09:12]
> ============================================================
> [14:09:12] Testing complete. Ran 11 tests: passed: 11
> [14:09:13] Elapsed time: 13.539s total, 0.001s configuring, 3.004s
> building, 10.462s running
> 
> v2:
>  * Parameterize a bunch of similar tests.
>  * Improve test commentary.
>  * Rename TDR test to timeout. (Christian)
>  * Improve quality and consistency of naming. (Philipp)
> 
> RFC v2 -> series v1:
>  * Rebased for drm_sched_init changes.
>  * Fixed modular build.
>  * Added some comments.
>  * Filename renames. (Philipp)
> 
> v2:
>  * Dealt with a bunch of checkpatch warnings.
> 
> v3:
>  * Some mock API renames, kerneldoc grammar fixes and indentation
> fixes.
> 
> v4:
>  * Fix use after free caused by relying on scheduler fence for
> querying status.
>  * Kerneldoc fixes.
> 
> v5:
>  * Cleanup in-flight jobs on scheduler shutdown.
>  * Change hang_limit to 1.
> 
> v6:
>  * Use KUNIT_ASSERT_TRUE/FALSE.
>  * Fixed patch titles.
>  * Added credit_limit test.
>  * Added CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL.
> 
> v7:
>  * v6 omitted to send the first patch by mistake.
> 
> v8:
>  * Removed CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL for now.
>  * Added Christian's acks.
> 
> v9:
>  * Fixed a potential memory leak caused by a race condition on mock
> scheduler
>    shutdown. In order to reliably clean up everything, we have keep
> track of
>    jobs even past the signalling stage, all until either DRM sched
> core managed
>    to run the ->free_job() callback, or until mock scheduler teardown
> from the
>    test.
> 
> v10:
>  * Rebase for a merge conflict in Kconfig.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> 
> Tvrtko Ursulin (6):
>   drm: Move some options to separate new Kconfig
>   drm/sched: Add scheduler unit testing infrastructure and some basic
>     tests
>   drm/sched: Add a simple timeout test
>   drm/sched: Add basic priority tests
>   drm/sched: Add a basic test for modifying entities scheduler list
>   drm/sched: Add a basic test for checking credit limit
> 
>  drivers/gpu/drm/Kconfig                       | 110 +---
>  drivers/gpu/drm/Kconfig.debug                 | 116 +++++
>  drivers/gpu/drm/scheduler/.kunitconfig        |  12 +
>  drivers/gpu/drm/scheduler/Makefile            |   2 +
>  drivers/gpu/drm/scheduler/tests/Makefile      |   7 +
>  .../gpu/drm/scheduler/tests/mock_scheduler.c  | 359 +++++++++++++
>  drivers/gpu/drm/scheduler/tests/sched_tests.h | 226 +++++++++
>  drivers/gpu/drm/scheduler/tests/tests_basic.c | 476
> ++++++++++++++++++
>  8 files changed, 1203 insertions(+), 105 deletions(-)
>  create mode 100644 drivers/gpu/drm/Kconfig.debug
>  create mode 100644 drivers/gpu/drm/scheduler/.kunitconfig
>  create mode 100644 drivers/gpu/drm/scheduler/tests/Makefile
>  create mode 100644 drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>  create mode 100644 drivers/gpu/drm/scheduler/tests/sched_tests.h
>  create mode 100644 drivers/gpu/drm/scheduler/tests/tests_basic.c
> 


Applied to drm-misc-next

Thanks for your endurance!
P.
Tvrtko Ursulin March 24, 2025, 11:25 a.m. UTC | #2
On 24/03/2025 09:44, Philipp Stanner wrote:
> On Mon, 2025-03-24 at 09:26 +0000, Tvrtko Ursulin wrote:
>> There has repeatedly been quite a bit of apprehension when any change
>> to the DRM
>> scheduler is proposed, with two main reasons being code base is
>> considered
>> fragile, not well understood and not very well documented, and
>> secondly the lack
>> of systematic testing outside the vendor specific tests suites and/or
>> test
>> farms.
>>
>> This series is an attempt to dislodge this status quo by adding some
>> unit tests
>> using the kunit framework.
>>
>> General approach is that there is a mock "hardware" backend which can
>> be
>> controlled from tests, which in turn allows exercising various
>> scheduler code
>> paths.
>>
>> Only some simple basic tests get added in the series and hopefully it
>> is easy to
>> understand what tests are doing.
>>
>> An obligatory "screenshot" for reference:
>>
>> [14:09:05] ============ drm_sched_basic_tests (3 subtests)
>> ============
>> [14:09:06] [PASSED] drm_sched_basic_submit
>> [14:09:06] ================== drm_sched_basic_test
>> ===================
>> [14:09:06] [PASSED] A queue of jobs in a single entity
>> [14:09:06] [PASSED] A chain of dependent jobs across multiple
>> entities
>> [14:09:06] [PASSED] Multiple independent job queues
>> [14:09:06] [PASSED] Multiple inter-dependent job queues
>> [14:09:07] ============== [PASSED] drm_sched_basic_test
>> ===============
>> [14:09:07] [PASSED] drm_sched_basic_entity_cleanup
>> [14:09:07] ============== [PASSED] drm_sched_basic_tests
>> ==============
>> [14:09:07] ======== drm_sched_basic_timeout_tests (1 subtest)
>> =========
>> [14:09:08] [PASSED] drm_sched_basic_timeout
>> [14:09:08] ========== [PASSED] drm_sched_basic_timeout_tests
>> ==========
>> [14:09:08] ======= drm_sched_basic_priority_tests (2 subtests)
>> ========
>> [14:09:10] [PASSED] drm_sched_priorities
>> [14:09:10] [PASSED] drm_sched_change_priority
>> [14:09:10] ========= [PASSED] drm_sched_basic_priority_tests
>> ==========
>> [14:09:10] ====== drm_sched_basic_modify_sched_tests (1 subtest)
>> ======
>> [14:09:11] [PASSED] drm_sched_test_modify_sched
>> [14:09:11] ======= [PASSED] drm_sched_basic_modify_sched_tests
>> ========
>> [14:09:11] ======== drm_sched_basic_credits_tests (1 subtest)
>> =========
>> [14:09:12] [PASSED] drm_sched_test_credits
>> [14:09:12] ========== [PASSED] drm_sched_basic_credits_tests
>> ==========
>> [14:09:12]
>> ============================================================
>> [14:09:12] Testing complete. Ran 11 tests: passed: 11
>> [14:09:13] Elapsed time: 13.539s total, 0.001s configuring, 3.004s
>> building, 10.462s running
>>
>> v2:
>>   * Parameterize a bunch of similar tests.
>>   * Improve test commentary.
>>   * Rename TDR test to timeout. (Christian)
>>   * Improve quality and consistency of naming. (Philipp)
>>
>> RFC v2 -> series v1:
>>   * Rebased for drm_sched_init changes.
>>   * Fixed modular build.
>>   * Added some comments.
>>   * Filename renames. (Philipp)
>>
>> v2:
>>   * Dealt with a bunch of checkpatch warnings.
>>
>> v3:
>>   * Some mock API renames, kerneldoc grammar fixes and indentation
>> fixes.
>>
>> v4:
>>   * Fix use after free caused by relying on scheduler fence for
>> querying status.
>>   * Kerneldoc fixes.
>>
>> v5:
>>   * Cleanup in-flight jobs on scheduler shutdown.
>>   * Change hang_limit to 1.
>>
>> v6:
>>   * Use KUNIT_ASSERT_TRUE/FALSE.
>>   * Fixed patch titles.
>>   * Added credit_limit test.
>>   * Added CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL.
>>
>> v7:
>>   * v6 omitted to send the first patch by mistake.
>>
>> v8:
>>   * Removed CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL for now.
>>   * Added Christian's acks.
>>
>> v9:
>>   * Fixed a potential memory leak caused by a race condition on mock
>> scheduler
>>     shutdown. In order to reliably clean up everything, we have keep
>> track of
>>     jobs even past the signalling stage, all until either DRM sched
>> core managed
>>     to run the ->free_job() callback, or until mock scheduler teardown
>> from the
>>     test.
>>
>> v10:
>>   * Rebase for a merge conflict in Kconfig.
>>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>>
>> Tvrtko Ursulin (6):
>>    drm: Move some options to separate new Kconfig
>>    drm/sched: Add scheduler unit testing infrastructure and some basic
>>      tests
>>    drm/sched: Add a simple timeout test
>>    drm/sched: Add basic priority tests
>>    drm/sched: Add a basic test for modifying entities scheduler list
>>    drm/sched: Add a basic test for checking credit limit
>>
>>   drivers/gpu/drm/Kconfig                       | 110 +---
>>   drivers/gpu/drm/Kconfig.debug                 | 116 +++++
>>   drivers/gpu/drm/scheduler/.kunitconfig        |  12 +
>>   drivers/gpu/drm/scheduler/Makefile            |   2 +
>>   drivers/gpu/drm/scheduler/tests/Makefile      |   7 +
>>   .../gpu/drm/scheduler/tests/mock_scheduler.c  | 359 +++++++++++++
>>   drivers/gpu/drm/scheduler/tests/sched_tests.h | 226 +++++++++
>>   drivers/gpu/drm/scheduler/tests/tests_basic.c | 476
>> ++++++++++++++++++
>>   8 files changed, 1203 insertions(+), 105 deletions(-)
>>   create mode 100644 drivers/gpu/drm/Kconfig.debug
>>   create mode 100644 drivers/gpu/drm/scheduler/.kunitconfig
>>   create mode 100644 drivers/gpu/drm/scheduler/tests/Makefile
>>   create mode 100644 drivers/gpu/drm/scheduler/tests/mock_scheduler.c
>>   create mode 100644 drivers/gpu/drm/scheduler/tests/sched_tests.h
>>   create mode 100644 drivers/gpu/drm/scheduler/tests/tests_basic.c
>>
> 
> 
> Applied to drm-misc-next
> 
> Thanks for your endurance!

Thank you for the much needed scrutiny and testing!

Regards,

Tvrtko