[v3,0/3] drm/i915: implement internal workqueues

Message ID: 20230530111534.871403-1-luciano.coelho@intel.com

Message

Luca Coelho May 30, 2023, 11:15 a.m. UTC
Hi,

This series implements internal workqueues in the i915 driver in order
to avoid using the system workqueue.  We add one generic workqueue to
the drm_i915_private structure, convert the wake reference code to use
it, and add a local workqueue in a self-test.

This is based on Tetsuo's work[1] and is required to get rid of the
flush_scheduled_work() usage.
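
In sketch form, patch 2/3 boils down to the following (simplified for
illustration, not the literal hunks; the queue name string and the
error-path label here are assumptions):

/* i915_drv.h */
struct drm_i915_private {
	/* ... */
	/*
	 * Unordered workqueue for work items that previously went
	 * through the global system_wq.  A per-device queue can be
	 * flushed and destroyed on device removal, so no work can
	 * outlive the device.
	 */
	struct workqueue_struct *unordered_wq;
};

/* i915_driver.c, during driver init */
i915->unordered_wq = alloc_workqueue("i915-unordered", 0, 0);
if (!i915->unordered_wq)
	goto err_wq;	/* destroy previously created queues here */

/* on driver teardown (and in the init error path, see the v3 notes) */
destroy_workqueue(i915->unordered_wq);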

As discussed, we can either take Tetsuo's implementation first, which
uses a module-global workqueue, and then apply my series on top of it
to switch to per-device workqueues, or apply my series directly.  A
third option would be to keep the workqueue module-global.  I'm fine
with any of these options; it's up to the maintainers to decide which
one to take.

In v2:
   * Removed the per-engine workqueue and the wakeref-specific queue;
   * Renamed the workqueue from "i915-wq" to "unordered_wq";
   * Added a comment about when the new workqueue should be used;
   * Changed the wakeref structs to store i915 instead of rpm (see the
     sketch after these notes);

In v3:
   * Fixed the init error path to destroy the queue;
   * Improved the new queue's description a bit;
   * Dropped a stray empty-line removal;
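
The wakeref part (patch 1/3) amounts to swapping the stored pointer,
roughly (an illustrative before/after, not the literal hunk):

/* intel_wakeref.h */
struct intel_wakeref {
	/* ... */
-	struct intel_runtime_pm *rpm;
+	struct drm_i915_private *i915;
	struct delayed_work work;
};

Storing i915 gives the wakeref workers a path to the per-device
workqueues instead of only to the runtime-pm structure.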

Please review.

[1] https://patchwork.freedesktop.org/series/114608/

Cheers,
Luca.


Luca Coelho (3):
  drm/i915: use pointer to i915 instead of rpm in wakeref
  drm/i915: add a dedicated workqueue inside drm_i915_private
  drm/i915/selftests: add local workqueue for SW fence selftest

 drivers/gpu/drm/i915/display/intel_display.c  |  5 ++--
 .../drm/i915/display/intel_display_driver.c   |  2 +-
 drivers/gpu/drm/i915/display/intel_dmc.c      |  2 +-
 drivers/gpu/drm/i915/display/intel_dp.c       |  2 +-
 .../drm/i915/display/intel_dp_link_training.c |  3 ++-
 drivers/gpu/drm/i915/display/intel_drrs.c     |  4 +++-
 drivers/gpu/drm/i915/display/intel_fbc.c      |  2 +-
 drivers/gpu/drm/i915/display/intel_fbdev.c    |  3 ++-
 drivers/gpu/drm/i915/display/intel_hdcp.c     | 23 +++++++++++--------
 drivers/gpu/drm/i915/display/intel_hotplug.c  | 18 ++++++++++-----
 drivers/gpu/drm/i915/display/intel_opregion.c |  3 ++-
 drivers/gpu/drm/i915/display/intel_pps.c      |  4 +++-
 drivers/gpu/drm/i915/display/intel_psr.c      |  8 ++++---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c     |  4 +---
 .../drm/i915/gt/intel_execlists_submission.c  |  5 ++--
 .../gpu/drm/i915/gt/intel_gt_buffer_pool.c    | 10 ++++----
 drivers/gpu/drm/i915/gt/intel_gt_irq.c        |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   | 10 ++++----
 drivers/gpu/drm/i915/gt/intel_reset.c         |  2 +-
 drivers/gpu/drm/i915/gt/intel_rps.c           | 20 ++++++++--------
 drivers/gpu/drm/i915/gt/selftest_engine_cs.c  |  2 +-
 drivers/gpu/drm/i915/i915_driver.c            | 13 +++++++++++
 drivers/gpu/drm/i915/i915_drv.h               | 10 ++++++++
 drivers/gpu/drm/i915/i915_request.c           |  2 +-
 drivers/gpu/drm/i915/intel_runtime_pm.c       |  2 +-
 drivers/gpu/drm/i915/intel_wakeref.c          | 22 ++++++++++--------
 drivers/gpu/drm/i915/intel_wakeref.h          | 12 +++++-----
 .../gpu/drm/i915/selftests/i915_sw_fence.c    | 16 ++++++++++---
 29 files changed, 136 insertions(+), 77 deletions(-)
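
For completeness, the selftest change in patch 3/3 follows the same
pattern at local scope, along these lines (a sketch; the queue name
and its exact placement in the test are assumptions):

/* selftests/i915_sw_fence.c */
struct workqueue_struct *wq;

wq = alloc_workqueue("i915-sw-fence-selftest", 0, 0);
if (!wq)
	return -ENOMEM;

/* ... exercise the SW fences, queueing work on wq instead of the
 * global system_wq ... */

destroy_workqueue(wq);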

Comments

Jani Nikula June 5, 2023, 3:06 p.m. UTC | #1
On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
> #### Possible regressions ####
>
>   * igt@gem_close_race@basic-process:
>     - fi-blb-e6850:       [PASS][1] -> [ABORT][2]
>    [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-blb-e6850/igt@gem_close_race@basic-process.html
>    [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-blb-e6850/igt@gem_close_race@basic-process.html
>     - fi-hsw-4770:        [PASS][3] -> [ABORT][4]
>    [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-hsw-4770/igt@gem_close_race@basic-process.html
>    [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-hsw-4770/igt@gem_close_race@basic-process.html
>     - fi-elk-e7500:       [PASS][5] -> [ABORT][6]
>    [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-elk-e7500/igt@gem_close_race@basic-process.html
>    [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-elk-e7500/igt@gem_close_race@basic-process.html
>
>   * igt@i915_selftest@live@evict:
>     - bat-adlp-9:         [PASS][7] -> [ABORT][8]
>    [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlp-9/igt@i915_selftest@live@evict.html
>    [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlp-9/igt@i915_selftest@live@evict.html
>     - bat-rpls-2:         [PASS][9] -> [ABORT][10]
>    [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-2/igt@i915_selftest@live@evict.html
>    [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-2/igt@i915_selftest@live@evict.html
>     - bat-adlm-1:         [PASS][11] -> [ABORT][12]
>    [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlm-1/igt@i915_selftest@live@evict.html
>    [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlm-1/igt@i915_selftest@live@evict.html
>     - bat-rpls-1:         [PASS][13] -> [ABORT][14]
>    [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-1/igt@i915_selftest@live@evict.html
>    [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-1/igt@i915_selftest@live@evict.html

This still fails consistently, I have no clue why, and the above aren't
even remotely related to display.

What now? Tvrtko?

BR,
Jani.
Tvrtko Ursulin June 6, 2023, 10:06 a.m. UTC | #2
On 05/06/2023 16:06, Jani Nikula wrote:
> On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
>> [snip: CI regression list quoted in full above]
> 
> This still fails consistently, I have no clue why, and the above aren't
> even remotely related to display.
> 
> What now? Tvrtko?

Hmm..

<4> [46.782321] Chain exists of:
   (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
<4> [46.782329]  Possible unsafe locking scenario:
<4> [46.782332]        CPU0                    CPU1
<4> [46.782334]        ----                    ----
<4> [46.782337]   lock(&vm->mutex);
<4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
<4> [46.782344]                                lock(&vm->mutex);
<4> [46.782348]   lock((wq_completion)i915);


"(wq_completion)i915"

So it's not about the new wq even. Perhaps it is this hunk:

--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
  
  	/* Assume we are not in process context and so cannot sleep. */
  	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
-		mod_delayed_work(system_wq, &wf->work,
+		mod_delayed_work(wf->i915->wq, &wf->work,

The transformation in this patch is otherwise from system_wq to the new unordered wq, so I'd try that first.
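
Spelled out, my reading of the splat above is roughly:

/*
 * CPU0 (execbuf path)            CPU1 ("i915" ordered wq)
 *
 * mutex_lock(&vm->mutex);
 *                                i915->mm.free_work runs, tries
 *                                mutex_lock(&vm->mutex), blocks
 * flush work queued on the same wq
 *   -> it sits behind free_work, and an ordered wq runs only one
 *      item at a time, so neither side can make progress
 */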

Regards,

Tvrtko
Luca Coelho June 6, 2023, 11:06 a.m. UTC | #3
On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
> On 05/06/2023 16:06, Jani Nikula wrote:
> > On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
> > > [snip: CI regression list]
> > 
> > This still fails consistently, I have no clue why, and the above aren't
> > even remotely related to display.
> > 
> > What now? Tvrtko?
> 
> Hmm..
> 
> <4> [46.782321] Chain exists of:
>    (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
> <4> [46.782329]  Possible unsafe locking scenario:
> <4> [46.782332]        CPU0                    CPU1
> <4> [46.782334]        ----                    ----
> <4> [46.782337]   lock(&vm->mutex);
> <4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
> <4> [46.782344]                                lock(&vm->mutex);
> <4> [46.782348]   lock((wq_completion)i915);
> 
> 
> "(wq_completion)i915"
> 
> So it's not about the new wq even. Perhaps it is this hunk:
> 
> --- a/drivers/gpu/drm/i915/intel_wakeref.c
> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
>   
>   	/* Assume we are not in process context and so cannot sleep. */
>   	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
> -		mod_delayed_work(system_wq, &wf->work,
> +		mod_delayed_work(wf->i915->wq, &wf->work,
> 
> The transformation in this patch is otherwise from system_wq to the new unordered wq, so I'd try that first.

Indeed this seems to be exactly the block that is causing the issue.  I
was sort of bisecting through all these changes and reverting this one
prevents the lockdep splat from happening...

So there's something that needs to be synced with the system_wq here,
but what? I need to dig into it.

--
Cheers,
Luca.
Tvrtko Ursulin June 6, 2023, 1:33 p.m. UTC | #4
On 06/06/2023 12:06, Coelho, Luciano wrote:
> On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
>> On 05/06/2023 16:06, Jani Nikula wrote:
>>> On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
>>>> [snip: CI regression list]
>>>
>>> This still fails consistently, I have no clue why, and the above aren't
>>> even remotely related to display.
>>>
>>> What now? Tvrtko?
>>
>> Hmm..
>>
>> <4> [46.782321] Chain exists of:
>>     (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
>> <4> [46.782329]  Possible unsafe locking scenario:
>> <4> [46.782332]        CPU0                    CPU1
>> <4> [46.782334]        ----                    ----
>> <4> [46.782337]   lock(&vm->mutex);
>> <4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
>> <4> [46.782344]                                lock(&vm->mutex);
>> <4> [46.782348]   lock((wq_completion)i915);
>>
>>
>> "(wq_completion)i915"
>>
>> So it's not about the new wq even. Perhaps it is this hunk:
>>
>> --- a/drivers/gpu/drm/i915/intel_wakeref.c
>> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
>> @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
>>    
>>    	/* Assume we are not in process context and so cannot sleep. */
>>    	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
>> -		mod_delayed_work(system_wq, &wf->work,
>> +		mod_delayed_work(wf->i915->wq, &wf->work,
>>
>> The transformation in this patch is otherwise from system_wq to the new unordered wq, so I'd try that first.
> 
> Indeed this seems to be exactly the block that is causing the issue.  I
> was sort of bisecting through all these changes and reverting this one
> prevents the lockdep splat from happening...
> 
> So there's something that needs to be synced with the system_wq here,
> but what? I need to dig into it.

AFAICT it is saying that i915->mm.free_work and engine->wakeref.work 
must not be on the same ordered wq. Otherwise execbuf call trace 
flushing under vm->mutex can deadlock against the free worker trying to 
grab vm->mutex. If engine->wakeref.work is on a separate unordered wq it 
would be safe since then execution will not be serialized with the 
free_work. So just using the new i915->unordered_wq for this hunk should 
work.
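
I.e. something like this for the hunk above (a sketch of the
suggested fix, not a tested diff):

--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
-		mod_delayed_work(wf->i915->wq, &wf->work,
+		mod_delayed_work(wf->i915->unordered_wq, &wf->work,

Since the unordered wq does not serialize its items, the wakeref
worker can then run concurrently with i915->mm.free_work and the
chain above disappears.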

Regards,

Tvrtko
Luca Coelho June 6, 2023, 2:30 p.m. UTC | #5
On Tue, 2023-06-06 at 14:33 +0100, Tvrtko Ursulin wrote:
> On 06/06/2023 12:06, Coelho, Luciano wrote:
> > On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
> > > On 05/06/2023 16:06, Jani Nikula wrote:
> > > > On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
> > > > > [snip: CI regression list]
> > > > 
> > > > This still fails consistently, I have no clue why, and the above aren't
> > > > even remotely related to display.
> > > > 
> > > > What now? Tvrtko?
> > > 
> > > Hmm..
> > > 
> > > <4> [46.782321] Chain exists of:
> > >     (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
> > > <4> [46.782329]  Possible unsafe locking scenario:
> > > <4> [46.782332]        CPU0                    CPU1
> > > <4> [46.782334]        ----                    ----
> > > <4> [46.782337]   lock(&vm->mutex);
> > > <4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
> > > <4> [46.782344]                                lock(&vm->mutex);
> > > <4> [46.782348]   lock((wq_completion)i915);
> > > 
> > > 
> > > "(wq_completion)i915"
> > > 
> > > So it's not about the new wq even. Perhaps it is this hunk:
> > > 
> > > --- a/drivers/gpu/drm/i915/intel_wakeref.c
> > > +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> > > @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
> > >    
> > >    	/* Assume we are not in process context and so cannot sleep. */
> > >    	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
> > > -		mod_delayed_work(system_wq, &wf->work,
> > > +		mod_delayed_work(wf->i915->wq, &wf->work,
> > > 
> > > The transformation in this patch is otherwise from system_wq to the new unordered wq, so I'd try that first.
> > 
> > Indeed this seems to be exactly the block that is causing the issue.  I
> > was sort of bisecting through all these changes and reverting this one
> > prevents the lockdep splat from happening...
> > 
> > So there's something that needs to be synced with the system_wq here,
> > but what? I need to dig into it.
> 
> AFAICT it is saying that i915->mm.free_work and engine->wakeref.work 
> must not be on the same ordered wq. Otherwise execbuf call trace 
> flushing under vm->mutex can deadlock against the free worker trying to 
> grab vm->mutex. If engine->wakeref.work is on a separate unordered wq it 
> would be safe since then execution will not be serialized with the 
> free_work. So just using the new i915->unordered_wq for this hunk should 
> work.

Ah, great, thanks for the insight! I'll try it now and see how it goes.

--
Cheers,
Luca.
Luca Coelho June 6, 2023, 4:22 p.m. UTC | #6
On Tue, 2023-06-06 at 14:30 +0000, Coelho, Luciano wrote:
> On Tue, 2023-06-06 at 14:33 +0100, Tvrtko Ursulin wrote:
> > On 06/06/2023 12:06, Coelho, Luciano wrote:
> > > On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
> > > > On 05/06/2023 16:06, Jani Nikula wrote:
> > > > > On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
> > > > > > [snip: CI regression list]
> > > > > 
> > > > > This still fails consistently, I have no clue why, and the above aren't
> > > > > even remotely related to display.
> > > > > 
> > > > > What now? Tvrtko?
> > > > 
> > > > Hmm..
> > > > 
> > > > <4> [46.782321] Chain exists of:
> > > >     (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
> > > > <4> [46.782329]  Possible unsafe locking scenario:
> > > > <4> [46.782332]        CPU0                    CPU1
> > > > <4> [46.782334]        ----                    ----
> > > > <4> [46.782337]   lock(&vm->mutex);
> > > > <4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
> > > > <4> [46.782344]                                lock(&vm->mutex);
> > > > <4> [46.782348]   lock((wq_completion)i915);
> > > > 
> > > > 
> > > > "(wq_completion)i915"
> > > > 
> > > > So it's not about the new wq even. Perhaps it is this hunk:
> > > > 
> > > > --- a/drivers/gpu/drm/i915/intel_wakeref.c
> > > > +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> > > > @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
> > > >    
> > > >    	/* Assume we are not in process context and so cannot sleep. */
> > > >    	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
> > > > -		mod_delayed_work(system_wq, &wf->work,
> > > > +		mod_delayed_work(wf->i915->wq, &wf->work,
> > > > 
> > > > The transformation in this patch is otherwise from system_wq to the new unordered wq, so I'd try that first.
> > > 
> > > Indeed this seems to be exactly the block that is causing the issue.  I
> > > was sort of bisecting through all these changes and reverting this one
> > > prevents the lockdep splat from happening...
> > > 
> > > So there's something that needs to be synced with the system_wq here,
> > > but what? I need to dig into it.
> > 
> > AFAICT it is saying that i915->mm.free_work and engine->wakeref.work 
> > must not be on the same ordered wq. Otherwise execbuf call trace 
> > flushing under vm->mutex can deadlock against the free worker trying to 
> > grab vm->mutex. If engine->wakeref.work is on a separate unordered wq it 
> > would be safe since then execution will not be serialized with the 
> > free_work. So just using the new i915->unordered_wq for this hunk should 
> > work.
> 
> Ah, great, thanks for the insight! I'll try it now and see how it goes.

This works now.  It was quite obviously wrong, but I was completely
blind to it.  Thanks a lot for the catch, Tvrtko!

v4 coming in a sec.

--
Cheers,
Luca.