Message ID: 20230530111534.871403-1-luciano.coelho@intel.com
Series: drm/i915: implement internal workqueues
On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
> #### Possible regressions ####
>
> * igt@gem_close_race@basic-process:
>     - fi-blb-e6850:       [PASS][1] -> [ABORT][2]
>    [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-blb-e6850/igt@gem_close_race@basic-process.html
>    [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-blb-e6850/igt@gem_close_race@basic-process.html
>     - fi-hsw-4770:        [PASS][3] -> [ABORT][4]
>    [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-hsw-4770/igt@gem_close_race@basic-process.html
>    [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-hsw-4770/igt@gem_close_race@basic-process.html
>     - fi-elk-e7500:       [PASS][5] -> [ABORT][6]
>    [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-elk-e7500/igt@gem_close_race@basic-process.html
>    [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-elk-e7500/igt@gem_close_race@basic-process.html
>
> * igt@i915_selftest@live@evict:
>     - bat-adlp-9:         [PASS][7] -> [ABORT][8]
>    [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlp-9/igt@i915_selftest@live@evict.html
>    [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlp-9/igt@i915_selftest@live@evict.html
>     - bat-rpls-2:         [PASS][9] -> [ABORT][10]
>    [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-2/igt@i915_selftest@live@evict.html
>    [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-2/igt@i915_selftest@live@evict.html
>     - bat-adlm-1:         [PASS][11] -> [ABORT][12]
>    [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlm-1/igt@i915_selftest@live@evict.html
>    [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlm-1/igt@i915_selftest@live@evict.html
>     - bat-rpls-1:         [PASS][13] -> [ABORT][14]
>    [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-1/igt@i915_selftest@live@evict.html
>    [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-1/igt@i915_selftest@live@evict.html

This still fails consistently, I have no clue why, and the above aren't
even remotely related to display.

What now? Tvrtko?

BR,
Jani.
On 05/06/2023 16:06, Jani Nikula wrote:
> On Wed, 31 May 2023, Patchwork <patchwork@emeril.freedesktop.org> wrote:
>> #### Possible regressions ####
>>
>> [snip CI results quoted in full above]
>
> This still fails consistently, I have no clue why, and the above aren't
> even remotely related to display.
>
> What now? Tvrtko?

Hmm..

<4> [46.782321] Chain exists of:
                  (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
<4> [46.782329]  Possible unsafe locking scenario:
<4> [46.782332]        CPU0                    CPU1
<4> [46.782334]        ----                    ----
<4> [46.782337]   lock(&vm->mutex);
<4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
<4> [46.782344]                                lock(&vm->mutex);
<4> [46.782348]   lock((wq_completion)i915);

"(wq_completion)i915"

So it's not about the new wq even. Perhaps it is this hunk:

--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)

 	/* Assume we are not in process context and so cannot sleep. */
 	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
-		mod_delayed_work(system_wq, &wf->work,
+		mod_delayed_work(wf->i915->wq, &wf->work,

The transformation in this patch is otherwise system_wq to the new
unordered wq, so I'd try that first.

Regards,

Tvrtko
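[For readers following along: the distinction Tvrtko is drawing is between an ordered workqueue, which executes at most one work item at a time in queueing order, and an unordered one, where items may run concurrently. A minimal sketch using the standard kernel workqueue API; the queue names and the init function are illustrative assumptions, not the driver's actual allocation sites:

#include <linux/workqueue.h>

static struct workqueue_struct *ordered_wq;   /* stands in for i915->wq */
static struct workqueue_struct *unordered_wq; /* stands in for the series' new queue */

static int __init example_wq_init(void)
{
	/*
	 * Ordered: at most one item executes at a time, in queueing
	 * order, so anything queued behind a stalled item also stalls.
	 * This whole-queue serialization is what lockdep models as the
	 * "(wq_completion)i915" pseudo-lock in the splat above.
	 */
	ordered_wq = alloc_ordered_workqueue("example-ordered", 0);
	if (!ordered_wq)
		return -ENOMEM;

	/*
	 * Unordered: up to max_active items (0 selects the default) may
	 * run concurrently, so one item is never serialized behind
	 * another.
	 */
	unordered_wq = alloc_workqueue("example-unordered", 0, 0);
	if (!unordered_wq) {
		destroy_workqueue(ordered_wq);
		return -ENOMEM;
	}

	return 0;
}
]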
On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
> On 05/06/2023 16:06, Jani Nikula wrote:
> > [snip CI results quoted above]
> >
> > This still fails consistently, I have no clue why, and the above aren't
> > even remotely related to display.
> >
> > What now? Tvrtko?
>
> [snip lockdep splat quoted above]
>
> So it's not about the new wq even. Perhaps it is this hunk:
>
> --- a/drivers/gpu/drm/i915/intel_wakeref.c
> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
>
>  	/* Assume we are not in process context and so cannot sleep. */
>  	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
> -		mod_delayed_work(system_wq, &wf->work,
> +		mod_delayed_work(wf->i915->wq, &wf->work,
>
> The transformation in this patch is otherwise system_wq to the new
> unordered wq, so I'd try that first.

Indeed this seems to be exactly the block that is causing the issue. I
was sort of bisecting through all these changes and reverting this one
prevents the lockdep splat from happening...

So there's something that needs to be synced with the system_wq here,
but what? I need to dig into it.

--
Cheers,
Luca.
On 06/06/2023 12:06, Coelho, Luciano wrote:
> On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
>> [snip]
>>
>> So it's not about the new wq even. Perhaps it is this hunk:
>>
>> --- a/drivers/gpu/drm/i915/intel_wakeref.c
>> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
>> @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
>>
>>  	/* Assume we are not in process context and so cannot sleep. */
>>  	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
>> -		mod_delayed_work(system_wq, &wf->work,
>> +		mod_delayed_work(wf->i915->wq, &wf->work,
>>
>> The transformation in this patch is otherwise system_wq to the new
>> unordered wq, so I'd try that first.
>
> Indeed this seems to be exactly the block that is causing the issue. I
> was sort of bisecting through all these changes and reverting this one
> prevents the lockdep splat from happening...
>
> So there's something that needs to be synced with the system_wq here,
> but what? I need to dig into it.

AFAICT it is saying that i915->mm.free_work and engine->wakeref.work
must not be on the same ordered wq. Otherwise an execbuf call path that
flushes work under vm->mutex can deadlock against the free worker
trying to grab vm->mutex. If engine->wakeref.work is on a separate
unordered wq it is safe, since then its execution will not be
serialized with free_work. So just using the new i915->unordered_wq in
this hunk should work.

Regards,

Tvrtko
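[Concretely, Tvrtko's suggestion amounts to the following shape for the hunk quoted above. A sketch only: `delay` stands in for the call's third argument, which is truncated in the quoted diff, and i915->unordered_wq is the queue this series introduces:

	/* Assume we are not in process context and so cannot sleep. */
	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
		/*
		 * Queue on the driver's unordered wq rather than the
		 * ordered i915->wq, where this work could be serialized
		 * behind i915->mm.free_work (which takes vm->mutex),
		 * completing the cycle lockdep reported.
		 */
		mod_delayed_work(wf->i915->unordered_wq, &wf->work, delay);
		return;
	}
]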
On Tue, 2023-06-06 at 14:33 +0100, Tvrtko Ursulin wrote:
> On 06/06/2023 12:06, Coelho, Luciano wrote:
> > [snip]
> >
> > Indeed this seems to be exactly the block that is causing the issue. I
> > was sort of bisecting through all these changes and reverting this one
> > prevents the lockdep splat from happening...
> >
> > So there's something that needs to be synced with the system_wq here,
> > but what? I need to dig into it.
>
> AFAICT it is saying that i915->mm.free_work and engine->wakeref.work
> must not be on the same ordered wq. Otherwise an execbuf call path that
> flushes work under vm->mutex can deadlock against the free worker
> trying to grab vm->mutex. If engine->wakeref.work is on a separate
> unordered wq it is safe, since then its execution will not be
> serialized with free_work. So just using the new i915->unordered_wq in
> this hunk should work.

Ah, great, thanks for the insight! I'll try it now and see how it goes.

--
Cheers,
Luca.
On Tue, 2023-06-06 at 14:30 +0000, Coelho, Luciano wrote:
> On Tue, 2023-06-06 at 14:33 +0100, Tvrtko Ursulin wrote:
> > [snip]
> >
> > AFAICT it is saying that i915->mm.free_work and engine->wakeref.work
> > must not be on the same ordered wq. [...] So just using the new
> > i915->unordered_wq in this hunk should work.
>
> Ah, great, thanks for the insight! I'll try it now and see how it goes.

This works now. It was quite obviously wrong, but I was completely
blind to it. Thanks a lot for the catch, Tvrtko! v4 coming in a sec.

--
Cheers,
Luca.