[v3,0/8] drm/i915: PREEMPT_RT related fixups.

Message ID	20240628130601.1772849-1-bigeasy@linutronix.de (mailing list archive)
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> To: intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org Cc: Jani Nikula <jani.nikula@linux.intel.com>, Joonas Lahtinen <joonas.lahtinen@linux.intel.com>, Rodrigo Vivi <rodrigo.vivi@intel.com>, Thomas Gleixner <tglx@linutronix.de>, Tvrtko Ursulin <tursulin@ursulin.net> Subject: [PATCH v3 0/8] drm/i915: PREEMPT_RT related fixups. Date: Fri, 28 Jun 2024 14:57:59 +0200 Message-ID: <20240628130601.1772849-1-bigeasy@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Precedence: list Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915: PREEMPT_RT related fixups. \| expand [v3,0/8] drm/i915: PREEMPT_RT related fixups. [v3,1/8] drm/i915: Use preempt_disable/enable_rt() where recommended [v3,2/8] drm/i915: Don't disable interrupts on PREEMPT_RT during atomic updates [v3,3/8] drm/i915: Don't check for atomic context on PREEMPT_RT [v3,4/8] drm/i915: Disable tracing points on PREEMPT_RT [v3,5/8] drm/i915/gt: Use spin_lock_irq() instead of local_irq_disable() + spin_lock() [v3,6/8] drm/i915: Drop the irqs_disabled() check [v3,7/8] drm/i915/guc: Consider also RCU depth in busy loop. [v3,8/8] Revert "drm/i915: Depend on !PREEMPT_RT."

Sebastian Andrzej Siewior June 28, 2024, 12:57 p.m. UTC

Hi,

The following patches are from the PREEMPT_RT queue.  It is mostly about
disabling interrupts/preemption which leads to problems. Unfortunately
DRM_I915_LOW_LEVEL_TRACEPOINTS had to be disabled because it acquires locks
from within trace points. Making the lock a raw_spinlock_t led to higher
latencies during video playback
  https://lore.kernel.org/all/20211006164628.s2mtsdd2jdbfyf7g@linutronix.de/

and I'm not sure if I hit the worse case here.
I tested it on a SandyBridge with built-in i915 by using X, OpenGL and
playing videos without noticing any warnings. However, some code paths
were not entered.
I carry them for some time now and most issues were reported by other
people and they reported that things work for them since.

v2…v3 https://lore.kernel.org/all/20240613102818.4056866-1-bigeasy@linutronix.de/
  - Collected tags.
  - Added comment to 3/8 explaining why RT is excluded from the test.
  
v1…v2:
  - The tracing disable bits (4/8) have been reworked after Steven pointed out
    that something isn't right.
  - The irq_work() bits have been dropped because they are no longer
    needed.


Sebastian

Saarinen, Jani June 28, 2024, 2:03 p.m. UTC | #1

Hi, 
> -----Original Message-----
> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Patchwork
> Sent: Friday, 28 June 2024 16.43
> To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: intel-gfx@lists.freedesktop.org
> Subject: ✗ Fi.CI.BAT: failure for drm/i915: PREEMPT_RT related fixups. (rev12)
> 
> Patch Details
> Series:	drm/i915: PREEMPT_RT related fixups. (rev12)
> URL:	https://patchwork.freedesktop.org/series/95463/
> State:	failure
> Details:	https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/index.html
> 
> CI Bug Log - changes from CI_DRM_15013 -> Patchwork_95463v12
> 
> Summary
> 
> 
> FAILURE
> 
> Serious unknown changes coming with Patchwork_95463v12 absolutely need
> to be verified manually.
> 
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_95463v12, please notify your bug team (I915-ci-
> infra@lists.freedesktop.org) to allow them to document this new failure
> mode, which will reduce false positives in CI.
> 
> External URL: https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/index.html
> 
> 
> Participating hosts (42 -> 33)
> 
> 
> Missing (9): bat-adlp-9 bat-adlp-6 fi-snb-2520m bat-adln-1 fi-elk-e7500 bat-
> dg2-14 bat-dg2-13 bat-dg2-11 bat-arlh-2
^^^this would be good to understand why this many systems down or was this CI system issues? 

> 
> 
> Possible new issues
> 
> 
> Here are the unknown changes that may have been introduced in
> Patchwork_95463v12:
> 
> 
> IGT changes
> 
> 
> Possible regressions
> 
> 
> *	igt@debugfs_test@read_all_entries:
> 
> 	*	fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm-
> tip/CI_DRM_15013/fi-kbl-7567u/igt@debugfs_test@read_all_entries.html>
> -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/fi-kbl-
> 7567u/igt@debugfs_test@read_all_entries.html>  +5 other tests dmesg-warn
This is known issue https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11328 

> 
> 
> Known issues
> 
> 
> Here are the changes found in Patchwork_95463v12 that come from known
> issues:
> 
> 
> IGT changes
> 
> 
> Issues hit
> 
> 
> *	igt@i915_pm_rpm@module-reload:
> 
> 	*	fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm-
> tip/CI_DRM_15013/fi-kbl-7567u/igt@i915_pm_rpm@module-reload.html>  -
> > DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/fi-kbl-7567u/igt@i915_pm_rpm@module-
> reload.html>  (i915#10062 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/10062>  / i915#180 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/180>  / i915#1982 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/1982>  / i915#9925 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/9925> )
> 
> *	igt@i915_selftest@live@sanitycheck:
> 
> 	*	fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm-
> tip/CI_DRM_15013/fi-kbl-7567u/igt@i915_selftest@live@sanitycheck.html>
> -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/fi-kbl-
> 7567u/igt@i915_selftest@live@sanitycheck.html>  (i915#11328
> <https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11328> ) +40 other
> tests dmesg-warn
> 
> *	igt@kms_busy@basic@flip:
> 
> 	*	fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm-
> tip/CI_DRM_15013/fi-kbl-7567u/igt@kms_busy@basic@flip.html>  ->
> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/fi-kbl-7567u/igt@kms_busy@basic@flip.html>
> (i915#180 <https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/180> )
> 
> *	igt@kms_frontbuffer_tracking@basic:
> 
> 	*	bat-arls-2: PASS <https://intel-gfx-ci.01.org/tree/drm-
> tip/CI_DRM_15013/bat-arls-2/igt@kms_frontbuffer_tracking@basic.html>  -
> > DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/bat-arls-
> 2/igt@kms_frontbuffer_tracking@basic.html>  (i915#7507
> <https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/7507> )
> 
> *	igt@kms_pm_rpm@basic-pci-d3-state:
> 
> 	*	fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm-
> tip/CI_DRM_15013/fi-kbl-7567u/igt@kms_pm_rpm@basic-pci-d3-
> state.html>  -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-
> tip/Patchwork_95463v12/fi-kbl-7567u/igt@kms_pm_rpm@basic-pci-d3-
> state.html>  (i915#10062 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/10062>  / i915#180 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/180>  / i915#9925 <https://gitlab.freedesktop.org/drm/i915/kernel/-
> /issues/9925> ) +39 other tests dmesg-warn
> 
> 
> Build changes
> 
> 
> *	Linux: CI_DRM_15013 -> Patchwork_95463v12
> 
> CI-20190529: 20190529
> CI_DRM_15013: 0318a12ff6fb8c321458aa2b373e9322896ee951 @
> git://anongit.freedesktop.org/gfx-ci/linux
> IGT_7906: ae91ba26f657bf11264f64bd2dc21f471a5d18f5 @
> https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
> Patchwork_95463v12: 0318a12ff6fb8c321458aa2b373e9322896ee951 @
> git://anongit.freedesktop.org/gfx-ci/linux

Sebastian Andrzej Siewior Oct. 2, 2024, 4:25 p.m. UTC | #2

On 2024-06-28 14:57:59 [+0200], To intel-gfx@lists.freedesktop.org wrote:
Hi,

> The following patches are from the PREEMPT_RT queue.  It is mostly about
> disabling interrupts/preemption which leads to problems. Unfortunately
> DRM_I915_LOW_LEVEL_TRACEPOINTS had to be disabled because it acquires locks
> from within trace points. Making the lock a raw_spinlock_t led to higher
> latencies during video playback
>   https://lore.kernel.org/all/20211006164628.s2mtsdd2jdbfyf7g@linutronix.de/
> 
> and I'm not sure if I hit the worse case here.
> I tested it on a SandyBridge with built-in i915 by using X, OpenGL and
> playing videos without noticing any warnings. However, some code paths
> were not entered.
> I carry them for some time now and most issues were reported by other
> people and they reported that things work for them since.

These patches were not picked. Did I forget something or was this just
overseen?

Sebastian

Ville Syrjälä Oct. 2, 2024, 4:58 p.m. UTC | #3

On Wed, Oct 02, 2024 at 06:25:43PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-06-28 14:57:59 [+0200], To intel-gfx@lists.freedesktop.org wrote:
> Hi,
> 
> > The following patches are from the PREEMPT_RT queue.  It is mostly about
> > disabling interrupts/preemption which leads to problems. Unfortunately
> > DRM_I915_LOW_LEVEL_TRACEPOINTS had to be disabled because it acquires locks
> > from within trace points. Making the lock a raw_spinlock_t led to higher
> > latencies during video playback
> >   https://lore.kernel.org/all/20211006164628.s2mtsdd2jdbfyf7g@linutronix.de/
> > 
> > and I'm not sure if I hit the worse case here.
> > I tested it on a SandyBridge with built-in i915 by using X, OpenGL and
> > playing videos without noticing any warnings. However, some code paths
> > were not entered.
> > I carry them for some time now and most issues were reported by other
> > people and they reported that things work for them since.
> 
> These patches were not picked. Did I forget something or was this just
> overseen?

This looks quite poorly justified. Eg. you seem to be now
leaving interrupts enabled (and even preemption enabled I
guess) when we're racing against the raster beam. On first
blush that seems like a recipe for failure.

First step would be to set CONFIG_DRM_I915_DEBUG_VBLANK_EVADE=y,
run a bunch of tests that stress the display stuff (eg.
kms_atomic_transitions and other stuff from igt, and also
some real workloads) and probably throw in a bunch of
other load/perturbance at the system to make life hard.
After the system has been sufficiently hammered one can
compare sys/kernel/debug/dri/0/crtc-*/i915_update_info
against a baseline. Bonus points for doing it on a potato.

Sebastian Andrzej Siewior Oct. 4, 2024, 6:49 a.m. UTC | #4

On 2024-10-02 19:58:08 [+0300], Ville Syrjälä wrote:
> > These patches were not picked. Did I forget something or was this just
> > overseen?
> 
> This looks quite poorly justified. Eg. you seem to be now
> leaving interrupts enabled (and even preemption enabled I
> guess) when we're racing against the raster beam. On first
> blush that seems like a recipe for failure.

I can't really parse this. I may leave interrupts enabled in vblanc code
(the first two patches).

> First step would be to set CONFIG_DRM_I915_DEBUG_VBLANK_EVADE=y,
> run a bunch of tests that stress the display stuff (eg.
> kms_atomic_transitions and other stuff from igt, and also
> some real workloads) and probably throw in a bunch of
> other load/perturbance at the system to make life hard.
> After the system has been sufficiently hammered one can
> compare sys/kernel/debug/dri/0/crtc-*/i915_update_info
> against a baseline. Bonus points for doing it on a potato.

So you have a specific test for me to run?
 
Sebastian

Ville Syrjälä Oct. 4, 2024, 8:31 a.m. UTC | #5

On Fri, Oct 04, 2024 at 08:49:51AM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-10-02 19:58:08 [+0300], Ville Syrjälä wrote:
> > > These patches were not picked. Did I forget something or was this just
> > > overseen?
> > 
> > This looks quite poorly justified. Eg. you seem to be now
> > leaving interrupts enabled (and even preemption enabled I
> > guess) when we're racing against the raster beam. On first
> > blush that seems like a recipe for failure.
> 
> I can't really parse this. I may leave interrupts enabled in vblanc code
> (the first two patches).

vblank evasion 101:
We need to write a boatload of registers atomically within one 
frame. We try to guarantee that by checking if we're too close 
the point at which the hardware latches said registers, and if 
so we defer the update into the next frame.

raster scan/time ->
.....vertical active.........><...vertical blank...><...vertical active...><...vertical blank
     <VBLANK_EVASION_TIME_US> ^^                                          ^
            ^                 ||                                          |
            |   registers are /|                            registers are /
            |   latched here   |                             latched here
            |                  |
    If we're somewhere         |
    in here we delay           |
    writing the registers...   ...until we are about here and then we have most
                               of the frame to write all the registers. Though
                               since we use an interrupt to wait for this point
                               interrupt latency does eat into the budget a bit

.....vertical active.........><..vertical blank...><...
     <VBLANK_EVASION_TIME_US> ^
  <-.                         |
    |           registers are /
    |           latched here
    |
    If we're somewhere to the left of this
    point we proceed to write the registers
    immediately and now in the worst case we
    have exactly VBLANK_EVASION_TIME_US to
    write all the registers

So once vblank evasion has declared things to be safe we might have
as short a time as VBLANK_EVASION_TIME_US to write all the registers.
If the CPU gets stolen from us at that point we can no longer guarantee
anything. The magic value has been tuned empirically over the years,
until we've found something that seems to work well enough, without
being too long to negatively affect performance.

Sebastian Andrzej Siewior Oct. 4, 2024, 8:45 a.m. UTC | #6

On 2024-10-04 11:31:22 [+0300], Ville Syrjälä wrote:
> 
> So once vblank evasion has declared things to be safe we might have
> as short a time as VBLANK_EVASION_TIME_US to write all the registers.
> If the CPU gets stolen from us at that point we can no longer guarantee
> anything. The magic value has been tuned empirically over the years,
> until we've found something that seems to work well enough, without
> being too long to negatively affect performance.

what happens if this gets delayed? Just flicker or worse?

Is this something that affects all i915 based HW or only old ones? As
far as I remember, there is a register lock which is only required on
older HW.

Sebastian

Ville Syrjälä Oct. 4, 2024, 9:07 a.m. UTC | #7

On Fri, Oct 04, 2024 at 10:45:25AM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-10-04 11:31:22 [+0300], Ville Syrjälä wrote:
> > 
> > So once vblank evasion has declared things to be safe we might have
> > as short a time as VBLANK_EVASION_TIME_US to write all the registers.
> > If the CPU gets stolen from us at that point we can no longer guarantee
> > anything. The magic value has been tuned empirically over the years,
> > until we've found something that seems to work well enough, without
> > being too long to negatively affect performance.
> 
> what happens if this gets delayed? Just flicker or worse?

In the best best case it just gets you a corrupted frame
of some sort, in the worst case the hardware falls over.
Depends on what kind of update is happening, and what
platform we're dealing with.

We've tried to mitigate some of the worst issues by
trying to order the register writes more carefully,
but some of the ordering constraints (eg. scalers vs.
DDB) are more or less in conflict with each other
so making it 100% safe seems impossible.

> 
> Is this something that affects all i915 based HW or only old ones? As
> far as I remember, there is a register lock which is only required on
> older HW.

Currently it affects everything. There is a new double buffer
latching inhibit bit on some of the very latest platforms that
we could probably use to make things more safe if vblank evasion
fails, but we've not hooked that up. But vblank evasion would still
be necessary at least for cursor updates since those are
done as mailbox style updates (ie. multiple updates per frame)
and there is no way to guarantee forward progress without vblank
evasion.

Register access locks aren't relevant here, and most
register accesses in the vblank evade critical section
are lockless anyway. The locks were too expensive and we
determined that we an safely use lockless accesses here.

Sebastian Andrzej Siewior Oct. 4, 2024, 10:44 a.m. UTC | #8

On 2024-10-04 12:07:24 [+0300], Ville Syrjälä wrote:
> > what happens if this gets delayed? Just flicker or worse?
> 
> In the best best case it just gets you a corrupted frame
> of some sort, in the worst case the hardware falls over.
> Depends on what kind of update is happening, and what
> platform we're dealing with.
> 
> We've tried to mitigate some of the worst issues by
> trying to order the register writes more carefully,
> but some of the ordering constraints (eg. scalers vs.
> DDB) are more or less in conflict with each other
> so making it 100% safe seems impossible.

So based on what you are saying, this _has_ to happen with disabled
interrupts.

> > 
> > Is this something that affects all i915 based HW or only old ones? As
> > far as I remember, there is a register lock which is only required on
> > older HW.
> 
> Currently it affects everything. There is a new double buffer
> latching inhibit bit on some of the very latest platforms that
> we could probably use to make things more safe if vblank evasion
> fails, but we've not hooked that up. But vblank evasion would still
> be necessary at least for cursor updates since those are
> done as mailbox style updates (ie. multiple updates per frame)
> and there is no way to guarantee forward progress without vblank
> evasion.

This sounds sad. Especially since the delay is at 100us.

> Register access locks aren't relevant here, and most
> register accesses in the vblank evade critical section
> are lockless anyway. The locks were too expensive and we
> determined that we an safely use lockless accesses here.

The register lock question was only an example of something that is not
required on newish hardware. Also disabling interrupts within in patch
1, 2 would require to make uncore:lock a raw_spinlock_t since it is
acquire there.

What do suggest as in how do we move forward? In terms of testing, I
have here an i7 sandy bridge with embedded GPU.

Sebastian

[v3,0/8] drm/i915: PREEMPT_RT related fixups.

Message

Comments