Message ID | 20240628130601.1772849-1-bigeasy@linutronix.de (mailing list archive) |
---|---|
Headers | show |
Series | drm/i915: PREEMPT_RT related fixups. | expand |
Hi, > -----Original Message----- > From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of > Patchwork > Sent: Friday, 28 June 2024 16.43 > To: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > Cc: intel-gfx@lists.freedesktop.org > Subject: ✗ Fi.CI.BAT: failure for drm/i915: PREEMPT_RT related fixups. (rev12) > > Patch Details > Series: drm/i915: PREEMPT_RT related fixups. (rev12) > URL: https://patchwork.freedesktop.org/series/95463/ > State: failure > Details: https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/index.html > > CI Bug Log - changes from CI_DRM_15013 -> Patchwork_95463v12 > > Summary > > > FAILURE > > Serious unknown changes coming with Patchwork_95463v12 absolutely need > to be verified manually. > > If you think the reported changes have nothing to do with the changes > introduced in Patchwork_95463v12, please notify your bug team (I915-ci- > infra@lists.freedesktop.org) to allow them to document this new failure > mode, which will reduce false positives in CI. > > External URL: https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/index.html > > > Participating hosts (42 -> 33) > > > Missing (9): bat-adlp-9 bat-adlp-6 fi-snb-2520m bat-adln-1 fi-elk-e7500 bat- > dg2-14 bat-dg2-13 bat-dg2-11 bat-arlh-2 ^^^this would be good to understand why this many systems down or was this CI system issues? > > > Possible new issues > > > Here are the unknown changes that may have been introduced in > Patchwork_95463v12: > > > IGT changes > > > Possible regressions > > > * igt@debugfs_test@read_all_entries: > > * fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm- > tip/CI_DRM_15013/fi-kbl-7567u/igt@debugfs_test@read_all_entries.html> > -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/fi-kbl- > 7567u/igt@debugfs_test@read_all_entries.html> +5 other tests dmesg-warn This is known issue https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11328 > > > Known issues > > > Here are the changes found in Patchwork_95463v12 that come from known > issues: > > > IGT changes > > > Issues hit > > > * igt@i915_pm_rpm@module-reload: > > * fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm- > tip/CI_DRM_15013/fi-kbl-7567u/igt@i915_pm_rpm@module-reload.html> - > > DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/fi-kbl-7567u/igt@i915_pm_rpm@module- > reload.html> (i915#10062 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/10062> / i915#180 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/180> / i915#1982 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/1982> / i915#9925 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/9925> ) > > * igt@i915_selftest@live@sanitycheck: > > * fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm- > tip/CI_DRM_15013/fi-kbl-7567u/igt@i915_selftest@live@sanitycheck.html> > -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/fi-kbl- > 7567u/igt@i915_selftest@live@sanitycheck.html> (i915#11328 > <https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/11328> ) +40 other > tests dmesg-warn > > * igt@kms_busy@basic@flip: > > * fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm- > tip/CI_DRM_15013/fi-kbl-7567u/igt@kms_busy@basic@flip.html> -> > DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/fi-kbl-7567u/igt@kms_busy@basic@flip.html> > (i915#180 <https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/180> ) > > * igt@kms_frontbuffer_tracking@basic: > > * bat-arls-2: PASS <https://intel-gfx-ci.01.org/tree/drm- > tip/CI_DRM_15013/bat-arls-2/igt@kms_frontbuffer_tracking@basic.html> - > > DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/bat-arls- > 2/igt@kms_frontbuffer_tracking@basic.html> (i915#7507 > <https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/7507> ) > > * igt@kms_pm_rpm@basic-pci-d3-state: > > * fi-kbl-7567u: PASS <https://intel-gfx-ci.01.org/tree/drm- > tip/CI_DRM_15013/fi-kbl-7567u/igt@kms_pm_rpm@basic-pci-d3- > state.html> -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm- > tip/Patchwork_95463v12/fi-kbl-7567u/igt@kms_pm_rpm@basic-pci-d3- > state.html> (i915#10062 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/10062> / i915#180 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/180> / i915#9925 <https://gitlab.freedesktop.org/drm/i915/kernel/- > /issues/9925> ) +39 other tests dmesg-warn > > > Build changes > > > * Linux: CI_DRM_15013 -> Patchwork_95463v12 > > CI-20190529: 20190529 > CI_DRM_15013: 0318a12ff6fb8c321458aa2b373e9322896ee951 @ > git://anongit.freedesktop.org/gfx-ci/linux > IGT_7906: ae91ba26f657bf11264f64bd2dc21f471a5d18f5 @ > https://gitlab.freedesktop.org/drm/igt-gpu-tools.git > Patchwork_95463v12: 0318a12ff6fb8c321458aa2b373e9322896ee951 @ > git://anongit.freedesktop.org/gfx-ci/linux
On 2024-06-28 14:57:59 [+0200], To intel-gfx@lists.freedesktop.org wrote: Hi, > The following patches are from the PREEMPT_RT queue. It is mostly about > disabling interrupts/preemption which leads to problems. Unfortunately > DRM_I915_LOW_LEVEL_TRACEPOINTS had to be disabled because it acquires locks > from within trace points. Making the lock a raw_spinlock_t led to higher > latencies during video playback > https://lore.kernel.org/all/20211006164628.s2mtsdd2jdbfyf7g@linutronix.de/ > > and I'm not sure if I hit the worse case here. > I tested it on a SandyBridge with built-in i915 by using X, OpenGL and > playing videos without noticing any warnings. However, some code paths > were not entered. > I carry them for some time now and most issues were reported by other > people and they reported that things work for them since. These patches were not picked. Did I forget something or was this just overseen? Sebastian
On Wed, Oct 02, 2024 at 06:25:43PM +0200, Sebastian Andrzej Siewior wrote: > On 2024-06-28 14:57:59 [+0200], To intel-gfx@lists.freedesktop.org wrote: > Hi, > > > The following patches are from the PREEMPT_RT queue. It is mostly about > > disabling interrupts/preemption which leads to problems. Unfortunately > > DRM_I915_LOW_LEVEL_TRACEPOINTS had to be disabled because it acquires locks > > from within trace points. Making the lock a raw_spinlock_t led to higher > > latencies during video playback > > https://lore.kernel.org/all/20211006164628.s2mtsdd2jdbfyf7g@linutronix.de/ > > > > and I'm not sure if I hit the worse case here. > > I tested it on a SandyBridge with built-in i915 by using X, OpenGL and > > playing videos without noticing any warnings. However, some code paths > > were not entered. > > I carry them for some time now and most issues were reported by other > > people and they reported that things work for them since. > > These patches were not picked. Did I forget something or was this just > overseen? This looks quite poorly justified. Eg. you seem to be now leaving interrupts enabled (and even preemption enabled I guess) when we're racing against the raster beam. On first blush that seems like a recipe for failure. First step would be to set CONFIG_DRM_I915_DEBUG_VBLANK_EVADE=y, run a bunch of tests that stress the display stuff (eg. kms_atomic_transitions and other stuff from igt, and also some real workloads) and probably throw in a bunch of other load/perturbance at the system to make life hard. After the system has been sufficiently hammered one can compare sys/kernel/debug/dri/0/crtc-*/i915_update_info against a baseline. Bonus points for doing it on a potato.
On 2024-10-02 19:58:08 [+0300], Ville Syrjälä wrote: > > These patches were not picked. Did I forget something or was this just > > overseen? > > This looks quite poorly justified. Eg. you seem to be now > leaving interrupts enabled (and even preemption enabled I > guess) when we're racing against the raster beam. On first > blush that seems like a recipe for failure. I can't really parse this. I may leave interrupts enabled in vblanc code (the first two patches). > First step would be to set CONFIG_DRM_I915_DEBUG_VBLANK_EVADE=y, > run a bunch of tests that stress the display stuff (eg. > kms_atomic_transitions and other stuff from igt, and also > some real workloads) and probably throw in a bunch of > other load/perturbance at the system to make life hard. > After the system has been sufficiently hammered one can > compare sys/kernel/debug/dri/0/crtc-*/i915_update_info > against a baseline. Bonus points for doing it on a potato. So you have a specific test for me to run? Sebastian
On Fri, Oct 04, 2024 at 08:49:51AM +0200, Sebastian Andrzej Siewior wrote: > On 2024-10-02 19:58:08 [+0300], Ville Syrjälä wrote: > > > These patches were not picked. Did I forget something or was this just > > > overseen? > > > > This looks quite poorly justified. Eg. you seem to be now > > leaving interrupts enabled (and even preemption enabled I > > guess) when we're racing against the raster beam. On first > > blush that seems like a recipe for failure. > > I can't really parse this. I may leave interrupts enabled in vblanc code > (the first two patches). vblank evasion 101: We need to write a boatload of registers atomically within one frame. We try to guarantee that by checking if we're too close the point at which the hardware latches said registers, and if so we defer the update into the next frame. raster scan/time -> .....vertical active.........><...vertical blank...><...vertical active...><...vertical blank <VBLANK_EVASION_TIME_US> ^^ ^ ^ || | | registers are /| registers are / | latched here | latched here | | If we're somewhere | in here we delay | writing the registers... ...until we are about here and then we have most of the frame to write all the registers. Though since we use an interrupt to wait for this point interrupt latency does eat into the budget a bit .....vertical active.........><..vertical blank...><... <VBLANK_EVASION_TIME_US> ^ <-. | | registers are / | latched here | If we're somewhere to the left of this point we proceed to write the registers immediately and now in the worst case we have exactly VBLANK_EVASION_TIME_US to write all the registers So once vblank evasion has declared things to be safe we might have as short a time as VBLANK_EVASION_TIME_US to write all the registers. If the CPU gets stolen from us at that point we can no longer guarantee anything. The magic value has been tuned empirically over the years, until we've found something that seems to work well enough, without being too long to negatively affect performance.
On 2024-10-04 11:31:22 [+0300], Ville Syrjälä wrote: > > So once vblank evasion has declared things to be safe we might have > as short a time as VBLANK_EVASION_TIME_US to write all the registers. > If the CPU gets stolen from us at that point we can no longer guarantee > anything. The magic value has been tuned empirically over the years, > until we've found something that seems to work well enough, without > being too long to negatively affect performance. what happens if this gets delayed? Just flicker or worse? Is this something that affects all i915 based HW or only old ones? As far as I remember, there is a register lock which is only required on older HW. Sebastian
On Fri, Oct 04, 2024 at 10:45:25AM +0200, Sebastian Andrzej Siewior wrote: > On 2024-10-04 11:31:22 [+0300], Ville Syrjälä wrote: > > > > So once vblank evasion has declared things to be safe we might have > > as short a time as VBLANK_EVASION_TIME_US to write all the registers. > > If the CPU gets stolen from us at that point we can no longer guarantee > > anything. The magic value has been tuned empirically over the years, > > until we've found something that seems to work well enough, without > > being too long to negatively affect performance. > > what happens if this gets delayed? Just flicker or worse? In the best best case it just gets you a corrupted frame of some sort, in the worst case the hardware falls over. Depends on what kind of update is happening, and what platform we're dealing with. We've tried to mitigate some of the worst issues by trying to order the register writes more carefully, but some of the ordering constraints (eg. scalers vs. DDB) are more or less in conflict with each other so making it 100% safe seems impossible. > > Is this something that affects all i915 based HW or only old ones? As > far as I remember, there is a register lock which is only required on > older HW. Currently it affects everything. There is a new double buffer latching inhibit bit on some of the very latest platforms that we could probably use to make things more safe if vblank evasion fails, but we've not hooked that up. But vblank evasion would still be necessary at least for cursor updates since those are done as mailbox style updates (ie. multiple updates per frame) and there is no way to guarantee forward progress without vblank evasion. Register access locks aren't relevant here, and most register accesses in the vblank evade critical section are lockless anyway. The locks were too expensive and we determined that we an safely use lockless accesses here.
On 2024-10-04 12:07:24 [+0300], Ville Syrjälä wrote: > > what happens if this gets delayed? Just flicker or worse? > > In the best best case it just gets you a corrupted frame > of some sort, in the worst case the hardware falls over. > Depends on what kind of update is happening, and what > platform we're dealing with. > > We've tried to mitigate some of the worst issues by > trying to order the register writes more carefully, > but some of the ordering constraints (eg. scalers vs. > DDB) are more or less in conflict with each other > so making it 100% safe seems impossible. So based on what you are saying, this _has_ to happen with disabled interrupts. > > > > Is this something that affects all i915 based HW or only old ones? As > > far as I remember, there is a register lock which is only required on > > older HW. > > Currently it affects everything. There is a new double buffer > latching inhibit bit on some of the very latest platforms that > we could probably use to make things more safe if vblank evasion > fails, but we've not hooked that up. But vblank evasion would still > be necessary at least for cursor updates since those are > done as mailbox style updates (ie. multiple updates per frame) > and there is no way to guarantee forward progress without vblank > evasion. This sounds sad. Especially since the delay is at 100us. > Register access locks aren't relevant here, and most > register accesses in the vblank evade critical section > are lockless anyway. The locks were too expensive and we > determined that we an safely use lockless accesses here. The register lock question was only an example of something that is not required on newish hardware. Also disabling interrupts within in patch 1, 2 would require to make uncore:lock a raw_spinlock_t since it is acquire there. What do suggest as in how do we move forward? In terms of testing, I have here an i7 sandy bridge with embedded GPU. Sebastian