[v4,0/2] Avoid reading OA reports before they land

Message ID 20230605193923.1836048-1-umesh.nerlige.ramappa@intel.com (mailing list archive)

Message

Umesh Nerlige Ramappa June 5, 2023, 7:39 p.m. UTC
Fix an OA issue seen on DG2 where parts of OA reports are zeroed out or
have stale values. The cause was that the rewind logic was not run once
the tail pointer had aged. The series drops the complex aging/aged logic
and instead checks the reports themselves for validity.
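
For readers who do not have the driver open, the idea of checking reports
for validity can be sketched in user-space C roughly as below. This is an
illustrative sketch only, not the code from the patches: the buffer and
report sizes, the assumed report layout (dword 0 = report id, dword 1 =
timestamp), and the helper names oa_taken(), report_landed() and
find_valid_tail() are all hypothetical. It assumes consumed reports are
zeroed, so a report whose first two dwords are both zero has not landed yet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes for illustration only. */
#define BUF_SIZE    1024u	/* power of two, so offsets wrap with a mask */
#define REPORT_SIZE 64u

/* Bytes available between head and tail in the circular buffer. */
static uint32_t oa_taken(uint32_t tail, uint32_t head)
{
	return (tail - head) & (BUF_SIZE - 1);
}

/*
 * Assumed layout: dword 0 is the report id, dword 1 the timestamp.
 * A report is treated as landed if either dword is non-zero.
 */
static int report_landed(const uint8_t *buf, uint32_t offset)
{
	uint32_t dw[2];

	memcpy(dw, buf + offset, sizeof(dw));
	return dw[0] != 0 || dw[1] != 0;
}

/*
 * Walk back from the hardware tail toward the last known-good tail,
 * one report at a time, until the newest report before the tail has
 * landed. Returns the offset up to which reports are safe to read.
 */
static uint32_t find_valid_tail(const uint8_t *buf, uint32_t hw_tail,
				uint32_t old_tail)
{
	/* Ignore a partially written report at the end. */
	uint32_t tail = hw_tail - (hw_tail % REPORT_SIZE);

	assert(old_tail % REPORT_SIZE == 0);

	while (oa_taken(tail, old_tail) >= REPORT_SIZE) {
		uint32_t last = (tail - REPORT_SIZE) & (BUF_SIZE - 1);

		if (report_landed(buf, last))
			break;

		tail = last;	/* still zeroed: rewind past it */
	}

	return tail;
}
```

With reports landed at offsets 0 and 64, a still-zeroed slot at 128, and a
hardware tail of 202, the sketch returns 128, so only the first two reports
would be exposed to the reader.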

rev1 - https://patchwork.freedesktop.org/series/118054/
v2: Drop aging logic completely
v3: Remove unnecessary renames and squash patches
v4: Indentation fixes

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Umesh Nerlige Ramappa (2):
  i915/perf: Drop the aging_tail logic in perf OA
  i915/perf: Do not add ggtt offset to hw_tail

 drivers/gpu/drm/i915/i915_perf.c       | 92 ++++++++++----------------
 drivers/gpu/drm/i915/i915_perf_types.h | 12 ----
 2 files changed, 36 insertions(+), 68 deletions(-)

Comments

Umesh Nerlige Ramappa June 7, 2023, 7:25 p.m. UTC | #1
On Mon, Jun 05, 2023 at 11:44:21PM +0000, Patchwork wrote:
>   Patch Details
>
>Series:  Avoid reading OA reports before they land
>URL:     [1]https://patchwork.freedesktop.org/series/118886/
>State:   failure
>Details: [2]https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_118886v1/index.html
>
>          CI Bug Log - changes from CI_DRM_13232 -> Patchwork_118886v1
>
>Summary
>
>   FAILURE
>
>   Serious unknown changes coming with Patchwork_118886v1 absolutely need to
>   be verified manually.
>
>   If you think the reported changes have nothing to do with the changes
>   introduced in Patchwork_118886v1, please notify your bug team to allow
>   them to document this new failure mode, which will reduce false
>   positives in CI.
>
>   External URL:
>   https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_118886v1/index.html
>
>Participating hosts (37 -> 37)
>
>   Additional (1): bat-rpls-2
>   Missing (1): fi-snb-2520m
>
>Possible new issues
>
>   Here are the unknown changes that may have been introduced in
>   Patchwork_118886v1:
>
>  IGT changes
>
>    Possible regressions
>
>     * igt@i915_selftest@live@gt_timelines:
>
>          * fi-apl-guc: [3]PASS -> [4]DMESG-WARN +2 similar issues

<3> [309.685038] i915 0000:00:02.0: [drm] *ERROR* Failed to probe lspcon

This warning is not related to OA or any use case from this patch.

>
>    Warnings
>
>     * igt@kms_psr@sprite_plane_onoff:
>
>          * bat-rplp-1: [5]SKIP ([6]i915#1072) -> [7]ABORT

+ John

This is not related to OA; it is a known lockdep issue.

<4>[  229.036305] ======================================================
<4>[  229.036320] WARNING: possible circular locking dependency detected
<4>[  229.036334] 6.4.0-rc5-Patchwork_118886v1-g450d228e3840+ #1 Not tainted
<4>[  229.036348] ------------------------------------------------------
<4>[  229.036362] kworker/0:0H/8 is trying to acquire lock:
<4>[  229.036374] ffff888117b74f48 (&gt->reset.backoff_srcu){++++}-{0:0}, at: _intel_gt_reset_lock+0x0/0x330 [i915]
<4>[  229.036503] but task is already holding lock:
<4>[  229.036521] ffffc900000d3e60 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: process_one_work+0x1cc/0x510
<4>[  229.036548] which lock already depends on the new lock.

<4>[  229.036574] the existing dependency chain (in reverse order) is:
<4>[  229.036598] -> #3 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
<4>[  229.036624]        lock_acquire+0xd8/0x2d0
<4>[  229.036636]        __flush_work+0x74/0x530
<4>[  229.036646]        __cancel_work_timer+0x14f/0x1f0
<4>[  229.036658]        intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
<4>[  229.036799]        intel_uc_reset_prepare+0x9c/0x120 [i915]
<4>[  229.036938]        reset_prepare+0x21/0x60 [i915]
<4>[  229.037054]        intel_gt_reset+0x1dd/0x470 [i915]
<4>[  229.037172]        intel_gt_reset_global+0xfb/0x170 [i915]
<4>[  229.037285]        intel_gt_handle_error+0x368/0x420 [i915]
<4>[  229.037401]        intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
<4>[  229.037509]        i915_wedged_set+0x29/0x40 [i915]
<4>[  229.037600]        simple_attr_write_xsigned.constprop.0+0xb4/0x110
<4>[  229.037616]        full_proxy_write+0x52/0x80
<4>[  229.037627]        vfs_write+0xc5/0x4f0
<4>[  229.037637]        ksys_write+0x64/0xe0
<4>[  229.037646]        do_syscall_64+0x3c/0x90
<4>[  229.037658]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
<4>[  229.037672] -> #2 (&gt->reset.mutex){+.+.}-{3:3}:
<4>[  229.037694]        lock_acquire+0xd8/0x2d0
<4>[  229.037704]        i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
<4>[  229.037835]        intel_gt_init_reset+0x65/0x80 [i915]
<4>[  229.037948]        intel_gt_common_init_early+0xe1/0x170 [i915]
<4>[  229.038055]        intel_root_gt_init_early+0x48/0x60 [i915]
<4>[  229.038158]        i915_driver_probe+0x243/0xcd0 [i915]
<4>[  229.038247]        i915_pci_probe+0xdc/0x210 [i915]
<4>[  229.038335]        pci_device_probe+0x95/0x120
<4>[  229.038347]        really_probe+0x164/0x3c0
<4>[  229.038358]        __driver_probe_device+0x73/0x160
<4>[  229.038371]        driver_probe_device+0x19/0xa0
<4>[  229.038384]        __driver_attach+0xb6/0x180
<4>[  229.038395]        bus_for_each_dev+0x77/0xd0
<4>[  229.038405]        bus_add_driver+0x114/0x210
<4>[  229.038415]        driver_register+0x5b/0x110
<4>[  229.038425]        0xffffffffa00fd033
<4>[  229.038439]        do_one_initcall+0x57/0x270
<4>[  229.038450]        do_init_module+0x5f/0x220
<4>[  229.038461]        load_module+0x1ca4/0x1f00
<4>[  229.038472]        __do_sys_finit_module+0xb4/0x130
<4>[  229.038484]        do_syscall_64+0x3c/0x90
<4>[  229.038495]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
<4>[  229.038508] -> #1 (fs_reclaim){+.+.}-{0:0}:
<4>[  229.038528]        lock_acquire+0xd8/0x2d0
<4>[  229.038538]        fs_reclaim_acquire+0xac/0xe0
<4>[  229.038550]        __kmem_cache_alloc_node+0x30/0x1b0
<4>[  229.038563]        kmalloc_trace+0x24/0xb0
<4>[  229.039296]        kernfs_fop_open+0xc0/0x3d0
<4>[  229.040028]        do_dentry_open+0x14a/0x440
<4>[  229.040754]        path_openat+0x663/0x8a0
<4>[  229.041480]        do_filp_open+0xb1/0x120
<4>[  229.042030]        do_sys_openat2+0x250/0x330
<4>[  229.042545]        do_sys_open+0x43/0x80
<4>[  229.043107]        do_syscall_64+0x3c/0x90
<4>[  229.043665]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
<4>[  229.044221] -> #0 (/-1493934552){...+}-{0:0}:
<1>[  229.045307] BUG: kernel NULL pointer dereference, address: 0000000000000014
<1>[  229.045852] #PF: supervisor read access in kernel mode
<1>[  229.046390] #PF: error_code(0x0000) - not-present page
<6>[  229.046922] PGD 0 P4D 0
<4>[  229.047460] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[  229.048034] CPU: 0 PID: 8 Comm: kworker/0:0H Not tainted 6.4.0-rc5-Patchwork_118886v1-g450d228e3840+ #1
<4>[  229.048629] Hardware name: Intel Corporation Raptor Lake Client Platform/RaptorLake-P LP5 RVP, BIOS RPLPFWI1.R00.3257.A00.2207020323 07/02/2022
<4>[  229.049233] Workqueue: events_highpri guc_timestamp_ping [i915]
<4>[  229.049965] RIP: 0010:print_circular_bug_entry.isra.0+0x44/0x50
<4>[  229.050571] Code: 53 48 89 f3 89 d6 e8 5b 74 01 00 48 8b 7d 00 e8 d2 f3 ff ff 48 c7 c7 65 21 3c 82 e8 46 74 01 00 48 8b 3b ba 06 00 00 00 5b 5d <8b> 77 14 48 83 c7 18 e9 50 d6 04 00 90 90 90 90 90 90 90 90 90 90
<4>[  229.051206] RSP: 0018:ffffc900000d3b68 EFLAGS: 00010046
<4>[  229.051853] RAX: 0000000000000001 RBX: ffff888100d9b3f0 RCX: 0000000000000000
<4>[  229.052506] RDX: 0000000000000006 RSI: ffffffff823ccb57 RDI: 0000000000000000
<4>[  229.053151] RBP: ffff888100d9b3c8 R08: 0000000000000000 R09: ffffc900000d3a10
<4>[  229.053794] R10: 000000000024fd38 R11: 000000000024fda8 R12: 0000000000000000
<4>[  229.054443] R13: ffffc9000256fd00 R14: ffff888100d9a9c0 R15: ffffffff83f8fd40
<4>[  229.055094] FS:  0000000000000000(0000) GS:ffff8882a7000000(0000) knlGS:0000000000000000
<4>[  229.055753] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  229.056409] CR2: 0000000000000014 CR3: 00000001095b2000 CR4: 0000000000f50ef0
<4>[  229.057069] PKRU: 55555554
<4>[  229.057727] Call Trace:
<4>[  229.058378]  <TASK>
<4>[  229.059023]  ? __die_body+0x1a/0x60
<4>[  229.059671]  ? page_fault_oops+0x156/0x450
<4>[  229.060319]  ? do_user_addr_fault+0x65/0xa10
<4>[  229.060976]  ? exc_page_fault+0x68/0x1a0
<4>[  229.061629]  ? asm_exc_page_fault+0x26/0x30
<4>[  229.062281]  ? print_circular_bug_entry.isra.0+0x44/0x50
<4>[  229.062926]  print_circular_bug.isra.0+0x111/0x3f0
<4>[  229.063536]  check_noncircular+0x131/0x150
<4>[  229.064154]  ? arch_stack_walk+0x87/0xf0
<4>[  229.064759]  check_prev_add+0x90/0xc60
<4>[  229.065363]  __lock_acquire+0x19a3/0x25a0
<4>[  229.065966]  ? startup_64_setup_env+0x184/0xaf0
<4>[  229.066568]  lock_acquire+0xd8/0x2d0
<4>[  229.067173]  ? __pfx__intel_gt_reset_lock+0x10/0x10 [i915]
<4>[  229.067881]  _intel_gt_reset_lock+0x57/0x330 [i915]
<4>[  229.068586]  ? __pfx__intel_gt_reset_lock+0x10/0x10 [i915]
<4>[  229.069288]  guc_timestamp_ping+0x35/0x130 [i915]
<4>[  229.070018]  process_one_work+0x250/0x510
<4>[  229.070629]  worker_thread+0x4f/0x3a0
<4>[  229.071235]  ? __pfx_worker_thread+0x10/0x10
<4>[  229.071845]  kthread+0xff/0x130
<4>[  229.072454]  ? __pfx_kthread+0x10/0x10
<4>[  229.073064]  ret_from_fork+0x29/0x50
<4>[  229.073674]  </TASK>
<4>[  229.074283] Modules linked in: vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm i915 prime_numbers i2c_algo_bit ttm drm_buddy drm_display_helper drm_kms_helper fuse r8153_ecm cdc_ether usbnet x86_pkg_temp_thermal coretemp kvm_intel kvm e1000e mei_pxp mei_hdcp r8152 irqbypass crct10dif_pclmul crc32_pclmul wmi_bmof mii ghash_clmulni_intel mei_me ptp i2c_i801 mei pps_core i2c_smbus video intel_lpss_pci wmi
<4>[  229.075708] CR2: 0000000000000014
<4>[  229.076421] ---[ end trace 0000000000000000 ]---
<4>[  229.373071] RIP: 0010:print_circular_bug_entry.isra.0+0x44/0x50
<4>[  229.373942] Code: 53 48 89 f3 89 d6 e8 5b 74 01 00 48 8b 7d 00 e8 d2 f3 ff ff 48 c7 c7 65 21 3c 82 e8 46 74 01 00 48 8b 3b ba 06 00 00 00 5b 5d <8b> 77 14 48 83 c7 18 e9 50 d6 04 00 90 90 90 90 90 90 90 90 90 90
<4>[  229.374830] RSP: 0018:ffffc900000d3b68 EFLAGS: 00010046
<4>[  229.375578] RAX: 0000000000000001 RBX: ffff888100d9b3f0 RCX: 0000000000000000
<4>[  229.376235] RDX: 0000000000000006 RSI: ffffffff823ccb57 RDI: 0000000000000000
<4>[  229.376927] RBP: ffff888100d9b3c8 R08: 0000000000000000 R09: ffffc900000d3a10
<4>[  229.377649] R10: 000000000024fd38 R11: 000000000024fda8 R12: 0000000000000000
<4>[  229.378373] R13: ffffc9000256fd00 R14: ffff888100d9a9c0 R15: ffffffff83f8fd40
<4>[  229.379100] FS:  0000000000000000(0000) GS:ffff8882a7000000(0000) knlGS:0000000000000000
<4>[  229.379838] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  229.380578] CR2: 0000000000000014 CR3: 00000001095b2000 CR4: 0000000000f50ef0
<4>[  229.381331] PKRU: 55555554


>
>Known issues
>
Umesh Nerlige Ramappa June 7, 2023, 7:44 p.m. UTC | #2
On Wed, Jun 07, 2023 at 05:40:28PM +0000, Patchwork wrote:
>   Patch Details
>
>Series:  Avoid reading OA reports before they land (rev2)
>URL:     [1]https://patchwork.freedesktop.org/series/118886/
>State:   failure
>Details: [2]https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_118886v2/index.html
>
>     CI Bug Log - changes from CI_DRM_13238_full -> Patchwork_118886v2_full
>
>Summary
>
>   FAILURE
>
>   Serious unknown changes coming with Patchwork_118886v2_full absolutely
>   need to be verified manually.
>
>   If you think the reported changes have nothing to do with the changes
>   introduced in Patchwork_118886v2_full, please notify your bug team to
>   allow them to document this new failure mode, which will reduce false
>   positives in CI.
>
>Participating hosts (7 -> 7)
>
>   No changes in participating hosts
>
>Possible new issues
>
>   Here are the unknown changes that may have been introduced in
>   Patchwork_118886v2_full:
>
>  IGT changes
>
>    Possible regressions
>
>     * igt@kms_vblank@pipe-b-accuracy-idle:
>
>          * shard-glk: [3]PASS -> [4]FAIL

Unrelated to this patch, since there are no OA use cases in the above test path.

Umesh

>
>Known issues
>