
[v2] drm/i915: add guard page to ggtt->error_capture

Message ID 20230203130328.3303274-1-andrzej.hajda@intel.com (mailing list archive)
State New, archived
Series [v2] drm/i915: add guard page to ggtt->error_capture

Commit Message

Andrzej Hajda Feb. 3, 2023, 1:03 p.m. UTC
Write-combining memory allows speculative reads by the CPU.
ggtt->error_capture is WC-mapped to the CPU, so the CPU/MMU can try
to prefetch memory beyond error_capture, i.e. it tries
to read the memory pointed to by the next PTE in the GGTT.
If this PTE points to an invalid address, DMAR errors will occur.
This behavior was observed on ADL, RPL and DG2 platforms.
To avoid it, a guard scratch page should be added after error_capture.

Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
---
This patch tries to reduce the flood of DMAR errors present
in CI for ADL*, RPL*, DG2 platforms, see for example [1] (grep DMAR).
CI usually tolerates these errors, so the scale of the problem
is not really visible.
To show it, I have counted the lines containing DMAR errors in the dmesgs
produced by CI for the 1st version of the patch:
CI_DRM_12680: 626 errors
Patchwork_113560v1: 136 errors
So we have about 500 DMAR error lines per CI run due to error_capture.

[1]: http://gfx-ci.igk.intel.com/tree/drm-tip/CI_DRM_12678/bat-adln-1/dmesg0.txt

v2:
    - modified the commit message (I hope the diagnosis is correct),
    - added bug checks to ensure scratch is initialized on gen3 platforms.
      CI produces a strange stack trace for it [2], suggesting scratch[0] is NULL;
      the checks are to be removed after resolving the issue with gen3 platforms.

[2]: http://gfx-ci.igk.intel.com/tree/drm-tip/Patchwork_113560v2/fi-blb-e6850/igt@i915_module_load@load.html

Regards
Andrzej
---
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 30 ++++++++++++++++++++++++----
 drivers/gpu/drm/i915/gt/intel_gtt.c  |  2 +-
 2 files changed, 27 insertions(+), 5 deletions(-)

Comments

Andrzej Hajda Feb. 6, 2023, 8:18 a.m. UTC | #1
On 03.02.2023 17:35, Patchwork wrote:
> *Patch Details*
> *Series:*	drm/i915: add guard page to ggtt->error_capture (rev3)
> *URL:*	https://patchwork.freedesktop.org/series/113560/ 
> <https://patchwork.freedesktop.org/series/113560/>
> *State:*	failure
> *Details:* 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/index.html 
> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/index.html>
> 
> 
>   CI Bug Log - changes from CI_DRM_12691 -> Patchwork_113560v3
> 
> 
>     Summary
> 
> *FAILURE*
> 
> Serious unknown changes coming with Patchwork_113560v3 absolutely need to be
> verified manually.
> 
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_113560v3, please notify your bug team to allow them
> to document this new failure mode, which will reduce false positives in CI.
> 
> External URL: 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/index.html
> 
> 
>     Participating hosts (28 -> 26)
> 
> Missing (2): bat-atsm-1 fi-snb-2520m
> 
> 
>     Possible new issues
> 
> Here are the unknown changes that may have been introduced in 
> Patchwork_113560v3:
> 
> 
>       IGT changes
> 
> 
>         Possible regressions
> 
>   * igt@i915_module_load@load:
>       o fi-blb-e6850: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12691/fi-blb-e6850/igt@i915_module_load@load.html> -> ABORT <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/fi-blb-e6850/igt@i915_module_load@load.html>


OK, the stack trace and code checks clearly show that scratch[0] is NULL for
the ggtt on gen < 6:
... ggtt_probe_hw(...)
{
	...
	if (GRAPHICS_VER(i915) >= 8)
		ret = gen8_gmch_probe(ggtt);
	else if (GRAPHICS_VER(i915) >= 6)
		ret = gen6_gmch_probe(ggtt);
	else
		ret = intel_ggtt_gmch_probe(ggtt);
	...
}

And setup_scratch_page for the ggtt is called only from gen[68]_gmch_probe.
Anyway, the speculative read is observed only since gen12, so limiting the
guard page to gen12+ should be enough to avoid the null scratch.
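
A minimal sketch of that gating, written as a standalone user-space model
rather than driver code. Everything prefixed with model_ or MODEL_ is
hypothetical and only stands in for the i915 structures; the ">= 12" check
is an assumption about how the limit could be expressed, not the final patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE   4096u	/* stands in for I915_GTT_PAGE_SIZE */
#define MODEL_NUM_PTES    16u	/* tiny GGTT, enough for the illustration */
#define MODEL_PTE_INVALID 0u	/* a PTE that was never programmed */

/* Hypothetical, heavily simplified stand-in for struct i915_ggtt. */
struct model_ggtt {
	int graphics_ver;		/* models GRAPHICS_VER(i915) */
	uint64_t scratch_dma;		/* 0 models scratch[0] == NULL */
	uint64_t pte[MODEL_NUM_PTES];	/* one entry per GGTT page */
	uint64_t capture_start;		/* models error_capture.start */
	uint64_t capture_size;		/* models error_capture.size */
};

/* Assumption: the guard page is only needed (and only safe) on gen12+. */
static bool model_needs_guard(const struct model_ggtt *ggtt)
{
	return ggtt->graphics_ver >= 12;
}

/* One page for the capture slot, plus one guard page where required. */
static uint64_t model_capture_size(const struct model_ggtt *ggtt)
{
	return model_needs_guard(ggtt) ? 2 * MODEL_PAGE_SIZE : MODEL_PAGE_SIZE;
}

/*
 * Reserve the error capture range at 'start' and point every page after
 * the first one at the scratch page. On platforms without a guard page
 * the loop body never runs, so a missing scratch page is never touched.
 */
static void model_reserve_error_capture(struct model_ggtt *ggtt, uint64_t start)
{
	uint64_t end, off;

	ggtt->capture_start = start;
	ggtt->capture_size = model_capture_size(ggtt);
	end = start + ggtt->capture_size;

	for (off = start + MODEL_PAGE_SIZE; off < end; off += MODEL_PAGE_SIZE) {
		assert(ggtt->scratch_dma != 0);	/* plays the role of GEM_BUG_ON */
		assert(off / MODEL_PAGE_SIZE < MODEL_NUM_PTES);
		ggtt->pte[off / MODEL_PAGE_SIZE] = ggtt->scratch_dma;
	}
}
```

With graphics_ver = 12 and a non-zero scratch_dma the model reserves two
pages and fills the second PTE with the scratch address; with graphics_ver = 3
and scratch_dma = 0 it reserves a single page and never dereferences scratch.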

Regards
Andrzej

> 
> 
>         Warnings
> 
>   * igt@i915_selftest@live@execlists:
>       o fi-kbl-soraka: INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12691/fi-kbl-soraka/igt@i915_selftest@live@execlists.html> (i915#7156 <https://gitlab.freedesktop.org/drm/intel/issues/7156>) -> INCOMPLETE <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/fi-kbl-soraka/igt@i915_selftest@live@execlists.html>
> 
> 
>         Suppressed
> 
> The following results come from untrusted machines, tests, or statuses.
> They do not affect the overall result.
> 
>   * igt@kms_pipe_crc_basic@suspend-read-crc@pipe-d-dp-1:
>       o {bat-adlp-9}: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12691/bat-adlp-9/igt@kms_pipe_crc_basic@suspend-read-crc@pipe-d-dp-1.html> -> DMESG-WARN <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/bat-adlp-9/igt@kms_pipe_crc_basic@suspend-read-crc@pipe-d-dp-1.html>
> 
> 
>     Known issues
> 
> Here are the changes found in Patchwork_113560v3 that come from known 
> issues:
> 
> 
>       IGT changes
> 
> 
>         Possible fixes
> 
>   *
> 
>     igt@i915_selftest@live@gt_heartbeat:
> 
>       o fi-apl-guc: DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12691/fi-apl-guc/igt@i915_selftest@live@gt_heartbeat.html> (i915#5334 <https://gitlab.freedesktop.org/drm/intel/issues/5334>) -> PASS <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/fi-apl-guc/igt@i915_selftest@live@gt_heartbeat.html>
>   *
> 
>     igt@i915_selftest@live@migrate:
> 
>       o {bat-adlp-9}: DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12691/bat-adlp-9/igt@i915_selftest@live@migrate.html> (i915#7699 <https://gitlab.freedesktop.org/drm/intel/issues/7699>) -> PASS <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113560v3/bat-adlp-9/igt@i915_selftest@live@migrate.html>
> 
> {name}: This element is suppressed. This means it is ignored when computing
> the status of the difference (SUCCESS, WARNING, or FAILURE).
> 
> 
>     Build changes
> 
>   * Linux: CI_DRM_12691 -> Patchwork_113560v3
> 
> CI-20190529: 20190529
> CI_DRM_12691: 2153bc2944d37403c6d5c4e1082d074a34d39ae9 @ 
> git://anongit.freedesktop.org/gfx-ci/linux
> IGT_7148: ee8e31cf39c44d3fdbd04d8db239f8a815f86121 @ 
> https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
> Patchwork_113560v3: 2153bc2944d37403c6d5c4e1082d074a34d39ae9 @ 
> git://anongit.freedesktop.org/gfx-ci/linux
> 
> 
>       Linux commits
> 
> 5bccb726f2f4 drm/i915: add guard page to ggtt->error_capture
>

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 842e69c7b21e49..79e327003da12f 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -503,6 +503,14 @@  static void cleanup_init_ggtt(struct i915_ggtt *ggtt)
 	mutex_destroy(&ggtt->error_mutex);
 }
 
+static void ggtt_insert_scratch_page(struct i915_ggtt *ggtt, u64 offset)
+{
+	struct i915_address_space *vm = &ggtt->vm;
+
+	GEM_BUG_ON(!vm->scratch[0]);
+	vm->insert_page(vm, px_dma(vm->scratch[0]), offset, I915_CACHE_NONE, 0);
+}
+
 static int init_ggtt(struct i915_ggtt *ggtt)
 {
 	/*
@@ -551,8 +559,12 @@  static int init_ggtt(struct i915_ggtt *ggtt)
 		 * paths, and we trust that 0 will remain reserved. However,
 		 * the only likely reason for failure to insert is a driver
 		 * bug, which we expect to cause other failures...
+		 *
+		 * Since CPU can perform speculative reads on error capture
+		 * (write-combining allows it) add scratch page after it to
+		 * avoid DMAR errors.
 		 */
-		ggtt->error_capture.size = I915_GTT_PAGE_SIZE;
+		ggtt->error_capture.size = 2 * I915_GTT_PAGE_SIZE;
 		ggtt->error_capture.color = I915_COLOR_UNEVICTABLE;
 		if (drm_mm_reserve_node(&ggtt->vm.mm, &ggtt->error_capture))
 			drm_mm_insert_node_in_range(&ggtt->vm.mm,
@@ -562,11 +574,21 @@  static int init_ggtt(struct i915_ggtt *ggtt)
 						    0, ggtt->mappable_end,
 						    DRM_MM_INSERT_LOW);
 	}
-	if (drm_mm_node_allocated(&ggtt->error_capture))
+	if (drm_mm_node_allocated(&ggtt->error_capture)) {
+		u64 start = ggtt->error_capture.start;
+		u64 end = ggtt->error_capture.start + ggtt->error_capture.size;
+		u64 i;
+
+		/*
+		 * During error capture, memcpying from the GGTT is triggering a
+		 * prefetch of the following PTE, so fill it with a guard page.
+		 */
+		for (i = start + I915_GTT_PAGE_SIZE; i < end; i += I915_GTT_PAGE_SIZE)
+			ggtt_insert_scratch_page(ggtt, i);
 		drm_dbg(&ggtt->vm.i915->drm,
 			"Reserved GGTT:[%llx, %llx] for use by error capture\n",
-			ggtt->error_capture.start,
-			ggtt->error_capture.start + ggtt->error_capture.size);
+			start, end);
+	}
 
 	/*
 	 * The upper portion of the GuC address space has a sizeable hole
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.c b/drivers/gpu/drm/i915/gt/intel_gtt.c
index 4f436ba7a3c833..dddafc33054971 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.c
@@ -293,7 +293,7 @@  void *__px_vaddr(struct drm_i915_gem_object *p)
 
 dma_addr_t __px_dma(struct drm_i915_gem_object *p)
 {
-	GEM_BUG_ON(!i915_gem_object_has_pages(p));
+	GEM_BUG_ON(!p || !i915_gem_object_has_pages(p));
 	return sg_dma_address(p->mm.pages->sgl);
 }