From patchwork Thu Nov  1 08:35:41 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sharat Masetty <smasetty@codeaurora.org>
X-Patchwork-Id: 10663687
Return-Path: <linux-arm-msm-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CDA2F13A4
	for <patchwork-linux-arm-msm@patchwork.kernel.org>;
 Thu,  1 Nov 2018 08:35:52 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B28042B8D9
	for <patchwork-linux-arm-msm@patchwork.kernel.org>;
 Thu,  1 Nov 2018 08:35:52 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id A6E6C2B8E4; Thu,  1 Nov 2018 08:35:52 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.7 required=2.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C66FA2B8D9
	for <patchwork-linux-arm-msm@patchwork.kernel.org>;
 Thu,  1 Nov 2018 08:35:51 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727949AbeKARhv (ORCPT
        <rfc822;patchwork-linux-arm-msm@patchwork.kernel.org>);
        Thu, 1 Nov 2018 13:37:51 -0400
Received: from smtp.codeaurora.org ([198.145.29.96]:51204 "EHLO
        smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727704AbeKARhv (ORCPT
        <rfc822;linux-arm-msm@vger.kernel.org>);
        Thu, 1 Nov 2018 13:37:51 -0400
Received: by smtp.codeaurora.org (Postfix, from userid 1000)
        id 12EDE60591; Thu,  1 Nov 2018 08:35:50 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org;
        s=default; t=1541061350;
        bh=tlhe0hMouKeg5bIRJmzaGu3NtxwBBahOdV0U+misypg=;
        h=From:To:Cc:Subject:Date:From;
        b=CUXk7Okvc+XQVTmVfH2EyxcdljQ8kQ1G+fzm/gJTq3gEFoczZxQ5XynsZdfbXCP8P
         hNGLTf7PdraD7DGOHU/zg9mYKdlv8qtgPPPpKrjp3qQqdkbvSy0c9vE9xc7TiOA7/U
         yTSHkmczvLCMwLVh3QEP2tiShjwPKCTuATj9Ylic=
Received: from smasetty-linux.qualcomm.com
 (blr-c-bdr-fw-01_globalnat_allzones-outside.qualcomm.com [103.229.19.19])
        (using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits))
        (No client certificate requested)
        (Authenticated sender: smasetty@smtp.codeaurora.org)
        by smtp.codeaurora.org (Postfix) with ESMTPSA id F29DF60112;
        Thu,  1 Nov 2018 08:35:47 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org;
        s=default; t=1541061349;
        bh=tlhe0hMouKeg5bIRJmzaGu3NtxwBBahOdV0U+misypg=;
        h=From:To:Cc:Subject:Date:From;
        b=Y4cZsigJaSKSwMleyb3VQXuXr8MSdK+FJidcbWfOGhK7RGZhShmXVkA888Duk/jxQ
         Oa7cfL9kqvuqAaOZTguIQ3pZobHOA+saLtd/Baxgq1XNwbBdydBuRZZRLsZedwsR8l
         sQJprSEqMgCg88nd5omexQaMVLvcjLnLq4qjCZk4=
DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org F29DF60112
Authentication-Results: pdx-caf-mail.web.codeaurora.org;
 dmarc=none (p=none dis=none) header.from=codeaurora.org
Authentication-Results: pdx-caf-mail.web.codeaurora.org;
 spf=none smtp.mailfrom=smasetty@codeaurora.org
From: Sharat Masetty <smasetty@codeaurora.org>
To: freedreno@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, linux-arm-msm@vger.kernel.org,
        jcrouse@codeaurora.org, Sharat Masetty <smasetty@codeaurora.org>
Subject: [PATCH] drm/msm: Optimize GPU crashstate capture read path
Date: Thu,  1 Nov 2018 14:05:41 +0530
Message-Id: <1541061341-29994-1-git-send-email-smasetty@codeaurora.org>
X-Mailer: git-send-email 1.9.1
Sender: linux-arm-msm-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-arm-msm.vger.kernel.org>
X-Mailing-List: linux-arm-msm@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

When the userspace tries to read the crashstate dump, the read side
implementation in the driver currently ascii85 encodes all the binary
buffers and it does this each time the read system call is called.
A userspace tool like cat typically does a page by page read and the
number of read calls depends on the size of the data captured by the
driver. This is certainly not desirable and does not scale well with
large captures.

This patch starts off with moving the encoding part to the capture time,
that is when the GPU tries to recover after a hang. Now we only encode
once per buffer object and not per page read. With this there is an
immediate >10X speed improvement in crashstate save time.

The flip side however is that the GPU recovery time increases and this
negatively impacts the user experience, so this is handled by moving the
encoding part to a worker thread and only register with the dev_coredump
once the encoding of the buffers is complete.

Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
---
 drivers/gpu/drm/msm/adreno/adreno_gpu.c | 35 +++----------
 drivers/gpu/drm/msm/msm_gpu.c           | 93 +++++++++++++++++++++++++++++++--
 drivers/gpu/drm/msm/msm_gpu.h           |  1 +
 3 files changed, 98 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 141062f..7441571 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -406,7 +406,7 @@ int adreno_gpu_state_get(struct msm_gpu *gpu, struct msm_gpu_state *state)
 				size = j + 1;
 
 		if (size) {
-			state->ring[i].data = kmalloc(size << 2, GFP_KERNEL);
+			state->ring[i].data = kvmalloc(size << 2, GFP_KERNEL);
 			if (state->ring[i].data) {
 				memcpy(state->ring[i].data, gpu->rb[i]->start, size << 2);
 				state->ring[i].data_size = size << 2;
@@ -445,7 +445,7 @@ void adreno_gpu_state_destroy(struct msm_gpu_state *state)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(state->ring); i++)
-		kfree(state->ring[i].data);
+		kvfree(state->ring[i].data);
 
 	for (i = 0; state->bos && i < state->nr_bos; i++)
 		kvfree(state->bos[i].data);
@@ -475,34 +475,15 @@ int adreno_gpu_state_put(struct msm_gpu_state *state)
 
 #if defined(CONFIG_DEBUG_FS) || defined(CONFIG_DEV_COREDUMP)
 
-static void adreno_show_object(struct drm_printer *p, u32 *ptr, int len)
+static void adreno_show_object(struct drm_printer *p, void *ptr)
 {
-	char out[ASCII85_BUFSZ];
-	long l, datalen, i;
-
-	if (!ptr || !len)
-		return;
-
-	/*
-	 * Only dump the non-zero part of the buffer - rarely will any data
-	 * completely fill the entire allocated size of the buffer
-	 */
-	for (datalen = 0, i = 0; i < len >> 2; i++) {
-		if (ptr[i])
-			datalen = (i << 2) + 1;
-	}
-
-	/* Skip printing the object if it is empty */
-	if (datalen == 0)
+	if (!ptr)
 		return;
 
-	l = ascii85_encode_len(datalen);
-
 	drm_puts(p, "    data: !!ascii85 |\n");
 	drm_puts(p, "     ");
 
-	for (i = 0; i < l; i++)
-		drm_puts(p, ascii85_encode(ptr[i], out));
+	drm_puts(p, ptr);
 
 	drm_puts(p, "\n");
 }
@@ -534,8 +515,7 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
 		drm_printf(p, "    wptr: %d\n", state->ring[i].wptr);
 		drm_printf(p, "    size: %d\n", MSM_GPU_RINGBUFFER_SZ);
 
-		adreno_show_object(p, state->ring[i].data,
-			state->ring[i].data_size);
+		adreno_show_object(p, state->ring[i].data);
 	}
 
 	if (state->bos) {
@@ -546,8 +526,7 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
 				state->bos[i].iova);
 			drm_printf(p, "    size: %zd\n", state->bos[i].size);
 
-			adreno_show_object(p, state->bos[i].data,
-				state->bos[i].size);
+			adreno_show_object(p, state->bos[i].data);
 		}
 	}
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index ff95848..d5533f1 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -21,6 +21,7 @@
 #include "msm_fence.h"
 
 #include <generated/utsrelease.h>
+#include <linux/ascii85.h>
 #include <linux/string_helpers.h>
 #include <linux/pm_opp.h>
 #include <linux/devfreq.h>
@@ -375,9 +376,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
 	/* Set the active crash state to be dumped on failure */
 	gpu->crashstate = state;
 
-	/* FIXME: Release the crashstate if this errors out? */
-	dev_coredumpm(gpu->dev->dev, THIS_MODULE, gpu, 0, GFP_KERNEL,
-		msm_gpu_devcoredump_read, msm_gpu_devcoredump_free);
+	schedule_work(&gpu->crashstate_work);
 }
 #else
 static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
@@ -420,6 +419,93 @@ static void update_fences(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
 
 static void retire_submits(struct msm_gpu *gpu);
 
+static char *msm_gpu_ascii85_encode_buf(u32 *src, size_t len)
+{
+	void *buf;
+	size_t buf_itr = 0, buffer_size;
+	char out[ASCII85_BUFSZ];
+	long l;
+	int i;
+
+	if (!src || !len)
+		return NULL;
+
+	l = ascii85_encode_len(len);
+
+	/*
+	 * Ascii85 outputs either a 5 byte string or a 1 byte string. So we
+	 * account for the worst case of 5 bytes per dword plus the 1 for '\0'
+	 */
+	buffer_size = (l * 5) + 1;
+
+	buf = kvmalloc(buffer_size, GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	for (i = 0; i < l; i++)
+		buf_itr += snprintf(buf + buf_itr, buffer_size - buf_itr, "%s",
+				ascii85_encode(src[i], out));
+
+	return buf;
+}
+
+/*
+ * Convert the binary data in the gpu crash state capture(like bos and
+ * ringbuffer data) to ascii85 encoded strings. Note that the encoded
+ * string is NULL terminated.
+ */
+static void crashstate_worker(struct work_struct *work)
+{
+	struct msm_gpu_state *state;
+	struct msm_gpu *gpu = container_of(work, struct msm_gpu,
+			crashstate_work);
+	int i;
+
+	state = msm_gpu_crashstate_get(gpu);
+	if (!state)
+		return;
+
+	for (i = 0; i < MSM_GPU_MAX_RINGS; i++) {
+		void *buf = NULL;
+
+		if (!state->ring[i].data)
+			continue;
+
+		buf = state->ring[i].data;
+
+		state->ring[i].data = msm_gpu_ascii85_encode_buf(buf,
+				state->ring[i].data_size);
+		kvfree(buf);
+	}
+
+	for (i = 0; i < state->nr_bos; i++) {
+		u32 *buf = NULL;
+		long datalen, j;
+
+		if (!state->bos[i].data)
+			continue;
+
+		buf = state->bos[i].data;
+
+		/*
+		 * Only dump the non-zero part of the buffer - rarely will
+		 * any data completely fill the entire allocated size of the
+		 * buffer
+		 */
+		for (datalen = 0, j = 0; j < state->bos[i].size >> 2; j++)
+			if (buf[j])
+				datalen = ((j + 1) << 2);
+
+		state->bos[i].data = msm_gpu_ascii85_encode_buf(buf, datalen);
+		kvfree(buf);
+	}
+
+	msm_gpu_crashstate_put(gpu);
+
+	dev_coredumpm(gpu->dev->dev, THIS_MODULE, gpu, 0, GFP_KERNEL,
+		msm_gpu_devcoredump_read, msm_gpu_devcoredump_free);
+}
+
 static void recover_worker(struct work_struct *work)
 {
 	struct msm_gpu *gpu = container_of(work, struct msm_gpu, recover_work);
@@ -856,6 +942,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	INIT_LIST_HEAD(&gpu->active_list);
 	INIT_WORK(&gpu->retire_work, retire_worker);
 	INIT_WORK(&gpu->recover_work, recover_worker);
+	INIT_WORK(&gpu->crashstate_work, crashstate_worker);
 
 
 	timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index f82bac0..8825069 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -137,6 +137,7 @@ struct msm_gpu {
 	} devfreq;
 
 	struct msm_gpu_state *crashstate;
+	struct work_struct crashstate_work;
 };
 
 /* It turns out that all targets use the same ringbuffer size */