From patchwork Mon Sep 6 01:07:35 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Pan, Xinhui"
X-Patchwork-Id: 12476191
From: "Pan, Xinhui"
To: "amd-gfx@lists.freedesktop.org", "dri-devel@lists.freedesktop.org"
CC: "Koenig, Christian", "Deucher, Alexander"
Subject: [PATCH v2 0/2] Fix a hung during memory pressure test
Date: Mon, 6 Sep 2021 01:07:35 +0000
List-Id: Direct Rendering Infrastructure - Development

[AMD Official Use Only]

A long time ago, someone reported that the system hung during a memory
test. Recently I have been trying to find and understand the potential
deadlock in the ttm/amdgpu code.

This patchset aims to fix the deadlock during ttm populate.

TTM has a parameter called pages_limit. When the allocated GTT memory
reaches this limit, swapout is triggered. Because ttm_bo_swapout does not
return the correct retval, populate might hang.

The UVD IB test uses GTT, which might be insufficient, so GPU recovery
would hang if populate hangs.

I have written a drm test that allocates two GTT BOs, submits gfx copy
commands, and frees these BOs without waiting for the fence. In addition,
these gfx copy commands cause a gfx ring hang, so GPU recovery is
triggered.

Here is one possible deadlock case:
gpu_recovery
 -> stop drm scheduler
 -> asic reset
   -> ib test
      -> tt populate (uvd ib test)
        -> ttm_bo_swapout (BO A) // this always fails, as the fence of
           BO A would not be signaled by the scheduler or HW. Hit deadlock.

I paste the drm test patch below.
#modprobe ttm pages_limit=65536
#amdgpu_test -s 1 -t 4
---
 tests/amdgpu/basic_tests.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

--
xinhui pan (2):
  drm/ttm: Fix a deadlock if the target BO is not idle during swap
  drm/amdpgu: Use VRAM domain in UVD IB test

 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 8 ++++++++
 drivers/gpu/drm/ttm/ttm_bo.c            | 6 +++---
 2 files changed, 11 insertions(+), 3 deletions(-)

--
2.25.1

diff --git a/tests/amdgpu/basic_tests.c b/tests/amdgpu/basic_tests.c
index dbf02fee..f85ed340 100644
--- a/tests/amdgpu/basic_tests.c
+++ b/tests/amdgpu/basic_tests.c
@@ -65,13 +65,16 @@ static void amdgpu_direct_gma_test(void);
 static void amdgpu_command_submission_write_linear_helper(unsigned ip_type);
 static void amdgpu_command_submission_const_fill_helper(unsigned ip_type);
 static void amdgpu_command_submission_copy_linear_helper(unsigned ip_type);
-static void amdgpu_test_exec_cs_helper(amdgpu_context_handle context_handle,
+static void _amdgpu_test_exec_cs_helper(amdgpu_context_handle context_handle,
 				       unsigned ip_type,
 				       int instance, int pm4_dw, uint32_t *pm4_src,
 				       int res_cnt, amdgpu_bo_handle *resources,
 				       struct amdgpu_cs_ib_info *ib_info,
-				       struct amdgpu_cs_request *ibs_request);
+				       struct amdgpu_cs_request *ibs_request, int sync, int repeat);

+#define amdgpu_test_exec_cs_helper(...) \
+	_amdgpu_test_exec_cs_helper(__VA_ARGS__, 1, 1)
+
 CU_TestInfo basic_tests[] = {
 	{ "Query Info Test", amdgpu_query_info_test },
 	{ "Userptr Test", amdgpu_userptr_test },
@@ -1341,12 +1344,12 @@ static void amdgpu_command_submission_compute(void)
  * pm4_src, resources, ib_info, and ibs_request
  * submit command stream described in ibs_request and wait for this IB accomplished
  */
-static void amdgpu_test_exec_cs_helper(amdgpu_context_handle context_handle,
+static void _amdgpu_test_exec_cs_helper(amdgpu_context_handle context_handle,
 				       unsigned ip_type,
 				       int instance, int pm4_dw, uint32_t *pm4_src,
 				       int res_cnt, amdgpu_bo_handle *resources,
 				       struct amdgpu_cs_ib_info *ib_info,
-				       struct amdgpu_cs_request *ibs_request)
+				       struct amdgpu_cs_request *ibs_request, int sync, int repeat)
 {
 	int r;
 	uint32_t expired;
@@ -1395,12 +1398,15 @@ static void amdgpu_test_exec_cs_helper(amdgpu_context_handle context_handle,
 	CU_ASSERT_NOT_EQUAL(ibs_request, NULL);

 	/* submit CS */
-	r = amdgpu_cs_submit(context_handle, 0, ibs_request, 1);
+	while (repeat--)
+		r = amdgpu_cs_submit(context_handle, 0, ibs_request, 1);
 	CU_ASSERT_EQUAL(r, 0);

 	r = amdgpu_bo_list_destroy(ibs_request->resources);
 	CU_ASSERT_EQUAL(r, 0);

+	if (!sync)
+		return;
 	fence_status.ip_type = ip_type;
 	fence_status.ip_instance = 0;
 	fence_status.ring = ibs_request->ring;
@@ -1667,7 +1673,7 @@ static void amdgpu_command_submission_sdma_const_fill(void)

 static void amdgpu_command_submission_copy_linear_helper(unsigned ip_type)
 {
-	const int sdma_write_length = 1024;
+	const int sdma_write_length = (255) << 20;
 	const int pm4_dw = 256;
 	amdgpu_context_handle context_handle;
 	amdgpu_bo_handle bo1, bo2;
@@ -1715,8 +1721,6 @@ static void amdgpu_command_submission_copy_linear_helper(unsigned ip_type)
 						    &bo1_va_handle);
 			CU_ASSERT_EQUAL(r, 0);

-			/* set bo1 */
-			memset((void*)bo1_cpu, 0xaa, sdma_write_length);

 			/* allocate UC bo2 for sDMA use */
 			r = amdgpu_bo_alloc_and_map(device_handle,
@@ -1727,8 +1731,6 @@ static void amdgpu_command_submission_copy_linear_helper(unsigned ip_type)
 						    &bo2_va_handle);
 			CU_ASSERT_EQUAL(r, 0);

-			/* clear bo2 */
-			memset((void*)bo2_cpu, 0, sdma_write_length);

 			resources[0] = bo1;
 			resources[1] = bo2;
@@ -1785,17 +1787,11 @@ static void amdgpu_command_submission_copy_linear_helper(unsigned ip_type)
 					}
 				}

-				amdgpu_test_exec_cs_helper(context_handle,
+				_amdgpu_test_exec_cs_helper(context_handle,
 							   ip_type, ring_id,
 							   i, pm4,
 							   2, resources,
-							   ib_info, ibs_request);
-
-				/* verify if SDMA test result meets with expected */
-				i = 0;
-				while(i < sdma_write_length) {
-					CU_ASSERT_EQUAL(bo2_cpu[i++], 0xaa);
-				}
+							   ib_info, ibs_request, 0, 100);
 				r = amdgpu_bo_unmap_and_free(bo1, bo1_va_handle, bo1_mc,
 							     sdma_write_length);
 				CU_ASSERT_EQUAL(r, 0);
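
For illustration only, the retval problem described in the cover letter
can be modelled in user space. The sketch below is not kernel code: the
names fake_bo, fake_pool, fake_swapout and fake_populate are hypothetical,
and the choice of -EBUSY as the "retry" error and -ENOSPC as the "give up"
error is an assumption made for this model, not a statement of what the
TTM patch does. It shows why a populate loop may never finish if swapout
keeps reporting a retryable error while the only candidate BO has a fence
that cannot signal. The max_iters cap exists only so the model terminates.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer object; not the kernel's struct. */
struct fake_bo {
	size_t pages;		/* pages currently held; 0 once swapped out */
	bool fence_signaled;	/* stays false while the scheduler is stopped */
};

/* Hypothetical page accounting with a single swapout candidate ("BO A"). */
struct fake_pool {
	size_t pages_allocated;
	size_t pages_limit;
	struct fake_bo *victim;
};

/*
 * Try to swap out the victim. A BO whose fence has not signaled is busy
 * and cannot be swapped; busy_err is the error reported in that case.
 * Returns the number of pages freed (> 0) or a negative errno.
 */
static long fake_swapout(struct fake_pool *pool, int busy_err)
{
	struct fake_bo *bo = pool->victim;
	size_t freed;

	if (bo->pages == 0)
		return -ENOSPC;		/* nothing left to swap out */
	if (!bo->fence_signaled)
		return busy_err;	/* busy: caller decides what this means */

	freed = bo->pages;
	pool->pages_allocated -= freed;
	bo->pages = 0;
	return (long)freed;
}

/*
 * Populate loop: swap out until the new pages fit under the limit.
 * -EBUSY is treated as "try again"; any other error aborts the loop.
 * Returns 0 on success, a negative errno on failure, or -ETIMEDOUT when
 * the iteration cap is hit (the model's stand-in for a hang).
 */
static int fake_populate(struct fake_pool *pool, size_t new_pages,
			 int busy_err, unsigned int max_iters)
{
	assert(max_iters > 0);

	while (pool->pages_allocated + new_pages > pool->pages_limit) {
		long ret;

		if (max_iters-- == 0)
			return -ETIMEDOUT;

		ret = fake_swapout(pool, busy_err);
		if (ret == -EBUSY)
			continue;	/* retry, even though nothing changed */
		if (ret < 0)
			return (int)ret;
	}

	pool->pages_allocated += new_pages;
	return 0;
}
```

With a victim whose fence never signals, calling fake_populate() with
busy_err set to -EBUSY spins until the cap is reached, while passing
-ENOSPC makes it fail immediately, so the caller can back out instead of
waiting on a fence that the stopped scheduler will never signal.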