From patchwork Tue May 7 06:19:24 2024
X-Patchwork-Submitter: Yan Zhao
X-Patchwork-Id: 13656305
From: Yan Zhao
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com
Cc: iommu@lists.linux.dev, pbonzini@redhat.com, seanjc@google.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
    corbet@lwn.net, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
    baolu.lu@linux.intel.com, yi.l.liu@intel.com, Yan Zhao
Subject: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
Date: Tue, 7 May 2024 14:19:24 +0800
Message-Id: <20240507061924.20251-1-yan.y.zhao@intel.com>
In-Reply-To: <20240507061802.20184-1-yan.y.zhao@intel.com>
References: <20240507061802.20184-1-yan.y.zhao@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Let pat_pfn_immune_to_uc_mtrr() check the MTRR type for PFNs in the
untracked PAT range.

pat_pfn_immune_to_uc_mtrr() is used by KVM to identify MMIO PFNs and give
them UC memory type in the EPT page tables. When
pat_pfn_immune_to_uc_mtrr() reports a PAT type of UC/WC/UC- for a PFN, the
PFN should be accessed with an uncacheable memory type. Consequently, KVM
maps it as UC in the EPT to ensure that the guest's memory access is
uncacheable.

Internally, pat_pfn_immune_to_uc_mtrr() uses lookup_memtype() to determine
the PAT type of a PFN. For a PFN outside the untracked PAT range, the
returned PAT type is either
- the type set by memtype_reserve() (which, in turn, calls
  pat_x_mtrr_type() to adjust the requested type to UC- if the requested
  type is WB but the MTRR type does not match WB), or
- UC-, if memtype_reserve() has not yet been invoked for this PFN.

However, lookup_memtype() defaults to returning WB for PFNs within the
untracked PAT range, regardless of their actual MTRR type. This behavior
can lead KVM to misclassify such a PFN as non-MMIO and permit cacheable
guest access. Such access might result in an MCE on certain platforms
(e.g., CLFLUSH on the VGA range 0xA0000-0xBFFFF triggers an MCE on some
platforms).

Hence, invoke pat_x_mtrr_type() for PFNs within the untracked PAT range so
that the MTRR type is taken into account, mitigating potential MCEs.

Fixes: b8d7044bcff7 ("x86/mm: add a function to check if a pfn is UC/UC-/WC")
Cc: Kevin Tian
Signed-off-by: Yan Zhao
---
 arch/x86/mm/pat/memtype.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 36b603d0cdde..e85e8c5737ad 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -705,7 +705,17 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
  */
 bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
 {
-	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
+	u64 paddr = PFN_PHYS(pfn);
+	enum page_cache_mode cm;
+
+	/*
+	 * Check MTRR type for untracked pat range since lookup_memtype() always
+	 * returns WB for this range.
+	 */
+	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
+		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE, _PAGE_CACHE_MODE_WB);
+	else
+		cm = lookup_memtype(paddr);
 
 	return cm == _PAGE_CACHE_MODE_UC ||
 	       cm == _PAGE_CACHE_MODE_UC_MINUS ||
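To make the caller-side effect concrete, the following is a minimal
kernel-context sketch, not code from this series: the helper name is made
up, and the real KVM logic considers more inputs (guest MTRRs, non-coherent
DMA, and so on). It only shows the direction of the decision that this
patch changes for PFNs in the untracked PAT range.

/*
 * Illustrative only: choose a host-side memory type for a PFN based on
 * pat_pfn_immune_to_uc_mtrr().
 */
static u8 example_memtype_for_pfn(unsigned long pfn)
{
	/*
	 * With this patch, a PFN in the untracked PAT range (e.g. legacy
	 * VGA) reports UC/WC/UC- here when its MTRR type requires it,
	 * instead of always being reported as WB.
	 */
	if (pat_pfn_immune_to_uc_mtrr(pfn))
		return MTRR_TYPE_UNCACHABLE;	/* keep guest access uncacheable */

	return MTRR_TYPE_WRBACK;		/* cacheable access is safe */
}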
From patchwork Tue May 7 06:20:09 2024
X-Patchwork-Submitter: Yan Zhao
X-Patchwork-Id: 13656306
From: Yan Zhao
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com
Cc: iommu@lists.linux.dev, pbonzini@redhat.com, seanjc@google.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
    corbet@lwn.net, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
    baolu.lu@linux.intel.com, yi.l.liu@intel.com, Yan Zhao
Subject: [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO
Date: Tue, 7 May 2024 14:20:09 +0800
Message-Id: <20240507062009.20336-1-yan.y.zhao@intel.com>
In-Reply-To: <20240507061802.20184-1-yan.y.zhao@intel.com>
References: <20240507061802.20184-1-yan.y.zhao@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Add a fine-grained check to decide whether a PFN that is !pfn_valid() and
identified within the raw e820 table as RAM should be treated as MMIO by
KVM, in order to prevent cacheable guest access.

Previously, a PFN that is !pfn_valid() and identified within the raw e820
table as RAM was not considered MMIO. This covers the scenario where
"mem=" was passed to the kernel, resulting in certain valid pages lacking
an associated struct page. See commit 0c55671f84ff ("kvm, x86: Properly
check whether a pfn is an MMIO or not").

However, that approach only considers guest performance and may allow
cacheable access to potential MMIO PFNs even when
pat_pfn_immune_to_uc_mtrr() identifies the PFN as having a PAT type of
UC/WC/UC-. Therefore, do a fine-grained check of the PAT type in the
primary MMU so that KVM maps the PFN as UC in the EPT, preventing
cacheable guest access.

For the rare case when PAT is not enabled, default the PFN to MMIO to
avoid further checking of MTRRs (the functions for MTRR-related checking
are not currently exported).

Cc: Kevin Tian
Signed-off-by: Yan Zhao
---
 arch/x86/kvm/mmu/spte.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..5db0fb7b74f5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -101,9 +101,21 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 	 */
 	(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
 
+	/*
+	 * If the PFN is invalid and not RAM in the raw e820 table, keep
+	 * treating it as MMIO.
+	 *
+	 * If the PFN is invalid and is RAM in the raw e820 table,
+	 * - if PAT is not enabled, always treat the PFN as MMIO to avoid
+	 *   further checking of MTRRs.
+	 * - if PAT is enabled, treat the PFN as MMIO if its PAT is UC/WC/UC-
+	 *   in the primary MMU.
+	 * This prevents guest cacheable access to MMIO PFNs.
+	 */
 	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 				     pfn_to_hpa(pfn + 1) - 1,
-				     E820_TYPE_RAM);
+				     E820_TYPE_RAM) ? true :
+	       (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
 }
 
 /*
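Pieced together from the hunk above, the post-patch decision for a
!pfn_valid() PFN reads roughly as below. This is a paraphrase for
readability with a made-up function name, not the literal resulting
kvm_is_mmio_pfn() (which handles pfn_valid() PFNs first).

/* Paraphrased decision flow for a !pfn_valid() PFN after this patch. */
static bool example_invalid_pfn_is_mmio(kvm_pfn_t pfn)
{
	/* Not marked as RAM in the raw e820 table: keep treating it as MMIO. */
	if (!e820__mapped_raw_any(pfn_to_hpa(pfn), pfn_to_hpa(pfn + 1) - 1,
				  E820_TYPE_RAM))
		return true;

	/*
	 * RAM in the raw e820 table: without PAT, default to MMIO rather
	 * than consulting MTRRs; with PAT, treat it as MMIO only if the
	 * primary MMU maps it UC/WC/UC-.
	 */
	return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
}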
From patchwork Tue May 7 06:20:44 2024
X-Patchwork-Submitter: Yan Zhao
X-Patchwork-Id: 13656307
From: Yan Zhao
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com
Cc: iommu@lists.linux.dev, pbonzini@redhat.com, seanjc@google.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
    corbet@lwn.net,
    joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
    baolu.lu@linux.intel.com, yi.l.liu@intel.com, Yan Zhao
Subject: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
Date: Tue, 7 May 2024 14:20:44 +0800
Message-Id: <20240507062044.20399-1-yan.y.zhao@intel.com>
In-Reply-To: <20240507061802.20184-1-yan.y.zhao@intel.com>
References: <20240507061802.20184-1-yan.y.zhao@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Introduce and export the interface arch_clean_nonsnoop_dma() to flush CPU
caches for memory involved in non-coherent DMAs (DMAs that lack CPU cache
snooping).

When the IOMMU does not enforce cache coherency, devices are allowed to
perform non-coherent DMAs. This poses a risk of information leakage when
the device is assigned to a VM: a malicious guest could retrieve stale
host data through non-coherent DMA reads of physical memory while the
data initialized by the host (e.g., zeros) still resides only in the
cache.

Additionally, the host kernel (e.g., a KSM kthread) might read
inconsistent data from CPU cache/memory (left by a malicious guest) after
a page is unpinned for non-coherent DMA but before it is freed.

Therefore, VFIO/IOMMUFD must initiate a CPU cache flush for pages involved
in non-coherent DMAs prior to mapping them into, or after unmapping them
from, the IOMMU.

Introduce and export an interface that accepts a contiguous physical
address range and flushes CPU caches in an architecture-specific way for
VFIO/IOMMUFD (currently x86 only).

CLFLUSH on MMIO ranges is generally undesirable on x86 and can even cause
an MCE on certain platforms (e.g., executing CLFLUSH on the VGA range
0xA0000-0xBFFFF causes an MCE on some platforms), while some MMIO ranges
are cacheable and do require CLFLUSH (e.g., certain MMIO ranges for PMEM).
Hence, the host PAT/MTRR is checked to identify uncacheable memory.

This implementation always performs CLFLUSH on "pfn_valid() && !reserved"
pages (since they cannot be MMIO). For the reserved or !pfn_valid() cases,
it checks the host PAT/MTRR to skip uncacheable physical ranges on the
host and performs CLFLUSH on the remaining cacheable ranges.
Cc: Alex Williamson
Cc: Jason Gunthorpe
Cc: Kevin Tian
Suggested-by: Jason Gunthorpe
Signed-off-by: Yan Zhao
---
 arch/x86/include/asm/cacheflush.h |  3 ++
 arch/x86/mm/pat/set_memory.c      | 88 +++++++++++++++++++++++++++++++
 include/linux/cacheflush.h        |  6 +++
 3 files changed, 97 insertions(+)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d917a6d0..b63607994285 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,7 @@
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t length);
+#define arch_clean_nonsnoop_dma arch_clean_nonsnoop_dma
+
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 80c9037ffadf..7ff08ad20369 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 
 #include "../mm_internal.h"
 
@@ -349,6 +350,93 @@ void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+/*
+ * Flush a pfn_valid() and !PageReserved() page
+ */
+static void clflush_page(struct page *page)
+{
+	const int size = boot_cpu_data.x86_clflush_size;
+	unsigned int i;
+	void *va;
+
+	va = kmap_local_page(page);
+
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+	for (i = 0; i < PAGE_SIZE; i += size)
+		clflushopt(va + i);
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+
+	kunmap_local(va);
+}
+
+/*
+ * Flush a reserved page or !pfn_valid() PFN.
+ * Flush is not performed if the PFN is accessed in uncacheable type, i.e.
+ * - PAT type is UC/UC-/WC when PAT is enabled
+ * - MTRR type is UC/WC/WT/WP when PAT is not enabled
+ *   (no need to do CLFLUSH even though WT/WP is cacheable).
+ */
+static void clflush_reserved_or_invalid_pfn(unsigned long pfn)
+{
+	const int size = boot_cpu_data.x86_clflush_size;
+	unsigned int i;
+	void *va;
+
+	if (!pat_enabled()) {
+		u64 start = PFN_PHYS(pfn), end = start + PAGE_SIZE;
+		u8 mtrr_type, uniform;
+
+		mtrr_type = mtrr_type_lookup(start, end, &uniform);
+		if (mtrr_type != MTRR_TYPE_WRBACK)
+			return;
+	} else if (pat_pfn_immune_to_uc_mtrr(pfn)) {
+		return;
+	}
+
+	va = memremap(pfn << PAGE_SHIFT, PAGE_SIZE, MEMREMAP_WB);
+	if (!va)
+		return;
+
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+	for (i = 0; i < PAGE_SIZE; i += size)
+		clflushopt(va + i);
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+
+	memunmap(va);
+}
+
+static inline void clflush_pfn(unsigned long pfn)
+{
+	if (pfn_valid(pfn) &&
+	    (!PageReserved(pfn_to_page(pfn)) || is_zero_pfn(pfn)))
+		return clflush_page(pfn_to_page(pfn));
+
+	clflush_reserved_or_invalid_pfn(pfn);
+}
+
+/**
+ * arch_clean_nonsnoop_dma - flush a cache range for non-coherent DMAs
+ *                           (DMAs that lack CPU cache snooping).
+ * @phys_addr: physical address start
+ * @length: number of bytes to flush
+ */
+void arch_clean_nonsnoop_dma(phys_addr_t phys_addr, size_t length)
+{
+	unsigned long nrpages, pfn;
+	unsigned long i;
+
+	pfn = PHYS_PFN(phys_addr);
+	nrpages = PAGE_ALIGN((phys_addr & ~PAGE_MASK) + length) >> PAGE_SHIFT;
+
+	for (i = 0; i < nrpages; i++, pfn++)
+		clflush_pfn(pfn);
+}
+EXPORT_SYMBOL_GPL(arch_clean_nonsnoop_dma);
+
 #ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
 bool cpu_cache_has_invalidate_memregion(void)
 {
diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
index 55f297b2c23f..0bfc6551c6d3 100644
--- a/include/linux/cacheflush.h
+++ b/include/linux/cacheflush.h
@@ -26,4 +26,10 @@ static inline void flush_icache_pages(struct vm_area_struct *vma,
 
 #define flush_icache_page(vma, page)	flush_icache_pages(vma, page, 1)
 
+#ifndef arch_clean_nonsnoop_dma
+static inline void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t length)
+{
+}
+#endif
+
 #endif /* _LINUX_CACHEFLUSH_H */
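As a usage illustration, a kernel-context sketch with hypothetical caller
names follows; the actual callers are added in patches 4 and 5. The
interface is meant to be invoked on a physical range before it first
becomes reachable by non-snooping DMA, and again before the pages are
unpinned.

/* Hypothetical caller: "phys" and "nr_pages" are assumed to describe a
 * pinned, physically contiguous range. */
static void example_map_for_nonsnoop_dma(phys_addr_t phys, unsigned long nr_pages)
{
	/* Write back initialized data so non-snooping reads cannot return
	 * stale memory contents. */
	arch_clean_nonsnoop_dma(phys, nr_pages << PAGE_SHIFT);
	/* ... then map the range into the non-coherent IOMMU domain ... */
}

static void example_unpin_after_nonsnoop_dma(phys_addr_t phys, unsigned long nr_pages)
{
	/* Flush again before unpinning so later users of the pages never
	 * observe lines left inconsistent by the device. */
	arch_clean_nonsnoop_dma(phys, nr_pages << PAGE_SHIFT);
	/* ... then unpin/free the pages ... */
}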
From patchwork Tue May 7 06:21:38 2024
X-Patchwork-Submitter: Yan Zhao
X-Patchwork-Id: 13656308
From: Yan Zhao
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com
Cc: iommu@lists.linux.dev, pbonzini@redhat.com, seanjc@google.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
    corbet@lwn.net, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
    baolu.lu@linux.intel.com, yi.l.liu@intel.com, Yan Zhao
Subject: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
Date: Tue, 7 May 2024 14:21:38 +0800
Message-Id: <20240507062138.20465-1-yan.y.zhao@intel.com>
In-Reply-To: <20240507061802.20184-1-yan.y.zhao@intel.com>
References: <20240507061802.20184-1-yan.y.zhao@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Flush the CPU cache on DMA pages before mapping them into the first
non-coherent domain (a domain that does not enforce cache coherency, i.e.
CPU caches are not force-snooped) and after unmapping them from the last
domain.

Devices attached to non-coherent domains can execute non-coherent DMAs
(DMAs that lack CPU cache snooping) to access physical memory with CPU
caches bypassed. Such a scenario could be exploited by a malicious guest,
allowing it to read stale host data from memory rather than the data
initialized by the host (e.g., zeros) that still sits in the cache,
posing an information leakage risk.

Furthermore, the host kernel (e.g., a KSM thread) might encounter
inconsistent data between the CPU cache and memory (left by a malicious
guest) after a page is unpinned for DMA but before it is recycled.

Therefore, the CPU cache must be flushed before a page becomes accessible
to non-coherent DMAs and after the page becomes inaccessible to
non-coherent DMAs.

However, the CPU cache is not flushed immediately when the page is
unmapped from the last non-coherent domain. Instead, the flushing is
performed lazily, right before the page is unpinned. Take the following
example to illustrate the process. The CPU cache is flushed right before
step 2 and step 5.
1. A page is mapped into a coherent domain.
2. The page is mapped into a non-coherent domain.
3. The page is unmapped from the non-coherent domain, e.g. due to
   hot-unplug.
4. The page is unmapped from the coherent domain.
5. The page is unpinned.

Reasons for adopting this lazy flushing design include:
- There are several unmap paths but only one unpin path. Lazily flushing
  before unpin wipes out the inconsistency between cache and physical
  memory before a page is globally visible, and produces code that is
  simpler, more maintainable and easier to backport.
- It avoids dividing a large unmap range into several smaller ones, or
  allocating additional memory to hold the IOVA-to-HPA relationship.
Reported-by: Jason Gunthorpe
Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: Alex Williamson
Cc: Jason Gunthorpe
Cc: Kevin Tian
Signed-off-by: Yan Zhao
---
 drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b5c15fe8f9fc..ce873f4220bf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@ struct vfio_iommu {
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
+	bool			has_noncoherent_domain;
 	struct list_head	emulated_iommu_groups;
 };
 
@@ -99,6 +100,7 @@ struct vfio_dma {
 	unsigned long		*bitmap;
 	struct mm_struct	*mm;
 	size_t			locked_vm;
+	bool			cache_flush_required; /* For noncoherent domain */
 };
 
 struct vfio_batch {
@@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	long unlocked = 0, locked = 0;
 	long i;
 
+	if (dma->cache_flush_required)
+		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
+
 	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
 		if (put_pfn(pfn++, dma->prot)) {
 			unlocked++;
@@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 					    &iotlb_gather);
 	}
 
+	dma->cache_flush_required = false;
+
 	if (do_accounting) {
 		vfio_lock_acct(dma, -unlocked, true);
 		return 0;
@@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	iommu->dma_avail++;
 }
 
+static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	bool has_noncoherent = false;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (domain->enforce_cache_coherency)
+			continue;
+
+		has_noncoherent = true;
+		break;
+	}
+	iommu->has_noncoherent_domain = has_noncoherent;
+}
+
 static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
 {
 	struct vfio_domain *domain;
@@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	vfio_batch_init(&batch);
 
+	/*
+	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
+	 * for both pin & map and unmap & unpin (for unwind) paths.
+	 */
+	dma->cache_flush_required = iommu->has_noncoherent_domain;
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
 		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
@@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			break;
 		}
 
+		if (dma->cache_flush_required)
+			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
+						npage << PAGE_SHIFT);
+
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
 				     dma->prot);
 
@@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
+		bool cache_flush_required;
 
 		dma = rb_entry(n, struct vfio_dma, node);
 		iova = dma->iova;
+		cache_flush_required = !domain->enforce_cache_coherency &&
+				       !dma->cache_flush_required;
+		if (cache_flush_required)
+			dma->cache_flush_required = true;
 
 		while (iova < dma->iova + dma->size) {
 			phys_addr_t phys;
@@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				size = npage << PAGE_SHIFT;
 			}
 
+			if (cache_flush_required)
+				arch_clean_nonsnoop_dma(phys, size);
+
 			ret = iommu_map(domain->domain, iova, phys, size,
 					dma->prot | IOMMU_CACHE,
 					GFP_KERNEL_ACCOUNT);
@@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
 						size >> PAGE_SHIFT, true);
 		}
+		dma->cache_flush_required = false;
 	}
 
 	vfio_batch_fini(&batch);
@@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 	if (!pages)
 		return;
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	list_for_each_entry(region, regions, list) {
 		start = ALIGN(region->start, PAGE_SIZE * 2);
 		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
@@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 			break;
 	}
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	__free_pages(pages, order);
 }
 
@@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	list_add(&domain->next, &iommu->domain_list);
 	vfio_update_pgsize_bitmap(iommu);
+	if (!domain->enforce_cache_coherency)
+		vfio_update_noncoherent_domain_state(iommu);
 done:
 	/* Delete the old one and insert new iova list */
 	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
@@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		}
 		iommu_domain_free(domain->domain);
 		list_del(&domain->next);
+		if (!domain->enforce_cache_coherency)
+			vfio_update_noncoherent_domain_state(iommu);
 		kfree(domain);
 		vfio_iommu_aper_expand(iommu, &iova_copy);
 		vfio_update_pgsize_bitmap(iommu);
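The ordering described in the changelog can be condensed into the
following self-contained toy model (plain userspace C, not VFIO code):
flush() stands in for arch_clean_nonsnoop_dma(), and the domains are
reduced to a coherent/non-coherent flag.

#include <stdbool.h>
#include <stdio.h>

struct toy_dma {
	bool cache_flush_required;	/* set once a non-coherent domain is involved */
};

static void flush(const char *why)
{
	printf("flush CPU cache (%s)\n", why);
}

static void toy_map(struct toy_dma *dma, bool coherent)
{
	if (!coherent && !dma->cache_flush_required) {
		dma->cache_flush_required = true;
		flush("before first non-coherent mapping");	/* step 2 */
	}
}

static void toy_unmap(struct toy_dma *dma, bool coherent)
{
	/* No flush here: the unmap paths stay untouched (lazy design). */
	(void)dma;
	(void)coherent;
}

static void toy_unpin(struct toy_dma *dma)
{
	if (dma->cache_flush_required)
		flush("before unpin");				/* step 5 */
}

int main(void)
{
	struct toy_dma dma = { false };

	toy_map(&dma, true);	/* 1. map into a coherent domain     */
	toy_map(&dma, false);	/* 2. map into a non-coherent domain */
	toy_unmap(&dma, false);	/* 3. unmap, e.g. due to hot-unplug  */
	toy_unmap(&dma, true);	/* 4. unmap from the coherent domain */
	toy_unpin(&dma);	/* 5. unpin                          */
	return 0;
}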
From patchwork Tue May 7 06:22:12 2024
X-Patchwork-Submitter: Yan Zhao
X-Patchwork-Id: 13656309
From: Yan Zhao
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com
Cc: iommu@lists.linux.dev, pbonzini@redhat.com, seanjc@google.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
    corbet@lwn.net, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
    baolu.lu@linux.intel.com, yi.l.liu@intel.com, Yan Zhao
Subject: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
Date: Tue, 7 May 2024 14:22:12 +0800
Message-Id: <20240507062212.20535-1-yan.y.zhao@intel.com>
In-Reply-To: <20240507061802.20184-1-yan.y.zhao@intel.com>
References: <20240507061802.20184-1-yan.y.zhao@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Flush the CPU cache on DMA pages before mapping them into the first
non-coherent domain (a domain that does not enforce cache coherency, i.e.
CPU caches are not force-snooped) and after unmapping them from the last
domain.

Devices attached to non-coherent domains can execute non-coherent DMAs
(DMAs that lack CPU cache snooping) to access physical memory with CPU
caches bypassed. Such a scenario could be exploited by a malicious guest,
allowing it to read stale host data from memory rather than the data
initialized by the host (e.g., zeros) that still sits in the cache,
posing an information leakage risk.
Furthermore, the host kernel (e.g., a KSM thread) might encounter
inconsistent data between the CPU cache and memory (left by a malicious
guest) after a page is unpinned for DMA but before it is recycled.

Therefore, the CPU cache must be flushed before a page becomes accessible
to non-coherent DMAs and after the page becomes inaccessible to
non-coherent DMAs.

However, the CPU cache is not flushed immediately when the page is
unmapped from the last non-coherent domain. Instead, the flushing is
performed lazily, right before the page is unpinned. Take the following
example to illustrate the process. The CPU cache is flushed right before
step 2 and step 5.
1. A page is mapped into a coherent domain.
2. The page is mapped into a non-coherent domain.
3. The page is unmapped from the non-coherent domain, e.g. due to
   hot-unplug.
4. The page is unmapped from the coherent domain.
5. The page is unpinned.

Reasons for adopting this lazy flushing design include:
- There are several unmap paths but only one unpin path. Lazily flushing
  before unpin wipes out the inconsistency between cache and physical
  memory before a page is globally visible, and produces code that is
  simpler, more maintainable and easier to backport.
- It avoids dividing a large unmap range into several smaller ones, or
  allocating additional memory to hold the IOVA-to-HPA relationship.

Unlike the "has_noncoherent_domain" flag used in vfio_iommu, a
"noncoherent_domain_cnt" counter is implemented in io_pagetable to track
whether an iopt has non-coherent domains attached. The difference is
because, in iommufd, only a paging hwpt carries the
"enforce_cache_coherency" flag, and the iommu domains in an io_pagetable
have no "enforce_cache_coherency" flag like the one in vfio_domain. A
counter in io_pagetable avoids traversing ioas->hwpt_list and holding
ioas->mutex.
Reported-by: Jason Gunthorpe
Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
Fixes: e8d57210035b ("iommufd: Add kAPI toward external drivers for physical devices")
Cc: Alex Williamson
Cc: Jason Gunthorpe
Cc: Kevin Tian
Signed-off-by: Yan Zhao
---
 drivers/iommu/iommufd/hw_pagetable.c    | 19 +++++++++--
 drivers/iommu/iommufd/io_pagetable.h    |  5 +++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 drivers/iommu/iommufd/pages.c           | 44 +++++++++++++++++++++++--
 4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 33d142f8057d..e3099d732c5c 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -14,12 +14,18 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
 		container_of(obj, struct iommufd_hwpt_paging, common.obj);
 
 	if (!list_empty(&hwpt_paging->hwpt_item)) {
+		struct io_pagetable *iopt = &hwpt_paging->ioas->iopt;
 		mutex_lock(&hwpt_paging->ioas->mutex);
 		list_del(&hwpt_paging->hwpt_item);
 		mutex_unlock(&hwpt_paging->ioas->mutex);
 
-		iopt_table_remove_domain(&hwpt_paging->ioas->iopt,
-					 hwpt_paging->common.domain);
+		iopt_table_remove_domain(iopt, hwpt_paging->common.domain);
+
+		if (!hwpt_paging->enforce_cache_coherency) {
+			down_write(&iopt->domains_rwsem);
+			iopt->noncoherent_domain_cnt--;
+			up_write(&iopt->domains_rwsem);
+		}
 	}
 
 	if (hwpt_paging->common.domain)
@@ -176,6 +182,12 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			goto out_abort;
 	}
 
+	if (!hwpt_paging->enforce_cache_coherency) {
+		down_write(&ioas->iopt.domains_rwsem);
+		ioas->iopt.noncoherent_domain_cnt++;
+		up_write(&ioas->iopt.domains_rwsem);
+	}
+
 	rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain);
 	if (rc)
 		goto out_detach;
@@ -183,6 +195,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	return hwpt_paging;
 
 out_detach:
+	down_write(&ioas->iopt.domains_rwsem);
+	ioas->iopt.noncoherent_domain_cnt--;
+	up_write(&ioas->iopt.domains_rwsem);
 	if (immediate_attach)
 		iommufd_hw_pagetable_detach(idev);
 out_abort:
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 0ec3509b7e33..557da8fb83d9 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -198,6 +198,11 @@ struct iopt_pages {
 	void __user *uptr;
 	bool writable:1;
 	u8 account_mode;
+	/*
+	 * CPU cache flush is required before mapping the pages to or after
+	 * unmapping it from a noncoherent domain
+	 */
+	bool cache_flush_required:1;
 
 	struct xarray pinned_pfns;
 	/* Of iopt_pages_access::node */
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 991f864d1f9b..fc77fd43b232 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -53,6 +53,7 @@ struct io_pagetable {
 	struct rb_root_cached reserved_itree;
 	u8 disable_large_pages;
 	unsigned long iova_alignment;
+	unsigned int noncoherent_domain_cnt;
 };
 
 void iopt_init_table(struct io_pagetable *iopt);
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 528f356238b3..8f4b939cba5b 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -272,6 +272,17 @@ struct pfn_batch {
 	unsigned int total_pfns;
 };
 
+static void iopt_cache_flush_pfn_batch(struct pfn_batch *batch)
+{
+	unsigned long cur, i;
+
+	for (cur = 0; cur < batch->end; cur++) {
+		for (i = 0; i < batch->npfns[cur]; i++)
+			arch_clean_nonsnoop_dma(PFN_PHYS(batch->pfns[cur] + i),
+						PAGE_SIZE);
+	}
+}
+
 static void batch_clear(struct pfn_batch *batch)
 {
 	batch->total_pfns = 0;
@@ -637,10 +648,18 @@ static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
 	while (npages) {
 		size_t to_unpin = min_t(size_t, npages,
 					batch->npfns[cur] - first_page_off);
+		unsigned long pfn = batch->pfns[cur] + first_page_off;
+
+		/*
+		 * Lazily flushing CPU caches when a page is about to be
+		 * unpinned if the page was mapped into a noncoherent domain
+		 */
+		if (pages->cache_flush_required)
+			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
+						to_unpin << PAGE_SHIFT);
 
 		unpin_user_page_range_dirty_lock(
-			pfn_to_page(batch->pfns[cur] + first_page_off),
-			to_unpin, pages->writable);
+			pfn_to_page(pfn), to_unpin, pages->writable);
 		iopt_pages_sub_npinned(pages, to_unpin);
 		cur++;
 		first_page_off = 0;
@@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 {
 	unsigned long done_end_index;
 	struct pfn_reader pfns;
+	bool cache_flush_required;
 	int rc;
 
 	lockdep_assert_held(&area->pages->mutex);
 
+	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
+			       !area->pages->cache_flush_required;
+
+	if (cache_flush_required)
+		area->pages->cache_flush_required = true;
+
 	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
 			      iopt_area_last_index(area));
 	if (rc)
@@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 	while (!pfn_reader_done(&pfns)) {
 		done_end_index = pfns.batch_start_index;
+		if (cache_flush_required)
+			iopt_cache_flush_pfn_batch(&pfns.batch);
+
 		rc = batch_to_domain(&pfns.batch, domain, area,
 				     pfns.batch_start_index);
 		if (rc)
@@ -1413,6 +1442,7 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	unsigned long unmap_index;
 	struct pfn_reader pfns;
 	unsigned long index;
+	bool cache_flush_required;
 	int rc;
 
 	lockdep_assert_held(&area->iopt->domains_rwsem);
@@ -1426,9 +1456,19 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	if (rc)
 		goto out_unlock;
 
+	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
+			       !pages->cache_flush_required;
+
+	if (cache_flush_required)
+		pages->cache_flush_required = true;
+
 	while (!pfn_reader_done(&pfns)) {
 		done_first_end_index = pfns.batch_end_index;
 		done_all_end_index = pfns.batch_start_index;
+
+		if (cache_flush_required)
+			iopt_cache_flush_pfn_batch(&pfns.batch);
+
 		xa_for_each(&area->iopt->domains, index, domain) {
 			rc = batch_to_domain(&pfns.batch, domain, area,
 					     pfns.batch_start_index);
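The counter-based tracking can be illustrated with a small self-contained
model (plain C with a pthread rwlock standing in for domains_rwsem; not
iommufd code). It shows the two pieces this patch adds: attach/detach
adjusts the counter under the write lock, and domain filling flushes a
batch only the first time the pages become reachable through a
non-coherent domain.

#include <pthread.h>
#include <stdbool.h>

struct toy_iopt {
	pthread_rwlock_t domains_rwsem;
	unsigned int noncoherent_domain_cnt;
};

struct toy_pages {
	bool cache_flush_required;	/* sticky until the pages are unpinned */
};

static void toy_attach_domain(struct toy_iopt *iopt, bool enforce_cache_coherency)
{
	if (enforce_cache_coherency)
		return;

	pthread_rwlock_wrlock(&iopt->domains_rwsem);
	iopt->noncoherent_domain_cnt++;
	pthread_rwlock_unlock(&iopt->domains_rwsem);
}

static void toy_detach_domain(struct toy_iopt *iopt, bool enforce_cache_coherency)
{
	if (enforce_cache_coherency)
		return;

	pthread_rwlock_wrlock(&iopt->domains_rwsem);
	iopt->noncoherent_domain_cnt--;
	pthread_rwlock_unlock(&iopt->domains_rwsem);
}

/*
 * Called while filling a domain; the real code already holds domains_rwsem
 * here. Returns true only for the first fill that involves a non-coherent
 * domain, which is when the CPU cache flush is issued.
 */
static bool toy_need_flush(struct toy_iopt *iopt, struct toy_pages *pages)
{
	bool need = iopt->noncoherent_domain_cnt && !pages->cache_flush_required;

	if (need)
		pages->cache_flush_required = true;
	return need;
}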