
[RFC] arm64: cpufeatures: Add support for tlbi range maintenance

Message ID 1572417685-32955-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive)
State New, archived
Series [RFC] arm64: cpufeatures: Add support for tlbi range maintenance

Commit Message

Shaokun Zhang Oct. 30, 2019, 6:41 a.m. UTC
From: Tangnianyao <tangnianyao@huawei.com>

ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
range of input addresses. Add support for this feature and provide an
alternative implementation of flush_tlb_range() that uses the TLBI
range instructions.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org> 
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Tangnianyao <tangnianyao@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
Assembling the ARMv8.4-TLBI range instructions requires binutils 2.30
or later.
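
For reference, a single range operation covers (num + 1) * 2^(5 * scale + 1)
translation granules. Below is a minimal userspace sketch of how a page count
could be decomposed into that (scale, num) pair; the helper is purely
illustrative and not part of the patch:

#include <stdbool.h>

/*
 * Decompose a page count into the (scale, num) pair of one range operation,
 * assuming the span formula (num + 1) * 2^(5 * scale + 1) granules.
 * Returns false when a single operation cannot cover the count exactly.
 */
static bool pages_to_scale_num(unsigned long pages,
			       unsigned int *scale, unsigned int *num)
{
	unsigned int s;

	for (s = 0; s <= 3; s++) {
		unsigned long span = 1UL << (5 * s + 1);	/* pages per NUM step */

		if (pages && pages % span == 0 && pages / span <= 32) {
			*scale = s;
			*num = pages / span - 1;
			return true;
		}
	}
	return false;	/* would need to be split into several operations */
}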

 arch/arm64/include/asm/cpucaps.h  |  3 +-
 arch/arm64/include/asm/sysreg.h   |  5 +++
 arch/arm64/include/asm/tlbflush.h | 75 ++++++++++++++++++++++++++++++++++++++-
 arch/arm64/kernel/cpufeature.c    | 10 ++++++
 4 files changed, 91 insertions(+), 2 deletions(-)

Comments

Catalin Marinas Oct. 30, 2019, 5:34 p.m. UTC | #1
On Wed, Oct 30, 2019 at 02:41:25PM +0800, Shaokun Zhang wrote:
> ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
> range of input addresses. Add support for this feature and provide an
> alternative implementation of flush_tlb_range() that uses the TLBI
> range instructions.

Do you have any performance numbers in favour of this patch? Last time
we looked, it didn't really matter for Linux since most user TLBI ranges
were 1 or 2 pages long. Of course you can write some mprotect() loop to
show that it matters but I'm rather interested in real world impact.
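
For the synthetic case, something along these lines would do (the mapping
size and iteration count below are arbitrary), but it only demonstrates the
best case for the range instructions, not real workloads:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

#define NPAGES	512
#define ITERS	10000

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NPAGES * page;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	int i;

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0, len);		/* fault the pages in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		/* each permission change flushes the TLB for the whole range */
		mprotect(p, len, PROT_READ);
		mprotect(p, len, PROT_READ | PROT_WRITE);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.3f s\n", (t1.tv_sec - t0.tv_sec) +
			   (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}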
Shaokun Zhang Oct. 31, 2019, 7:38 a.m. UTC | #2
Hi Catalin,

On 2019/10/31 1:34, Catalin Marinas wrote:
> On Wed, Oct 30, 2019 at 02:41:25PM +0800, Shaokun Zhang wrote:
>> ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
>> range of input addresses. Add support for this feature and provide an
>> alternative implementation of flush_tlb_range() that uses the TLBI
>> range instructions.
> 
> Do you have any performance numbers in favour of this patch? Last time

We don't have performance data yet, and we would also like to see what this
feature buys us, but the kernel doesn't support it at the moment, so I sent
this as an RFC first.

> we looked, it didn't really matter for Linux since most user TLBI ranges
> were 1 or 2 pages long. Of course you can write some mprotect() loop to

Thanks for the guidance; I hadn't paid much attention to this and will look
into it further. Since the Arm ARM provides the feature, I expect it should
bring some performance benefit. ;-)

> show that it matters but I'm rather interested in real world impact.

Okay, I will run some server workloads, such as Java applications like Hadoop,
to check whether long contiguous TLBI ranges are actually issued.
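
One simple first step might be a debug-only histogram of how many pages each
flush_tlb_range() call covers; a rough sketch (names invented here, assuming
the usual log2/atomic helpers are visible next to __flush_tlb_range()):

/* hypothetical debug counters: [0] = 1 page, [1] = 2-3, [2] = 4-7, ... */
static atomic64_t tlbi_range_hist[8];

static inline void tlbi_range_account(unsigned long start, unsigned long end)
{
	unsigned long pages = (end - start) >> PAGE_SHIFT;
	int bucket = pages ? min_t(int, ilog2(pages), 7) : 0;

	atomic64_inc(&tlbi_range_hist[bucket]);
}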

Thanks,
Shaokun

>
Will Deacon Oct. 31, 2019, 1:16 p.m. UTC | #3
On Wed, Oct 30, 2019 at 05:34:00PM +0000, Catalin Marinas wrote:
> On Wed, Oct 30, 2019 at 02:41:25PM +0800, Shaokun Zhang wrote:
> > ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
> > range of input addresses. Add support for this feature and provide an
> > alternative implementation of flush_tlb_range() that uses the TLBI
> > range instructions.
> 
> Do you have any performance numbers in favour of this patch? Last time
> we looked, it didn't really matter for Linux since most user TLBI ranges
> were 1 or 2 pages long. Of course you can write some mprotect() loop to
> show that it matters but I'm rather interested in real world impact.

The other major concern I have with this patch relates to the feature in
general: my understanding is that probing for the presence of the
instruction at the CPU level tells you nothing about whether or not it's
supported by the interconnect/DVM implementation.

There's this ominous and badly-worded note in the Arm ARM:

  | The set of masters containing TLBs that can be affected by the TLB range
  | maintenance instructions are defined by the system architecture. This means
  | that all masters in a system might not contain TLBs within the defined
  | shareability domains.

which I think makes this practically useless without something like a
firmware-based discoverability mechanism.
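
Even if firmware did tell us, I'd expect the probing side to end up looking
something like the sketch below, where the firmware query is entirely made up
for illustration:

static bool has_tlbi_range(const struct arm64_cpu_capabilities *entry,
			   int scope)
{
	/* CPU-level support alone isn't enough ... */
	if (!has_cpuid_feature(entry, scope))
		return false;

	/* ... so gate it on a (hypothetical) system-level guarantee */
	return firmware_supports_tlbi_range();
}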

Will

Patch

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index ac1dbca3d0cd..2f88725263d2 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -54,7 +54,8 @@ 
 #define ARM64_WORKAROUND_1463225		44
 #define ARM64_WORKAROUND_CAVIUM_TX2_219_TVM	45
 #define ARM64_WORKAROUND_CAVIUM_TX2_219_PRFM	46
+#define ARM64_HAS_TLBI_EXT			47
 
-#define ARM64_NCAPS				47
+#define ARM64_NCAPS				48
 
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 6e919fafb43d..cfb7551ea37d 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -539,6 +539,7 @@ 
 			 ENDIAN_SET_EL1 | SCTLR_EL1_UCI  | SCTLR_EL1_RES1)
 
 /* id_aa64isar0 */
+#define ID_AA64ISAR0_TLB_SHIFT		56
 #define ID_AA64ISAR0_TS_SHIFT		52
 #define ID_AA64ISAR0_FHM_SHIFT		48
 #define ID_AA64ISAR0_DP_SHIFT		44
@@ -552,6 +553,10 @@ 
 #define ID_AA64ISAR0_SHA1_SHIFT		8
 #define ID_AA64ISAR0_AES_SHIFT		4
 
+#define ID_AA64ISAR0_TLB_NI		0
+#define ID_AA64ISAR0_TLB_OS		1
+#define ID_AA64ISAR0_TLB_OS_RANGE	2
+
 /* id_aa64isar1 */
 #define ID_AA64ISAR1_SB_SHIFT		36
 #define ID_AA64ISAR1_FRINTTS_SHIFT	32
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc3949064725..c4ece64aa500 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -59,6 +59,27 @@ 
 		__ta;						\
 	})
 
+/* This macro creates a properly formatted VA operand for the TLBI extension */
+#define __TLBI_VADDR_EXT(addr, asid, tg, scale, num, ttl)	\
+	({							\
+		unsigned long __ta = (addr) >> 12;		\
+		__ta &= GENMASK_ULL(43, 0);			\
+		__ta |= (unsigned long)(asid) << 48;		\
+		__ta |= (unsigned long)(tg) << 46;		\
+		__ta |= (unsigned long)(scale) << 44;		\
+		__ta |= (unsigned long)(num) << 39;		\
+		__ta |= (unsigned long)(ttl) << 37;		\
+		__ta;						\
+	})
+
+#ifdef CONFIG_ARM64_64K_PAGES
+#define TLBI_TG_FLAGS	UL(1)
+#elif defined(CONFIG_ARM64_16K_PAGES)
+#define TLBI_TG_FLAGS	UL(2)
+#else /* CONFIG_ARM64_4K_PAGES */
+#define TLBI_TG_FLAGS	UL(0)
+#endif
+
 /*
  *	TLB Invalidation
  *	================
@@ -211,6 +232,55 @@  static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline void __flush_tlb_range_ext(struct vm_area_struct *vma,
+				     unsigned long start, unsigned long end,
+				     unsigned long stride, bool last_level)
+{
+	unsigned long asid = ASID(vma->vm_mm);
+	unsigned long addr;
+	unsigned long scale;
+	unsigned long num;
+	unsigned long range;
+	unsigned long unit_shift;
+
+	start = round_down(start, stride);
+	end = round_up(end, stride);
+
+	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
+		flush_tlb_mm(vma->vm_mm);
+		return;
+	}
+
+	/*
+	 * Decompose the byte range into the (scale, num) pair used by the
+	 * range TLBI operand; a single operation spans
+	 * (num + 1) << (5 * scale + 1) pages starting at "start".
+	 */
+	range = (end - start) >> (PAGE_SHIFT + 5);
+	scale = 0;
+	while (range) {
+		range = (range >> 5);
+		scale++;
+	}
+
+	unit_shift = PAGE_SHIFT + 5 * scale + 1;
+	num = DIV_ROUND_UP(end - start, BIT(unit_shift)) - 1;
+
+	/* build the range-formatted operand and use it below */
+	addr = __TLBI_VADDR_EXT(start, asid, TLBI_TG_FLAGS,
+				scale, num, 0);
+
+	dsb(ishst);
+	if (last_level) {
+		__tlbi(rvale1is, addr);
+		__tlbi_user(rvale1is, addr);
+	} else {
+		__tlbi(rvae1is, addr);
+		__tlbi_user(rvae1is, addr);
+	}
+	dsb(ish);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
@@ -218,7 +288,10 @@  static inline void flush_tlb_range(struct vm_area_struct *vma,
 	 * We cannot use leaf-only invalidation here, since we may be invalidating
 	 * table entries as part of collapsing hugepages or moving page tables.
 	 */
-	__flush_tlb_range(vma, start, end, PAGE_SIZE, false);
+	if (!preemptible() && this_cpu_has_cap(ARM64_HAS_TLBI_EXT))
+		__flush_tlb_range_ext(vma, start, end, PAGE_SIZE, false);
+	else
+		__flush_tlb_range(vma, start, end, PAGE_SIZE, false);
 }
 
 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 80f459ad0190..a0aaab3f7fd5 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1566,6 +1566,16 @@  static const struct arm64_cpu_capabilities arm64_features[] = {
 		.min_field_value = 1,
 	},
 #endif
+	{
+		.desc = "TLB maintenance and TLB range instructions",
+		.capability = ARM64_HAS_TLBI_EXT,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_cpuid_feature,
+		.sys_reg = SYS_ID_AA64ISAR0_EL1,
+		.field_pos = ID_AA64ISAR0_TLB_SHIFT,
+		.sign = FTR_UNSIGNED,
+		.min_field_value = ID_AA64ISAR0_TLB_OS_RANGE,
+	},
 	{},
 };