
arm64: Implement prctl(PR_{G,S}ET_TSC)

Message ID 20240427052754.2014024-1-pcc@google.com (mailing list archive)
State New, archived
Series arm64: Implement prctl(PR_{G,S}ET_TSC)

Commit Message

Peter Collingbourne April 27, 2024, 5:27 a.m. UTC
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
---
 arch/arm64/include/asm/processor.h   |  5 ++
 arch/arm64/include/asm/thread_info.h |  2 +
 arch/arm64/kernel/process.c          | 70 +++++++++++++++++++++++-----
 arch/arm64/kernel/traps.c            | 20 +++++---
 4 files changed, 79 insertions(+), 18 deletions(-)

Comments

Marc Zyngier April 28, 2024, 10:37 a.m. UTC | #1
On Sat, 27 Apr 2024 06:27:51 +0100,
Peter Collingbourne <pcc@google.com> wrote:
> 
> On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
> CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
> implement various erratum workarounds, check whether the CPU needs
> a workaround whenever we potentially need to change it.

Why would we ever consider preventing access to CNTVCT_EL0?  This
register is part of the arm64 ABI, together with CNTFRQ_EL0.
Disabling it also prevents the VDSO from working correctly, making
something as simple as gettimeofday() unexpectedly fail.

I'm sure you have some rationale behind it, but it is nowhere to be
found above, and I cannot see why we'd want to mimic whatever x86
does.

Thanks,

	M.
Peter Collingbourne April 28, 2024, 6:47 p.m. UTC | #2
On Sun, Apr 28, 2024 at 3:37 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 27 Apr 2024 06:27:51 +0100,
> Peter Collingbourne <pcc@google.com> wrote:
> >
> > On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
> > CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
> > implement various erratum workarounds, check whether the CPU needs
> > a workaround whenever we potentially need to change it.
>
> Why would we ever consider preventing access to CNTVCT_EL0?  This
> register is part of the arm64 ABI, together with CNTFRQ_EL0.
> Disabling it also prevents the VDSO from working correctly, making
> something as simple as gettimeofday() unexpectedly fail.
>
> I'm sure you have some rationale behind it, but it is nowhere to be
> found above, and I cannot see why we'd want to mimic whatever x86
> does.

Hi Marc,

This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access to
architected timers exposed directly to the program. rr traps RDTSC on
x86 for the same reason.

I'd be happy to add that to the commit message for v2.

Peter
Marc Zyngier April 29, 2024, 7:34 a.m. UTC | #3
On Sun, 28 Apr 2024 19:47:56 +0100,
Peter Collingbourne <pcc@google.com> wrote:
> 
> On Sun, Apr 28, 2024 at 3:37 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Sat, 27 Apr 2024 06:27:51 +0100,
> > Peter Collingbourne <pcc@google.com> wrote:
> > >
> > > On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
> > > CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
> > > implement various erratum workarounds, check whether the CPU needs
> > > a workaround whenever we potentially need to change it.
> >
> > Why would we ever consider preventing access to CNTVCT_EL0?  This
> > register is part of the arm64 ABI, together with CNTFRQ_EL0.
> > Disabling it also prevents the VDSO from working correctly, making
> > something as simple as gettimeofday() unexpectedly fail.
> >
> > I'm sure you have some rationale behind it, but it is nowhere to be
> > found above, and I cannot see why we'd want to mimic whatever x86
> > does.
> 
> Hi Marc,
> 
> This is needed for a correct implementation of non-instrumenting
> record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
> rr must trap and record any sources of non-determinism from the
> userspace program's perspective so it can be replayed later. This
> includes the results of syscalls as well as the results of access to
> architected timers exposed directly to the program. rr traps RDTSC on
> x86 for the same reason.

It seems to me that this sort of "trap and inspect" behaviour is in
the realm of ptrace(), and not that of prctl(), because I can't
imagine the debugged program calling that by itself.

My rationale for this is that on x86, TSC wasn't always present. It
was only added fairly late in the game and thus userspace couldn't
always rely on it being present. That is pretty different from arm64,
which has always had CNTVCT_EL0 and co. as an integral part of the ABI.

Giving a way to limit the current ABI seems at best odd. On the
contrary, ptrace() does what it says on the tin: it intercepts what
the traced process does.

How does rr work on non-x86 architectures? How does this interoperate
with AArch32?

> I'd be happy to add that to the commit message for v2.

Please, in any case. Possibly with a reference to 8fb402bccf203 so
that people can perform some archaeology on the origins of this thing.

Thanks,

	M.
Keno Fischer April 29, 2024, 8:35 a.m. UTC | #4
First, a note that this has been previously discussed on this list for
the same motivation and with the same questions asked, so let me link
that thread:

Link: https://lore.kernel.org/linux-arm-kernel/CAP045ApiMSvP--f2E0=VdMbjE8oibvy921m8JASf4kaCCuU2RA@mail.gmail.com/T/

> It seems to me that this sort of "trap and inspect" behaviour is in
> the realm of ptrace(), and not that of prctl(), because I can't
> imagine the debugged program calling that by itself.

In the rr use case, this is indeed used to force a ptrace signal stop
for trap/inspect (or emulate in replay mode). Of course the history
of PR_SET_TSC is complicated and was not originally intended for
this use case. Rather, it was intended as a hardening mechanism
against cache-timing/speculative execution attacks (this was
pre-Spectre, so all the discussion at the time is a bit theoretical).
Of course, the Spectre experience has shown that you don't actually
need architectural timers for any of this, so perhaps it wasn't
all that useful in the first place.

I think this could reasonably be made a PTRACE_EVENT, but that would
not be usable for hardening if somebody wanted to do that in the future
(I am not personally aware of any such request, but it doesn't seem
unreasonable). Of course, my concerns about adding yet another kind
of ptrace stop mentioned last time still exist.

Perhaps a third option would be to put this into seccomp instead. One
could imagine a flag addition to SECCOMP_SET_MODE_FILTER that
would enable filtering for all otherwise emulated instructions (not just
a CNTVCT_EL0 read) and an appropriate bpf program to decide
allow/disallow/trap/signal/whatever. Of course, there are folks already
unhappy with using seccomp for non-security purposes (as rr already
does), so there may be some objection there.

Similar concerns about making ptrace stops more complicated apply
(basically the issue is that what you can do to a tracee at a
ptrace stop very much depends on where exactly in the kernel the
tracee is stopped, which ptrace doesn't always tell you,
so to be fully correct, ptracers need to do some very elaborate
state machine tracking).

> How does rr work on non-x86 architectures? How does this interoperate
> with AArch32?

rr has significant microarchitecture dependence. It works on most modern x86
chips (AMD worse than Intel, with the exception of some Intel Atom uarchs)
and more recent high-end AArch64 chips, as long as the programs only use
LSE atomics and not LL/SC. AArch32 is not supported. See here for a list:

https://github.com/rr-debugger/rr/blob/822863c92d8f52778700f0375ac1b705193a2152/src/PerfCounters.cc#L198-L237

Thanks,
Keno
Peter Collingbourne April 29, 2024, 6:46 p.m. UTC | #5
On Mon, Apr 29, 2024 at 1:36 AM Keno Fischer <keno@juliahub.com> wrote:
>
> First a note that this has been previously discussed on this list for
> the same motivation and with the same questions asked, so let
> me link that first:
>
> Link: https://lore.kernel.org/linux-arm-kernel/CAP045ApiMSvP--f2E0=VdMbjE8oibvy921m8JASf4kaCCuU2RA@mail.gmail.com/T/

Ah, I wasn't aware that this was previously discussed. I agree with
your points from that thread regarding prctl vs ptrace.

> > It seems to me that this sort of "trap and inspect" behaviour is in
> > the realm of ptrace(), and not that of prctl(), because I can't
> > imagine the debugged program calling that by itself.
>
> In the rr use case, this is indeed used to force a ptrace signal stop
> for trap/inspect (or emulate in replay mode). Of course the history
> of PR_SET_TSC is complicated and was not originally intended for
> this use case. Rather, it was intended as a hardening mechanism
> against cache-timing/speculative execution attacks (this was
> pre-Spectre, so all the discussion at the time is a bit theoretical).
> Of course, the Spectre experience has shown that you don't really
> actually need architectural timers for any of this, so perhaps it wasn't
> all that useful in the first place.

That was what I originally thought as well, but then I went back to
look at the history:
https://lore.kernel.org/all/Pine.GSO.4.56.0804131126080.28138@keg.few.vu.nl/

It looks like the original author was also using this for
deterministic replay. So rr is using this prctl for its original
purpose.

It's also worth noting that the name of the prctl was chosen with the
intent that it could be implemented on other architectures.
https://lore.kernel.org/all/4813B1EF.5090806@zytor.com/

> I think this could reasonably be made a PTRACE_EVENT, but that would
> not be usable for hardening if somebody wanted to do that in the future
> (I am not personally aware of any such request, but it doesn't seem
> unreasonable). Of course, my concerns about adding yet another kind
> of ptrace stop mentioned last time still exist.

Yes, in general an in-process control seems more useful than an
out-of-process control. Anything that you would be able to do with
ptrace could also be done with prctl (the tracer can inject a call to
the prctl and handle signal-delivery-stops), and it avoids needing an
additional process in cases where one is not otherwise necessary. An
extra tracer complicates debugging of the ptraced process, since a
process cannot have more than one tracer, and is incompatible with
ptrace_scope=3.

Peter

> Perhaps a third option would be to put this into seccomp instead. One
> could imagine a flag addition to SECCOMP_SET_MODE_FILTER that
> would enable filtering for all otherwise emulated instructions (not just
> a CNTVCT_EL0 read) and an appropriate bpf program to decide
> allow/disallow/trap/signal/whatever. Of course, there are folks already
> unhappy with using seccomp for non-security purposes (as rr already
> does), so there may be some objection there.
>
> Similar concerns about making ptrace stops more complicated apply
> (basically the issue is that what you can do to a tracee at a
> ptrace stop very much depends on where exactly in the kernel the
> tracee is stopped, which ptrace doesn't always tell you,
> so to be fully correct, ptracers need to do some very elaborate
> state machine tracking).
>
> > How does rr work on non-x86 architectures? How does this interoperate
> > with AArch32?
>
> rr has significant microarchitecture dependence. It works on most modern x86
> chips (AMD worse than Intel, with the exception of some Intel Atom uarchs)
> and more recent high-end AArch64 chips, as long as the programs only use
> LSE atomics and not LL/SC. AArch32 is not supported. See here for a list:
>
> https://github.com/rr-debugger/rr/blob/822863c92d8f52778700f0375ac1b705193a2152/src/PerfCounters.cc#L198-L237
>
> Thanks,
> Keno

Patch

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f77371232d8c6..347bd3464fcbe 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -402,5 +402,10 @@  long get_tagged_addr_ctrl(struct task_struct *task);
 #define GET_TAGGED_ADDR_CTRL()		get_tagged_addr_ctrl(current)
 #endif
 
+int get_tsc_mode(unsigned long adr);
+int set_tsc_mode(unsigned int val);
+#define GET_TSC_CTL(adr)        get_tsc_mode((adr))
+#define SET_TSC_CTL(val)        set_tsc_mode((val))
+
 #endif /* __ASSEMBLY__ */
 #endif /* __ASM_PROCESSOR_H */
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index e72a3bf9e5634..1114c1c3300a1 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -81,6 +81,7 @@  void arch_setup_new_exec(void);
 #define TIF_SME			27	/* SME in use */
 #define TIF_SME_VL_INHERIT	28	/* Inherit SME vl_onexec across exec */
 #define TIF_KERNEL_FPSTATE	29	/* Task is in a kernel mode FPSIMD section */
+#define TIF_TSC_SIGSEGV		30	/* SIGSEGV on counter-timer access */
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
@@ -97,6 +98,7 @@  void arch_setup_new_exec(void);
 #define _TIF_SVE		(1 << TIF_SVE)
 #define _TIF_MTE_ASYNC_FAULT	(1 << TIF_MTE_ASYNC_FAULT)
 #define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_TSC_SIGSEGV	(1 << TIF_TSC_SIGSEGV)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 4ae31b7af6c31..1a2ae7830c179 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -43,6 +43,7 @@ 
 #include <linux/stacktrace.h>
 
 #include <asm/alternative.h>
+#include <asm/arch_timer.h>
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
 #include <asm/cacheflush.h>
@@ -472,27 +473,49 @@  static void entry_task_switch(struct task_struct *next)
 }
 
 /*
- * ARM erratum 1418040 handling, affecting the 32bit view of CNTVCT.
- * Ensure access is disabled when switching to a 32bit task, ensure
- * access is enabled when switching to a 64bit task.
+ * Handle sysreg updates for ARM erratum 1418040 which affects the 32bit view of
+ * CNTVCT, various other errata which require trapping all CNTVCT{,_EL0}
+ * accesses and prctl(PR_SET_TSC). Ensure access is disabled iff a workaround is
+ * required or PR_TSC_SIGSEGV is set.
  */
-static void erratum_1418040_thread_switch(struct task_struct *next)
+static void update_cntkctl_el1(struct task_struct *next)
 {
-	if (!IS_ENABLED(CONFIG_ARM64_ERRATUM_1418040) ||
-	    !this_cpu_has_cap(ARM64_WORKAROUND_1418040))
-		return;
+	struct thread_info *ti = task_thread_info(next);
 
-	if (is_compat_thread(task_thread_info(next)))
+	if (test_ti_thread_flag(ti, TIF_TSC_SIGSEGV) ||
+	    has_erratum_handler(read_cntvct_el0) ||
+	    (IS_ENABLED(CONFIG_ARM64_ERRATUM_1418040) &&
+	     this_cpu_has_cap(ARM64_WORKAROUND_1418040) &&
+	     is_compat_thread(ti)))
 		sysreg_clear_set(cntkctl_el1, ARCH_TIMER_USR_VCT_ACCESS_EN, 0);
 	else
 		sysreg_clear_set(cntkctl_el1, 0, ARCH_TIMER_USR_VCT_ACCESS_EN);
 }
 
-static void erratum_1418040_new_exec(void)
+static void cntkctl_thread_switch(struct task_struct *prev,
+				  struct task_struct *next)
+{
+	if ((read_ti_thread_flags(task_thread_info(prev)) &
+	     (_TIF_32BIT | _TIF_TSC_SIGSEGV)) !=
+	    (read_ti_thread_flags(task_thread_info(next)) &
+	     (_TIF_32BIT | _TIF_TSC_SIGSEGV)))
+		update_cntkctl_el1(next);
+}
+
+static int do_set_tsc_mode(unsigned int val)
 {
+	if (val == PR_TSC_SIGSEGV)
+		set_thread_flag(TIF_TSC_SIGSEGV);
+	else if (val == PR_TSC_ENABLE)
+		clear_thread_flag(TIF_TSC_SIGSEGV);
+	else
+		return -EINVAL;
+
 	preempt_disable();
-	erratum_1418040_thread_switch(current);
+	update_cntkctl_el1(current);
 	preempt_enable();
+
+	return 0;
 }
 
 /*
@@ -528,7 +551,7 @@  struct task_struct *__switch_to(struct task_struct *prev,
 	contextidr_thread_switch(next);
 	entry_task_switch(next);
 	ssbs_thread_switch(next);
-	erratum_1418040_thread_switch(next);
+	cntkctl_thread_switch(prev, next);
 	ptrauth_thread_switch_user(next);
 
 	/*
@@ -645,7 +668,7 @@  void arch_setup_new_exec(void)
 	current->mm->context.flags = mmflags;
 	ptrauth_thread_init_user();
 	mte_thread_init_user();
-	erratum_1418040_new_exec();
+	do_set_tsc_mode(PR_TSC_ENABLE);
 
 	if (task_spec_ssb_noexec(current)) {
 		arch_prctl_spec_ctrl_set(current, PR_SPEC_STORE_BYPASS,
@@ -754,3 +777,26 @@  int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
 	return prot;
 }
 #endif
+
+int get_tsc_mode(unsigned long adr)
+{
+	unsigned int val;
+
+	if (is_compat_task())
+		return -EINVAL;
+
+	if (test_thread_flag(TIF_TSC_SIGSEGV))
+		val = PR_TSC_SIGSEGV;
+	else
+		val = PR_TSC_ENABLE;
+
+	return put_user(val, (unsigned int __user *)adr);
+}
+
+int set_tsc_mode(unsigned int val)
+{
+	if (is_compat_task())
+		return -EINVAL;
+
+	return do_set_tsc_mode(val);
+}
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 215e6d7f2df8c..63fe64aafb7ad 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -601,18 +601,26 @@  static void ctr_read_handler(unsigned long esr, struct pt_regs *regs)
 
 static void cntvct_read_handler(unsigned long esr, struct pt_regs *regs)
 {
-	int rt = ESR_ELx_SYS64_ISS_RT(esr);
+	if (test_thread_flag(TIF_TSC_SIGSEGV)) {
+		force_sig(SIGSEGV);
+	} else {
+		int rt = ESR_ELx_SYS64_ISS_RT(esr);
 
-	pt_regs_write_reg(regs, rt, arch_timer_read_counter());
-	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+		pt_regs_write_reg(regs, rt, arch_timer_read_counter());
+		arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+	}
 }
 
 static void cntfrq_read_handler(unsigned long esr, struct pt_regs *regs)
 {
-	int rt = ESR_ELx_SYS64_ISS_RT(esr);
+	if (test_thread_flag(TIF_TSC_SIGSEGV)) {
+		force_sig(SIGSEGV);
+	} else {
+		int rt = ESR_ELx_SYS64_ISS_RT(esr);
 
-	pt_regs_write_reg(regs, rt, arch_timer_get_rate());
-	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+		pt_regs_write_reg(regs, rt, arch_timer_get_rate());
+		arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+	}
 }
 
 static void mrs_handler(unsigned long esr, struct pt_regs *regs)