Message ID | 20240427052754.2014024-1-pcc@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | arm64: Implement prctl(PR_{G,S}ET_TSC) | expand |
On Sat, 27 Apr 2024 06:27:51 +0100, Peter Collingbourne <pcc@google.com> wrote: > > On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and > CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to > implement various erratum workarounds, check whether the CPU needs > a workaround whenever we potentially need to change it. Why would we ever consider preventing access to CNTVTL_EL0? This register is part of the arm64 ABI, together with CNTFRQ_EL0. Disabling it also prevents the VDSO from working correctly, making something as simple as getttimeofday() unexpectedly fail. I'm sure you have some rationale behind it, but it is nowhere to be found above, and I cannot see why we'd want to mimic whatever x86 does. Thanks, M.
On Sun, Apr 28, 2024 at 3:37 AM Marc Zyngier <maz@kernel.org> wrote: > > On Sat, 27 Apr 2024 06:27:51 +0100, > Peter Collingbourne <pcc@google.com> wrote: > > > > On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and > > CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to > > implement various erratum workarounds, check whether the CPU needs > > a workaround whenever we potentially need to change it. > > Why would we ever consider preventing access to CNTVTL_EL0? This > register is part of the arm64 ABI, together with CNTFRQ_EL0. > Disabling it also prevents the VDSO from working correctly, making > something as simple as getttimeofday() unexpectedly fail. > > I'm sure you have some rationale behind it, but it is nowhere to be > found above, and I cannot see why we'd want to mimic whatever x86 > does. Hi Marc, This is needed for a correct implementation of non-instrumenting record-replay debugging on arm64 (i.e. rr; https://rr-project.org/). rr must trap and record any sources of non-determinism from the userspace program's perspective so it can be replayed later. This includes the results of syscalls as well as the results of access to architected timers exposed directly to the program. rr traps RDTSC on x86 for the same reason. I'd be happy to add that to the commit message for v2. Peter
On Sun, 28 Apr 2024 19:47:56 +0100, Peter Collingbourne <pcc@google.com> wrote: > > On Sun, Apr 28, 2024 at 3:37 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Sat, 27 Apr 2024 06:27:51 +0100, > > Peter Collingbourne <pcc@google.com> wrote: > > > > > > On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and > > > CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to > > > implement various erratum workarounds, check whether the CPU needs > > > a workaround whenever we potentially need to change it. > > > > Why would we ever consider preventing access to CNTVTL_EL0? This > > register is part of the arm64 ABI, together with CNTFRQ_EL0. > > Disabling it also prevents the VDSO from working correctly, making > > something as simple as getttimeofday() unexpectedly fail. > > > > I'm sure you have some rationale behind it, but it is nowhere to be > > found above, and I cannot see why we'd want to mimic whatever x86 > > does. > > Hi Marc, > > This is needed for a correct implementation of non-instrumenting > record-replay debugging on arm64 (i.e. rr; https://rr-project.org/). > rr must trap and record any sources of non-determinism from the > userspace program's perspective so it can be replayed later. This > includes the results of syscalls as well as the results of access to > architected timers exposed directly to the program. rr traps RDTSC on > x86 for the same reason. It seems to me that this sort of "trap and inspect" behaviour is in the realm of ptrace(), and not that of prctl(), because I can't imagine the debugged program calling that by itself. My rationale for this is that on x86, TSC wasn't always present. It was only added fairly late in the game and thus userspace couldn't always rely on it being present. Which is pretty different from arm64, which has always had CNTVCT_EL0 and co an integral part of the ABI. Giving a way to limit the current ABI seems at best odd. On the contrary, ptrace() does what it says on the tin: it intercepts what the traced process does. How does rr work on non-x86 architectures? How does this interoperate with AArch32? > I'd be happy to add that to the commit message for v2. Please, in any case. Possibly with a reference to 8fb402bccf203 so that people can perform some archaeology on the origins of this thing. Thanks, M.
First a note that this has been previously discussed on this list for the same motivation and with the same questions asked, so let me link that first: Link: https://lore.kernel.org/linux-arm-kernel/CAP045ApiMSvP--f2E0=VdMbjE8oibvy921m8JASf4kaCCuU2RA@mail.gmail.com/T/ > It seems to me that this sort of "trap and inspect" behaviour is in > the realm of ptrace(), and not that of prctl(), because I can't > imagine the debugged program calling that by itself. In the rr use case, this is indeed used to force a ptrace signal stop for trap/inspect (or emulate in replay mode). Of course the history of PR_SET_TSC is complicated and was not originally intended for this use case. Rather, it was intended as a hardening mechanism against cache-timing/speculative execution attacks (this was pre-Spectre, so all the discussion at the time is a bit theoretical). Of course, the Spectre experience has shown that you don't really actually need architectural timers for any of this, so perhaps it wasn't all that useful in the first place. I think this could reasonably be made a PTRACE_EVENT, but that would not be useable for hardening if somebody wanted to do that in the future (I am not personally aware of any such request, but it doesn't seem unreasonable). Of course, my concerns about adding yet another kind of ptrace stop mentioned last time still exist. Perhaps a third option would be to put this into seccomp instead. One could imagine a flag addition to SECCOMP_SET_MODE_FILTER that would enable filtering for all otherwise emulated instructions (not just a CNTVCT_EL0 read) and an appropriate bpf program to decide allow/disallow/trap/signal/whatever. Of course, there are folks already unhappy with using seccomp for non-security purposes (as rr already does), so there may be some objection there. Similar concerns about making ptrace stops more complicated apply (basically the issue is that what you can do to a tracee at a ptrace stop very much depends on where exactly in the kernel the tracee is stopped, which ptrace doesn't always tell you, so to be fully correct, ptracers need to do some very elaborate state machine tracking). > How does rr work on non-x86 architectures? How does this interoperate > with AArch32? rr has significant microarchitecture dependence. It works on most modern x86 chips (AMD worse than Intel with the exception of some Intel Atom uarchs) and more recent high-end AArch64 chips as long as the programs only use lse atomics and not llsc. AArch32 is not supported. See here for a list: https://github.com/rr-debugger/rr/blob/822863c92d8f52778700f0375ac1b705193a2152/src/PerfCounters.cc#L198-L237 Thanks, Keno
On Mon, Apr 29, 2024 at 1:36 AM Keno Fischer <keno@juliahub.com> wrote: > > First a note that this has been previously discussed on this list for > the same motivation and with the same questions asked, so let > me link that first: > > Link: https://lore.kernel.org/linux-arm-kernel/CAP045ApiMSvP--f2E0=VdMbjE8oibvy921m8JASf4kaCCuU2RA@mail.gmail.com/T/ Ah, I wasn't aware that this was previously discussed. I agree with your points from that thread regarding prctl vs ptrace. > > It seems to me that this sort of "trap and inspect" behaviour is in > > the realm of ptrace(), and not that of prctl(), because I can't > > imagine the debugged program calling that by itself. > > In the rr use case, this is indeed used to force a ptrace signal stop > for trap/inspect (or emulate in replay mode). Of course the history > of PR_SET_TSC is complicated and was not originally intended for > this use case. Rather, it was intended as a hardening mechanism > against cache-timing/speculative execution attacks (this was > pre-Spectre, so all the discussion at the time is a bit theoretical). > Of course, the Spectre experience has shown that you don't really > actually need architectural timers for any of this, so perhaps it wasn't > all that useful in the first place. That was what I originally thought as well, but then I went back to look at the history: https://lore.kernel.org/all/Pine.GSO.4.56.0804131126080.28138@keg.few.vu.nl/ It looks like the original author was also using this for deterministic replay. So rr is using this prctl for its original purpose. It's also worth noting that the name of the prctl was chosen with the intent that it could be implemented on other architectures. https://lore.kernel.org/all/4813B1EF.5090806@zytor.com/ > I think this could reasonably be made a PTRACE_EVENT, but that would > not be useable for hardening if somebody wanted to do that in the future > (I am not personally aware of any such request, but it doesn't seem > unreasonable). Of course, my concerns about adding yet another kind > of ptrace stop mentioned last time still exist. Yes, in general an in-process control seems more useful than an out-of-process control, since anything that you would be able to do with ptrace could also be done with prctl (tracer can inject a call to the prctl and handle signal-delivery-stops), and it avoids needing an additional process (which will complicate debugging of the ptraced process since it cannot have more than one tracer, and will be incompatible with ptrace_scope=3) in cases where that is not otherwise necessary. Peter > Perhaps a third option would be to put this into seccomp instead. One > could imagine a flag addition to SECCOMP_SET_MODE_FILTER that > would enable filtering for all otherwise emulated instructions (not just > a CNTVCT_EL0 read) and an appropriate bpf program to decide > allow/disallow/trap/signal/whatever. Of course, there are folks already > unhappy with using seccomp for non-security purposes (as rr already > does), so there may be some objection there. > > Similar concerns about making ptrace stops more complicated apply > (basically the issue is that what you can do to a tracee at a > ptrace stop very much depends on where exactly in the kernel the > tracee is stopped, which ptrace doesn't always tell you, > so to be fully correct, ptracers need to do some very elaborate > state machine tracking). > > > How does rr work on non-x86 architectures? How does this interoperate > > with AArch32? > > rr has significant microarchitecture dependence. It works on most modern x86 > chips (AMD worse than Intel with the exception of some Intel Atom uarchs) > and more recent high-end AArch64 chips as long as the programs only use > lse atomics and not llsc. AArch32 is not supported. See here for a list: > > https://github.com/rr-debugger/rr/blob/822863c92d8f52778700f0375ac1b705193a2152/src/PerfCounters.cc#L198-L237 > > Thanks, > Keno
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index f77371232d8c6..347bd3464fcbe 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -402,5 +402,10 @@ long get_tagged_addr_ctrl(struct task_struct *task); #define GET_TAGGED_ADDR_CTRL() get_tagged_addr_ctrl(current) #endif +int get_tsc_mode(unsigned long adr); +int set_tsc_mode(unsigned int val); +#define GET_TSC_CTL(adr) get_tsc_mode((adr)) +#define SET_TSC_CTL(val) set_tsc_mode((val)) + #endif /* __ASSEMBLY__ */ #endif /* __ASM_PROCESSOR_H */ diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index e72a3bf9e5634..1114c1c3300a1 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -81,6 +81,7 @@ void arch_setup_new_exec(void); #define TIF_SME 27 /* SME in use */ #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */ #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */ +#define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */ #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) @@ -97,6 +98,7 @@ void arch_setup_new_exec(void); #define _TIF_SVE (1 << TIF_SVE) #define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT) #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL) +#define _TIF_TSC_SIGSEGV (1 << TIF_TSC_SIGSEGV) #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 4ae31b7af6c31..1a2ae7830c179 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -43,6 +43,7 @@ #include <linux/stacktrace.h> #include <asm/alternative.h> +#include <asm/arch_timer.h> #include <asm/compat.h> #include <asm/cpufeature.h> #include <asm/cacheflush.h> @@ -472,27 +473,49 @@ static void entry_task_switch(struct task_struct *next) } /* - * ARM erratum 1418040 handling, affecting the 32bit view of CNTVCT. - * Ensure access is disabled when switching to a 32bit task, ensure - * access is enabled when switching to a 64bit task. + * Handle sysreg updates for ARM erratum 1418040 which affects the 32bit view of + * CNTVCT, various other errata which require trapping all CNTVCT{,_EL0} + * accesses and prctl(PR_SET_TSC). Ensure access is disabled iff a workaround is + * required or PR_TSC_SIGSEGV is set. */ -static void erratum_1418040_thread_switch(struct task_struct *next) +static void update_cntkctl_el1(struct task_struct *next) { - if (!IS_ENABLED(CONFIG_ARM64_ERRATUM_1418040) || - !this_cpu_has_cap(ARM64_WORKAROUND_1418040)) - return; + struct thread_info *ti = task_thread_info(next); - if (is_compat_thread(task_thread_info(next))) + if (test_ti_thread_flag(ti, TIF_TSC_SIGSEGV) || + has_erratum_handler(read_cntvct_el0) || + (IS_ENABLED(CONFIG_ARM64_ERRATUM_1418040) && + this_cpu_has_cap(ARM64_WORKAROUND_1418040) && + is_compat_thread(ti))) sysreg_clear_set(cntkctl_el1, ARCH_TIMER_USR_VCT_ACCESS_EN, 0); else sysreg_clear_set(cntkctl_el1, 0, ARCH_TIMER_USR_VCT_ACCESS_EN); } -static void erratum_1418040_new_exec(void) +static void cntkctl_thread_switch(struct task_struct *prev, + struct task_struct *next) +{ + if ((read_ti_thread_flags(task_thread_info(prev)) & + (_TIF_32BIT | _TIF_TSC_SIGSEGV)) != + (read_ti_thread_flags(task_thread_info(next)) & + (_TIF_32BIT | _TIF_TSC_SIGSEGV))) + update_cntkctl_el1(next); +} + +static int do_set_tsc_mode(unsigned int val) { + if (val == PR_TSC_SIGSEGV) + set_thread_flag(TIF_TSC_SIGSEGV); + else if (val == PR_TSC_ENABLE) + clear_thread_flag(TIF_TSC_SIGSEGV); + else + return -EINVAL; + preempt_disable(); - erratum_1418040_thread_switch(current); + update_cntkctl_el1(current); preempt_enable(); + + return 0; } /* @@ -528,7 +551,7 @@ struct task_struct *__switch_to(struct task_struct *prev, contextidr_thread_switch(next); entry_task_switch(next); ssbs_thread_switch(next); - erratum_1418040_thread_switch(next); + cntkctl_thread_switch(prev, next); ptrauth_thread_switch_user(next); /* @@ -645,7 +668,7 @@ void arch_setup_new_exec(void) current->mm->context.flags = mmflags; ptrauth_thread_init_user(); mte_thread_init_user(); - erratum_1418040_new_exec(); + do_set_tsc_mode(PR_TSC_ENABLE); if (task_spec_ssb_noexec(current)) { arch_prctl_spec_ctrl_set(current, PR_SPEC_STORE_BYPASS, @@ -754,3 +777,26 @@ int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state, return prot; } #endif + +int get_tsc_mode(unsigned long adr) +{ + unsigned int val; + + if (is_compat_task()) + return -EINVAL; + + if (test_thread_flag(TIF_TSC_SIGSEGV)) + val = PR_TSC_SIGSEGV; + else + val = PR_TSC_ENABLE; + + return put_user(val, (unsigned int __user *)adr); +} + +int set_tsc_mode(unsigned int val) +{ + if (is_compat_task()) + return -EINVAL; + + return do_set_tsc_mode(val); +} diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c index 215e6d7f2df8c..63fe64aafb7ad 100644 --- a/arch/arm64/kernel/traps.c +++ b/arch/arm64/kernel/traps.c @@ -601,18 +601,26 @@ static void ctr_read_handler(unsigned long esr, struct pt_regs *regs) static void cntvct_read_handler(unsigned long esr, struct pt_regs *regs) { - int rt = ESR_ELx_SYS64_ISS_RT(esr); + if (test_thread_flag(TIF_TSC_SIGSEGV)) { + force_sig(SIGSEGV); + } else { + int rt = ESR_ELx_SYS64_ISS_RT(esr); - pt_regs_write_reg(regs, rt, arch_timer_read_counter()); - arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); + pt_regs_write_reg(regs, rt, arch_timer_read_counter()); + arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); + } } static void cntfrq_read_handler(unsigned long esr, struct pt_regs *regs) { - int rt = ESR_ELx_SYS64_ISS_RT(esr); + if (test_thread_flag(TIF_TSC_SIGSEGV)) { + force_sig(SIGSEGV); + } else { + int rt = ESR_ELx_SYS64_ISS_RT(esr); - pt_regs_write_reg(regs, rt, arch_timer_get_rate()); - arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); + pt_regs_write_reg(regs, rt, arch_timer_get_rate()); + arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); + } } static void mrs_handler(unsigned long esr, struct pt_regs *regs)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to implement various erratum workarounds, check whether the CPU needs a workaround whenever we potentially need to change it. Signed-off-by: Peter Collingbourne <pcc@google.com> Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421 --- arch/arm64/include/asm/processor.h | 5 ++ arch/arm64/include/asm/thread_info.h | 2 + arch/arm64/kernel/process.c | 70 +++++++++++++++++++++++----- arch/arm64/kernel/traps.c | 20 +++++--- 4 files changed, 79 insertions(+), 18 deletions(-)