Message ID | 20240221020258.1210148-1-jeremy.linton@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] arm64: syscall: Direct PRNG kstack randomization |
On Tue, Feb 20, 2024 at 08:02:58PM -0600, Jeremy Linton wrote:
> The existing arm64 stack randomization uses the kernel rng to acquire
> 5 bits of address space randomization. This is problematic because it
> creates non determinism in the syscall path when the rng needs to be
> generated or reseeded. This shows up as large tail latencies in some
> benchmarks and directly affects the minimum RT latencies as seen by
> cyclictest.

Some questions:

- for benchmarks, why not disable kstack randomization?
- if the existing pRNG reseeding is a problem here, why isn't it a
  problem in the many other places it's used?
- I thought the pRNG already did out-of-line reseeding?

> Other architectures are using timers/cycle counters for this function,
> which is sketchy from a randomization perspective because it should be
> possible to estimate this value from knowledge of the syscall return
> time, and from reading the current value of the timer/counters.

The expectation is that it would be, at best, unstable.

> So, a poor rng should be better than the cycle counter if it is hard
> to extract the stack offsets sufficiently to be able to detect the
> PRNG's period.
>
> So, we can potentially choose a 'better' or larger PRNG, going as far
> as using one of the CSPRNGs already in the kernel, but the overhead
> increases appropriately. Further, there are a few options for
> reseeding, possibly out of the syscall path, but is it even useful in
> this case?

I'd love to find a way to avoid a pRNG that could be reconstructed
given enough samples. (But perhaps this xorshift RNG resists that?)

-Kees
On Wed, Feb 21, 2024, at 03:02, Jeremy Linton wrote:
> The existing arm64 stack randomization uses the kernel rng to acquire
> 5 bits of address space randomization. This is problematic because it
> creates non determinism in the syscall path when the rng needs to be
> generated or reseeded. This shows up as large tail latencies in some
> benchmarks and directly affects the minimum RT latencies as seen by
> cyclictest.

Hi Jeremy,

I think from your description it's clear that reseeding the rng is a
problem for predictable RT latencies, but at the same time we have too
many things going on to fix this by special-casing kstack randomization
on one architecture:

- if reseeding latency is a problem, can we be sure that none of the
  other ~500 files containing a call to
  get_random_{bytes,long,u8,u16,u32,u64} are in an equally critical
  path for RT? Maybe those are just harder to hit?

- CONFIG_RANDOMIZE_KSTACK_OFFSET can already be disabled at compile
  time or at boot time to avoid the overhead entirely, which may be the
  right thing to do for users that care more deeply about syscall
  latencies than the fairly weak stack randomization. Most
  architectures don't implement it at all.

- It looks like the unpredictable latency from reseeding started with
  f5b98461cb81 ("random: use chacha20 for get_random_int/long"), which
  was intended to make get_random() faster and better, but it could be
  seen as a regression for real-time latency guarantees. If this turns
  out to be a general problem for RT workloads, the answer might be to
  bring back an option to make get_random() have predictable overhead
  everywhere rather than special-casing the stack randomization.

> Other architectures are using timers/cycle counters for this function,
> which is sketchy from a randomization perspective because it should be
> possible to estimate this value from knowledge of the syscall return
> time, and from reading the current value of the timer/counters.
>
> So, a poor rng should be better than the cycle counter if it is hard
> to extract the stack offsets sufficiently to be able to detect the
> PRNG's period.

I'm not convinced by the argument that the implementation you have
here is less predictable than the cycle counter, but I have not done
any particular research here and would rely on others to take a closer
look. The 32 bit global state variable does appear weak, and I know
that OTOH if we can show that a particular implementation is in fact
better than a cycle counter, I strongly think we should use the same
one across all architectures that currently use the cycle counter.

Arnd
Hi,

On Wed, Feb 21, 2024 at 7:33 AM Kees Cook <keescook@chromium.org> wrote:
> > +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> > +DEFINE_PER_CPU(u32, kstackrng);
> > +static u32 xorshift32(u32 state)
> > +{
> > +	/*
> > +	 * From top of page 4 of Marsaglia, "Xorshift RNGs"
> > +	 * This algorithm is intended to have a period 2^32 -1
> > +	 * And should not be used anywhere else outside of this
> > +	 * code path.
> > +	 */
> > +	state ^= state << 13;
> > +	state ^= state >> 17;
> > +	state ^= state << 5;
> > +	return state;
> > +}

Can we please *not* introduce yet another RNG? You can't just sprinkle
this stuff all over the place with no rhyme or reason.

If you need repeatable randomness, use prandom_u32_state() or similar.
If you need non-repeatable randomness, use get_random_bytes() or
similar.

If you think prandom_u32_state() is insufficient for some reason or
doesn't have some property or performance that you want, submit a
patch to make it better.

Looking at the actual intention here, of using repeatable randomness,
I find the intent pretty weird. Isn't the whole point of kstack
randomization that you can't predict it? If so, get_random_u*() is
what you want. If performance isn't sufficient, let's figure out some
way to improve performance. And as Kees said, if the point of this is
to have some repeatable benchmarks, maybe just don't enable the
security-intended code whose purpose is non-determinism? Both exploits
and now apparently benchmarks like determinism.

Jason
Hi,

Thanks for looking at this!

On 2/21/24 00:33, Kees Cook wrote:
> On Tue, Feb 20, 2024 at 08:02:58PM -0600, Jeremy Linton wrote:
>> The existing arm64 stack randomization uses the kernel rng to acquire
>> 5 bits of address space randomization. This is problematic because it
>> creates non determinism in the syscall path when the rng needs to be
>> generated or reseeded. This shows up as large tail latencies in some
>> benchmarks and directly affects the minimum RT latencies as seen by
>> cyclictest.
>
> Some questions:
>
> - for benchmarks, why not disable kstack randomization?

Benchmark isn't the right word here, maybe workload characterization?
It's hard to justify disabling, in a production environment, what is
perceived as a security feature and is enabled by default in
downstream distros.

> - if the existing pRNG reseeding is a problem here, why isn't it a
>   problem in the many other places it's used?

I don't have an answer for this, maybe it is? Our workloads/perf team,
which analyses end-user problems, tripped over this again and, with a
bit of digging, noticed it had been seen more than once with differing
workloads. It's maybe more of a problem here because it affects
everything making syscalls rather than just the subset of users
requesting things which rely on the rng?

Some of it could be the HW. The machine most of these tests have been
run on has lots of cores and can have fairly long cache line latency.

> - I thought the pRNG already did out-of-line reseeding?

Yes, in 6.2. My understanding from some traces is that the latency in
recent kernels is largely from crng_make_state() grabbing that global
lock and doing crng_fast_key_erasure() under it, which gets worse with
more cores active in the system. But now I'm a bit worried my own test
doesn't fully match the workload system, although I don't think they
have seen crng_reseed() in the syscall path.
>
>> Other architectures are using timers/cycle counters for this function,
>> which is sketchy from a randomization perspective because it should be
>> possible to estimate this value from knowledge of the syscall return
>> time, and from reading the current value of the timer/counters.
>
> The expectation is that it would be, at best, unstable.
>
>> So, a poor rng should be better than the cycle counter if it is hard
>> to extract the stack offsets sufficiently to be able to detect the
>> PRNG's period.
>>
>> So, we can potentially choose a 'better' or larger PRNG, going as far
>> as using one of the CSPRNGs already in the kernel, but the overhead
>> increases appropriately. Further, there are a few options for
>> reseeding, possibly out of the syscall path, but is it even useful in
>> this case?
>
> I'd love to find a way to avoid a pRNG that could be reconstructed
> given enough samples. (But perhaps this xorshift RNG resists that?)

Agree. I don't think it does.

Thanks again,
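
The reconstruction concern raised in this exchange can be illustrated in userspace. The sketch below is not part of the patch. It assumes a hypothetical attacker who can observe bits [8:4] of every xorshift32 output (the five bits the patch comment says end up in SP[8:4]) for 32 consecutive syscalls on one CPU; the names `observe`, `recover_state` and `demo_roundtrip` are made up for illustration. Because xorshift32 uses only shifts and XORs, every observed bit is a linear function over GF(2) of the 32 state bits, so Gaussian elimination recovers the state.

```c
#include <stdint.h>

#define STEPS     32	/* number of observed outputs (assumption) */
#define OBS_SHIFT 4	/* assumed leak: state bits [8:4] per output */
#define OBS_BITS  5

/* Same update as the xorshift32() in the patch. */
static uint32_t xorshift32(uint32_t s)
{
	s ^= s << 13;
	s ^= s >> 17;
	s ^= s << 5;
	return s;
}

/* What the hypothetical attacker learns from one output. */
static uint32_t observe(uint32_t state)
{
	return (state >> OBS_SHIFT) & ((1u << OBS_BITS) - 1);
}

/* XOR of all 32 bits of v. */
static uint32_t parity32(uint32_t v)
{
	v ^= v >> 16;
	v ^= v >> 8;
	v ^= v >> 4;
	v ^= v >> 2;
	v ^= v >> 1;
	return v & 1u;
}

/*
 * Recover the state that preceded obs[0]. Returns 0 if the equations
 * do not determine it (0 is never a valid xorshift32 state).
 */
static uint32_t recover_state(const uint32_t obs[STEPS])
{
	uint32_t basis[32], row[32] = {0}, rhs[32] = {0}, state = 0;

	for (int i = 0; i < 32; i++)
		basis[i] = 1u << i;

	for (int t = 0; t < STEPS; t++) {
		/* Linearity: follow each unit vector through the generator. */
		for (int i = 0; i < 32; i++)
			basis[i] = xorshift32(basis[i]);

		for (int j = 0; j < OBS_BITS; j++) {
			uint32_t mask = 0, bit = (obs[t] >> j) & 1u;

			/* Which initial-state bits feed this observed bit? */
			for (int i = 0; i < 32; i++)
				mask |= ((basis[i] >> (j + OBS_SHIFT)) & 1u) << i;

			/* Insert the equation into a GF(2) echelon basis. */
			for (int p = 31; p >= 0 && mask; p--) {
				if (!((mask >> p) & 1u))
					continue;
				if (!row[p]) {
					row[p] = mask;
					rhs[p] = bit;
					break;
				}
				mask ^= row[p];
				bit ^= rhs[p];
			}
		}
	}

	/* Back-substitute from the lowest pivot upwards. */
	for (int p = 0; p < 32; p++) {
		if (!row[p])
			return 0;
		state |= (rhs[p] ^ parity32(row[p] & state)) << p;
	}
	return state;
}

/* Round trip: produce observations from a seed, then recover the seed. */
static uint32_t demo_roundtrip(uint32_t seed)
{
	uint32_t obs[STEPS], s = seed;

	for (int t = 0; t < STEPS; t++) {
		s = xorshift32(s);
		obs[t] = observe(s);
	}
	return recover_state(obs);
}
```

Calling `demo_roundtrip()` with a nonzero seed returns that seed, after which every later offset on that CPU is predictable. Whether an attacker can actually collect 32 consecutive offsets from one CPU is a separate question that this sketch does not address.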
Hi,

Thanks for looking at this.

On 2/21/24 06:44, Jason A. Donenfeld wrote:
> Hi,
>
> On Wed, Feb 21, 2024 at 7:33 AM Kees Cook <keescook@chromium.org> wrote:
>>> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
>>> +DEFINE_PER_CPU(u32, kstackrng);
>>> +static u32 xorshift32(u32 state)
>>> +{
>>> +	/*
>>> +	 * From top of page 4 of Marsaglia, "Xorshift RNGs"
>>> +	 * This algorithm is intended to have a period 2^32 -1
>>> +	 * And should not be used anywhere else outside of this
>>> +	 * code path.
>>> +	 */
>>> +	state ^= state << 13;
>>> +	state ^= state >> 17;
>>> +	state ^= state << 5;
>>> +	return state;
>>> +}
>
> Can we please *not* introduce yet another RNG? You can't just sprinkle
> this stuff all over the place with no rhyme or reason.
>
> If you need repeatable randomness, use prandom_u32_state() or similar.
> If you need non-repeatable randomness, use get_random_bytes() or
> similar.

Sure, prandom_u32_state() should have a similar effect, being a bit
slower and a bit better due to the extra hidden state.

>
> If you think prandom_u32_state() is insufficient for some reason or
> doesn't have some property or performance that you want, submit a
> patch to make it better.
>
> Looking at the actual intention here, of using repeatable randomness,
> I find the intent pretty weird. Isn't the whole point of kstack
> randomization that you can't predict it? If so, get_random_u*() is
> what you want. If performance isn't sufficient, let's figure out some

There isn't anything wrong with get_random_u16() from a kstack
randomization standpoint, except for the latency spikes of course.

> way to improve performance. And as Kees said, if the point of this is
> to have some repeatable benchmarks, maybe just don't enable the
> security-intended code whose purpose is non-determinism? Both exploits
> and now apparently benchmarks like determinism.

As I mentioned in the other email, benchmark is probably the wrong
word. It's about better QoS response time distributions for a given
workload. And it's not strictly RT kernel latency tests, but normal
memcached-style workloads on !RT kernels as well.
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 9a70d9746b66..70143cb8c7be 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -37,6 +37,59 @@ static long __invoke_syscall(struct pt_regs *regs, syscall_fn_t syscall_fn)
 	return syscall_fn(regs);
 }
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+DEFINE_PER_CPU(u32, kstackrng);
+static u32 xorshift32(u32 state)
+{
+	/*
+	 * From top of page 4 of Marsaglia, "Xorshift RNGs"
+	 * This algorithm is intended to have a period 2^32 -1
+	 * And should not be used anywhere else outside of this
+	 * code path.
+	 */
+	state ^= state << 13;
+	state ^= state >> 17;
+	state ^= state << 5;
+	return state;
+}
+
+static u16 kstack_rng(void)
+{
+	u32 rng = raw_cpu_read(kstackrng);
+
+	rng = xorshift32(rng);
+	raw_cpu_write(kstackrng, rng);
+	return rng & 0x1ff;
+}
+
+/* Should we reseed? */
+static int kstack_rng_setup(unsigned int cpu)
+{
+	u32 rng_seed;
+
+	do {
+		rng_seed = get_random_u32();
+	} while (!rng_seed);
+	raw_cpu_write(kstackrng, rng_seed);
+	return 0;
+}
+
+static int kstack_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:kstackrandomize",
+				kstack_rng_setup, NULL);
+	if (ret < 0)
+		pr_err("kstack: failed to register rng callbacks.\n");
+	return 0;
+}
+
+arch_initcall(kstack_init);
+#else
+static u16 kstack_rng(void) { return 0; }
+#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
+
 static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
 			   unsigned int sc_nr,
 			   const syscall_fn_t syscall_table[])
@@ -66,7 +119,7 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
 	 *
 	 * The resulting 5 bits of entropy is seen in SP[8:4].
 	 */
-	choose_random_kstack_offset(get_random_u16() & 0x1FF);
+	choose_random_kstack_offset(kstack_rng());
 }
 
 static inline bool has_syscall_work(unsigned long flags)
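
For readers who want to poke at the generator outside the kernel, here is a minimal userspace sketch of the same update step. It is an illustration only: `next_offset` is a hypothetical stand-in for the patch's `kstack_rng()`, with a plain pointer replacing the per-CPU variable. It also shows why `kstack_rng_setup()` loops until the seed is nonzero: zero is a fixed point of xorshift32, so a zero seed would yield a constant offset of 0 forever.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copy of the patch's generator (shift triple 13, 17, 5). */
static uint32_t xorshift32(uint32_t state)
{
	state ^= state << 13;
	state ^= state >> 17;
	state ^= state << 5;
	return state;
}

/*
 * Hypothetical stand-in for kstack_rng(): advance the caller-owned
 * state and keep the low 9 bits, as the patch does with 0x1ff.
 */
static uint16_t next_offset(uint32_t *state)
{
	assert(*state != 0);	/* zero never leaves zero, hence the seed loop */
	*state = xorshift32(*state);
	return (uint16_t)(*state & 0x1ff);
}
```

Starting from a state of 1, the first call moves the state to 0x42021 and returns 0x21.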
The existing arm64 stack randomization uses the kernel rng to acquire
5 bits of address space randomization. This is problematic because it
creates non determinism in the syscall path when the rng needs to be
generated or reseeded. This shows up as large tail latencies in some
benchmarks and directly affects the minimum RT latencies as seen by
cyclictest.

Other architectures are using timers/cycle counters for this function,
which is sketchy from a randomization perspective because it should be
possible to estimate this value from knowledge of the syscall return
time, and from reading the current value of the timer/counters.

So, a poor rng should be better than the cycle counter if it is hard
to extract the stack offsets sufficiently to be able to detect the
PRNG's period.

So, we can potentially choose a 'better' or larger PRNG, going as far
as using one of the CSPRNGs already in the kernel, but the overhead
increases appropriately. Further, there are a few options for
reseeding, possibly out of the syscall path, but is it even useful in
this case?

Reported-by: James Yang <james.yang@arm.com>
Reported-by: Shiyou Huang <shiyou.huang@arm.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
---
 arch/arm64/kernel/syscall.c | 55 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)
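
The "5 bits" figure can be checked with a little arithmetic. The sketch below assumes that the stack pointer is kept 16-byte aligned, so the low 4 bits of the 9-bit value masked with 0x1FF never reach SP; that alignment is an assumption of this example, and the helper names are hypothetical. Counting the distinct surviving values gives 32, i.e. 2^5, matching the "SP[8:4]" comment in the patch context.

```c
#include <stdint.h>

/* Assumption: SP stays 16-byte aligned, so offset bits [3:0] are lost. */
static unsigned int effective_offset(unsigned int raw)
{
	return (raw & 0x1FF) & ~0xFu;
}

/* Count the distinct stack offsets reachable from all 9-bit inputs. */
static unsigned int count_distinct_offsets(void)
{
	unsigned char seen[0x200] = {0};
	unsigned int n = 0;

	for (unsigned int raw = 0; raw < 0x200; raw++) {
		unsigned int off = effective_offset(raw);

		if (!seen[off]) {
			seen[off] = 1;
			n++;
		}
	}
	return n;
}
```

Here `count_distinct_offsets()` returns 32, so under this assumption the mask yields five usable bits regardless of which generator feeds it.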