Message ID | 20240131155929.169961-4-alexghiti@rivosinc.com (mailing list archive) |
---|---|
State | New, archived |
Series | Svvptc extension to remove preventive sfence.vma |
Hi Alexandre, On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote: > > In 6.5, we removed the vmalloc fault path because that can't work (see > [1] [2]). Then in order to make sure that new page table entries were > seen by the page table walker, we had to preventively emit a sfence.vma > on all harts [3] but this solution is very costly since it relies on IPI. > > And even there, we could end up in a loop of vmalloc faults if a vmalloc > allocation is done in the IPI path (for example if it is traced, see > [4]), which could result in a kernel stack overflow. > > Those preventive sfence.vma needed to be emitted because: > > - if the uarch caches invalid entries, the new mapping may not be > observed by the page table walker and an invalidation may be needed. > - if the uarch does not cache invalid entries, a reordered access > could "miss" the new mapping and traps: in that case, we would actually > only need to retry the access, no sfence.vma is required. > > So this patch removes those preventive sfence.vma and actually handles > the possible (and unlikely) exceptions. And since the kernel stacks > mappings lie in the vmalloc area, this handling must be done very early > when the trap is taken, at the very beginning of handle_exception: this > also rules out the vmalloc allocations in the fault path. 
> > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1] > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2] > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3] > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4] > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> > --- > arch/riscv/include/asm/cacheflush.h | 18 +++++- > arch/riscv/include/asm/thread_info.h | 5 ++ > arch/riscv/kernel/asm-offsets.c | 5 ++ > arch/riscv/kernel/entry.S | 84 ++++++++++++++++++++++++++++ > arch/riscv/mm/init.c | 2 + > 5 files changed, 113 insertions(+), 1 deletion(-) > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h > index a129dac4521d..b0d631701757 100644 > --- a/arch/riscv/include/asm/cacheflush.h > +++ b/arch/riscv/include/asm/cacheflush.h > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page) > flush_icache_mm(vma->vm_mm, 0) > > #ifdef CONFIG_64BIT > -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end) > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1]; > +extern char _end[]; > +#define flush_cache_vmap flush_cache_vmap > +static inline void flush_cache_vmap(unsigned long start, unsigned long end) > +{ > + if (is_vmalloc_or_module_addr((void *)start)) { > + int i; > + > + /* > + * We don't care if concurrently a cpu resets this value since > + * the only place this can happen is in handle_exception() where > + * an sfence.vma is emitted. 
> + */ > + for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i) > + new_vmalloc[i] = -1ULL; > + } > +} > #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end) > #endif > > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h > index 5d473343634b..32631acdcdd4 100644 > --- a/arch/riscv/include/asm/thread_info.h > +++ b/arch/riscv/include/asm/thread_info.h > @@ -60,6 +60,11 @@ struct thread_info { > void *scs_base; > void *scs_sp; > #endif > + /* > + * Used in handle_exception() to save a0, a1 and a2 before knowing if we > + * can access the kernel stack. > + */ > + unsigned long a0, a1, a2; > }; > > #ifdef CONFIG_SHADOW_CALL_STACK > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c > index a03129f40c46..939ddc0e3c6e 100644 > --- a/arch/riscv/kernel/asm-offsets.c > +++ b/arch/riscv/kernel/asm-offsets.c > @@ -35,6 +35,8 @@ void asm_offsets(void) > OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]); > OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]); > OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]); > + > + OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu); > OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags); > OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count); > OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp); > @@ -42,6 +44,9 @@ void asm_offsets(void) > #ifdef CONFIG_SHADOW_CALL_STACK > OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp); > #endif > + OFFSET(TASK_TI_A0, task_struct, thread_info.a0); > + OFFSET(TASK_TI_A1, task_struct, thread_info.a1); > + OFFSET(TASK_TI_A2, task_struct, thread_info.a2); > > OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu); > OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]); > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S > index 9d1a305d5508..c1ffaeaba7aa 100644 > --- a/arch/riscv/kernel/entry.S > +++ b/arch/riscv/kernel/entry.S > @@ -19,6 +19,78 @@ > > .section 
.irqentry.text, "ax" > > +.macro new_vmalloc_check > + REG_S a0, TASK_TI_A0(tp) > + REG_S a1, TASK_TI_A1(tp) > + REG_S a2, TASK_TI_A2(tp) > + > + csrr a0, CSR_CAUSE > + /* Exclude IRQs */ > + blt a0, zero, _new_vmalloc_restore_context > + /* Only check new_vmalloc if we are in page/protection fault */ > + li a1, EXC_LOAD_PAGE_FAULT > + beq a0, a1, _new_vmalloc_kernel_address > + li a1, EXC_STORE_PAGE_FAULT > + beq a0, a1, _new_vmalloc_kernel_address > + li a1, EXC_INST_PAGE_FAULT > + bne a0, a1, _new_vmalloc_restore_context > + > +_new_vmalloc_kernel_address: > + /* Is it a kernel address? */ > + csrr a0, CSR_TVAL > + bge a0, zero, _new_vmalloc_restore_context > + > + /* Check if a new vmalloc mapping appeared that could explain the trap */ > + > + /* > + * Computes: > + * a0 = &new_vmalloc[BIT_WORD(cpu)] > + * a1 = BIT_MASK(cpu) > + */ > + REG_L a2, TASK_TI_CPU(tp) > + /* > + * Compute the new_vmalloc element position: > + * (cpu / 64) * 8 = (cpu >> 6) << 3 > + */ > + srli a1, a2, 6 > + slli a1, a1, 3 > + la a0, new_vmalloc > + add a0, a0, a1 > + /* > + * Compute the bit position in the new_vmalloc element: > + * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6 > + * = cpu - ((cpu >> 6) << 3) << 3 > + */ > + slli a1, a1, 3 > + sub a1, a2, a1 > + /* Compute the "get mask": 1 << bit_pos */ > + li a2, 1 > + sll a1, a2, a1 > + > + /* Check the value of new_vmalloc for this cpu */ > + REG_L a2, 0(a0) > + and a2, a2, a1 > + beq a2, zero, _new_vmalloc_restore_context > + > + /* Atomically reset the current cpu bit in new_vmalloc */ > + amoxor.w a0, a1, (a0) > + > + /* Only emit a sfence.vma if the uarch caches invalid entries */ > + ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1) > + > + REG_L a0, TASK_TI_A0(tp) > + REG_L a1, TASK_TI_A1(tp) > + REG_L a2, TASK_TI_A2(tp) > + csrw CSR_SCRATCH, x0 > + sret > + > +_new_vmalloc_restore_context: > + REG_L a0, TASK_TI_A0(tp) > + REG_L a1, TASK_TI_A1(tp) > + REG_L a2, TASK_TI_A2(tp) > +.endm > + > + 
> SYM_CODE_START(handle_exception) > /* > * If coming from userspace, preserve the user thread pointer and load > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception) > > .Lrestore_kernel_tpsp: > csrr tp, CSR_SCRATCH > + > + /* > + * The RISC-V kernel does not eagerly emit a sfence.vma after each > + * new vmalloc mapping, which may result in exceptions: > + * - if the uarch caches invalid entries, the new mapping would not be > + * observed by the page table walker and an invalidation is needed. > + * - if the uarch does not cache invalid entries, a reordered access > + * could "miss" the new mapping and traps: in that case, we only need > + * to retry the access, no sfence.vma is required. > + */ > + new_vmalloc_check > + > REG_S sp, TASK_TI_KERNEL_SP(tp) > > #ifdef CONFIG_VMAP_STACK > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > index eafc4c2200f2..54c9fdeda11e 100644 > --- a/arch/riscv/mm/init.c > +++ b/arch/riscv/mm/init.c > @@ -36,6 +36,8 @@ > > #include "../kernel/head.h" > > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1]; > + > struct kernel_mapping kernel_map __ro_after_init; > EXPORT_SYMBOL(kernel_map); > #ifdef CONFIG_XIP_KERNEL > -- > 2.39.2 > > Can we consider using new_vmalloc as a percpu variable, so that we don't need to add a0/1/2 in thread_info? Also, try not to do too much calculation logic in new_vmalloc_check, after all, handle_exception is a high-frequency path. In this case, can we consider writing new_vmalloc_check in C language to increase readability? Thanks, Yunhui
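For readers following the assembly, the per-CPU bit bookkeeping in the patch can be modelled in plain C. This is only an illustrative user-space sketch, not kernel code: NR_CPUS_SKETCH, mark_new_vmalloc() and test_and_clear_new_vmalloc() are hypothetical names, the array is sized with one bit per CPU rather than with the patch's NR_CPUS / sizeof(u64) + 1, and the plain read-modify-write stands in for the atomic operation used in entry.S.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU count for this sketch; the kernel uses NR_CPUS. */
#define NR_CPUS_SKETCH 128
#define BITS_PER_WORD  64
#define NR_WORDS       ((NR_CPUS_SKETCH + BITS_PER_WORD - 1) / BITS_PER_WORD)

static uint64_t new_vmalloc[NR_WORDS];

/* Models flush_cache_vmap(): flag every CPU as possibly having a stale view. */
static void mark_new_vmalloc(void)
{
	for (size_t i = 0; i < NR_WORDS; ++i)
		new_vmalloc[i] = UINT64_MAX;
}

/*
 * Models the bit test in new_vmalloc_check: returns true and clears the
 * bit if this cpu was flagged, returns false otherwise.
 */
static bool test_and_clear_new_vmalloc(unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);

	unsigned int word = cpu >> 6;          /* cpu / 64: index of the u64 element */
	unsigned int bit = cpu - (word << 6);  /* cpu % 64: bit inside that element */
	uint64_t mask = UINT64_C(1) << bit;

	if (!(new_vmalloc[word] & mask))
		return false;

	/* Not atomic here; the assembly uses an AMO for this step. */
	new_vmalloc[word] &= ~mask;
	return true;
}
```

In this model, a trap on a flagged CPU consumes the flag once, so a second trap on the same CPU without a new mapping in between falls through to the regular fault handling.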
Hi Yunhui,

On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Hi Alexandre,
>
> On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > [...]
>
> Can we consider using new_vmalloc as a percpu variable, so that we
> don't need to add a0/1/2 in thread_info?

At first, I used percpu variables. But then I realized that percpu areas are allocated in the vmalloc area, so if somehow we take a trap when accessing the new_vmalloc percpu variable, we could not recover from this as we would trap forever in new_vmalloc_check. But admittedly, not sure that can happen.

And how would that remove a0, a1 and a2 from thread_info?
We'd still need to save some registers somewhere to access the percpu variable right?

> Also, try not to do too much
> calculation logic in new_vmalloc_check, after all, handle_exception is
> a high-frequency path. In this case, can we consider writing
> new_vmalloc_check in C language to increase readability?

If we write that in C, we don't have the control over the allocated registers and then we can't correctly save the context.

Thanks for your interest in this patchset :)

Alex

>
> Thanks,
> Yunhui
Hi Alexandre,

On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Yunhui,
>
> On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> >
> > Hi Alexandre,
> >
> > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > [...]
> >
> > Can we consider using new_vmalloc as a percpu variable, so that we
> > don't need to add a0/1/2 in thread_info?
>
> At first, I used percpu variables. But then I realized that percpu
> areas are allocated in the vmalloc area, so if somehow we take a trap
> when accessing the new_vmalloc percpu variable, we could not recover
> from this as we would trap forever in new_vmalloc_check. But
> admittedly, not sure that can happen.
>
> And how would that remove a0, a1 and a2 from thread_info? We'd still
> need to save some registers somewhere to access the percpu variable
> right?
>
> > Also, try not to do too much
> > calculation logic in new_vmalloc_check, after all, handle_exception is
> > a high-frequency path. In this case, can we consider writing
> > new_vmalloc_check in C language to increase readability?
>
> If we write that in C, we don't have the control over the allocated
> registers and then we can't correctly save the context.

If new_vmalloc_check were written in C, like do_irq(), it would need _save_context first. That is not worth the cost for new_vmalloc_check, because exceptions from user mode do not need the check at all, which also shows that it is reasonable to put new_vmalloc_check after _restore_kernel_tpsp.

Saving is necessary, but we can save a0, a1 and a2 without using thread_info. We can choose to save them on the kernel stack of the current tp, but then we need to add the following instructions:

	REG_S sp, TASK_TI_USER_SP(tp)
	REG_L sp, TASK_TI_KERNEL_SP(tp)
	addi sp, sp, -(PT_SIZE_ON_STACK)

Saving directly in thread_info is more direct, but saving on the kernel stack is more logically consistent, and there is no need to increase the size of thread_info.

As for the current status of the patch, there are two points that can be optimized:

1. Some hardware implementations may not cache invalid TLB entries, so it doesn't matter whether svvptc is available or not. Can we consider adding a CONFIG_RISCV_SVVPTC option to control it?

2. .macro new_vmalloc_check
	REG_S a0, TASK_TI_A0(tp)
	REG_S a1, TASK_TI_A1(tp)
	REG_S a2, TASK_TI_A2(tp)

When executing blt a0, zero, _new_vmalloc_restore_context, only a0 has been used, so a1 and a2 do not need to be saved before that branch.

>
> Thanks for your interest in this patchset :)
>
> Alex
>
> > Thanks,
> > Yunhui

Thanks,
Yunhui
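The ordering argument in point 2 can be seen from a C model of the filter at the top of new_vmalloc_check: the interrupt test depends only on the cause value, so a single scratch register is enough up to that first branch. This is an illustrative sketch assuming a 64-bit hart, not the patch's code; trap_may_be_new_vmalloc() is a hypothetical name, and the exception codes are the standard RISC-V page-fault causes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard RISC-V exception codes for page faults. */
#define EXC_INST_PAGE_FAULT   12
#define EXC_LOAD_PAGE_FAULT   13
#define EXC_STORE_PAGE_FAULT  15

/*
 * Returns true if a trap with this cause and faulting address is a
 * candidate for the new_vmalloc bitmap check.
 */
static bool trap_may_be_new_vmalloc(uint64_t cause, uint64_t tval)
{
	/* Step 1: interrupts have the top bit of the cause set ("blt a0, zero"). */
	if ((cause >> 63) != 0)
		return false;

	/* Step 2: only page faults are candidates. */
	if (cause != EXC_LOAD_PAGE_FAULT &&
	    cause != EXC_STORE_PAGE_FAULT &&
	    cause != EXC_INST_PAGE_FAULT)
		return false;

	/* Step 3: kernel addresses have the top bit set ("bge a0, zero" bails out). */
	return (tval >> 63) != 0;
}
```

Step 1 reads nothing but the cause, which is why the saves of the other two registers could be moved after it in the assembly.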
Hi Yunhui,

On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Hi Alexandre,
>
> On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > Hi Yunhui,
> >
> > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > >
> > > Hi Alexandre,
> > >
> > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > >
> > > > [...]
> > >
> > > Can we consider using new_vmalloc as a percpu variable, so that we
> > > don't need to add a0/1/2 in thread_info?
> >
> > At first, I used percpu variables. But then I realized that percpu
> > areas are allocated in the vmalloc area, so if somehow we take a trap
> > when accessing the new_vmalloc percpu variable, we could not recover
> > from this as we would trap forever in new_vmalloc_check. But
> > admittedly, not sure that can happen.
> >
> > And how would that remove a0, a1 and a2 from thread_info? We'd still
> > need to save some registers somewhere to access the percpu variable
> > right?
> >
> > > Also, try not to do too much
> > > calculation logic in new_vmalloc_check, after all, handle_exception is
> > > a high-frequency path. In this case, can we consider writing
> > > new_vmalloc_check in C language to increase readability?
> >
> > If we write that in C, we don't have the control over the allocated
> > registers and then we can't correctly save the context.
>
> If we use C language, new_vmalloc_check is written just like do_irq(),
> then we need _save_context, but for new_vmalloc_check, it is not worth
> the loss, because exceptions from user mode do not need
> new_vmalloc_check, which also shows that it is reasonable to put
> new_vmalloc_check after _restore_kernel_tpsp.
>
> Saving is necessary. We can save a0, a1, a2 without using thread_info.
> We can choose to save on the kernel stack of the current tp, but we
> need to add the following instructions:
> REG_S sp, TASK_TI_USER_SP(tp)
> REG_L sp, TASK_TI_KERNEL_SP(tp)
> addi sp, sp, -(PT_SIZE_ON_STACK)
> It seems that saving directly on thread_info is more direct, but
> saving on the kernel stack is more logically consistent, and there is
> no need to increase the size of thread_info.

You can't save on the kernel stack since kernel stacks are allocated
in the vmalloc area.

>
> As for the current status of the patch, there are two points that can
> be optimized:
> 1. Some chip hardware implementations may not cache TLB invalid
> entries, so it doesn't matter whether svvptc is available or not. Can
> we consider adding a CONFIG_RISCV_SVVPTC to control it?
>
> 2. .macro new_vmalloc_check
> REG_S a0, TASK_TI_A0(tp)
> REG_S a1, TASK_TI_A1(tp)
> REG_S a2, TASK_TI_A2(tp)
> When executing blt a0, zero, _new_vmalloc_restore_context, you can not
> save a1, a2 first

Ok, I can do that :)

Thanks again for your inputs,

Alex

> >
> > Thanks for your interest in this patchset :)
> >
> > Alex
> >
> > >
> > > Thanks,
> > > Yunhui
>
> Thanks,
> Yunhui
On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Yunhui,
>
> On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> >
> > Hi Alexandre,
> >
> > On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > Hi Yunhui,
> > >
> > > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > [...]
> > > >
> > > > Can we consider using new_vmalloc as a percpu variable, so that we
> > > > don't need to add a0/1/2 in thread_info?
> > >
> > > At first, I used percpu variables. But then I realized that percpu
> > > areas are allocated in the vmalloc area, so if somehow we take a trap
> > > when accessing the new_vmalloc percpu variable, we could not recover
> > > from this as we would trap forever in new_vmalloc_check. But
> > > admittedly, not sure that can happen.
> > >
> > > And how would that remove a0, a1 and a2 from thread_info? We'd still
> > > need to save some registers somewhere to access the percpu variable
> > > right?
> > >
> > > > Also, try not to do too much
> > > > calculation logic in new_vmalloc_check, after all, handle_exception is
> > > > a high-frequency path. In this case, can we consider writing
> > > > new_vmalloc_check in C language to increase readability?
> > >
> > > If we write that in C, we don't have the control over the allocated
> > > registers and then we can't correctly save the context.
> >
> > If we use C language, new_vmalloc_check is written just like do_irq(),
> > then we need _save_context, but for new_vmalloc_check, it is not worth
> > the loss, because exceptions from user mode do not need
> > new_vmalloc_check, which also shows that it is reasonable to put
> > new_vmalloc_check after _restore_kernel_tpsp.
> >
> > Saving is necessary. We can save a0, a1, a2 without using thread_info.
> > We can choose to save on the kernel stack of the current tp, but we
> > need to add the following instructions:
> > REG_S sp, TASK_TI_USER_SP(tp)
> > REG_L sp, TASK_TI_KERNEL_SP(tp)
> > addi sp, sp, -(PT_SIZE_ON_STACK)
> > It seems that saving directly on thread_info is more direct, but
> > saving on the kernel stack is more logically consistent, and there is
> > no need to increase the size of thread_info.
>
> You can't save on the kernel stack since kernel stacks are allocated
> in the vmalloc area.
>
> >
> > As for the current status of the patch, there are two points that can
> > be optimized:
> > 1. Some chip hardware implementations may not cache TLB invalid
> > entries, so it doesn't matter whether svvptc is available or not. Can
> > we consider adding a CONFIG_RISCV_SVVPTC to control it?

That would produce a non-portable kernel. But I'm not opposed to that
at all, let me check how we handle other extensions. Maybe @Conor
Dooley has some feedback here?

> >
> > 2. .macro new_vmalloc_check
> > REG_S a0, TASK_TI_A0(tp)
> > REG_S a1, TASK_TI_A1(tp)
> > REG_S a2, TASK_TI_A2(tp)
> > When executing blt a0, zero, _new_vmalloc_restore_context, you can not
> > save a1, a2 first
>
> Ok, I can do that :)
>
> Thanks again for your inputs,
>
> Alex
>
> > >
> > > Thanks for your interest in this patchset :)
> > >
> > > Alex
> > >
> > > >
> > > > Thanks,
> > > > Yunhui
> >
> > Thanks,
> > Yunhui
On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > >
> > > As for the current status of the patch, there are two points that can
> > > be optimized:
> > > 1. Some chip hardware implementations may not cache TLB invalid
> > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
>
> That would produce a non-portable kernel. But I'm not opposed to that
> at all, let me check how we handle other extensions. Maybe @Conor
> Dooley has some feedback here?

To be honest, not really sure what to give feedback on. Could you
elaborate on exactly what the option is going to do? Given the
portability concern, I guess you were proposing that the option would
remove the preventative fences, rather than your current patch that
removes them via an alternative?
I don't think we have any extension
related options that work like that at the moment, and making that an
option will just mean that distros that look to cater for multiple
platforms won't be able to turn it on.

Thanks,
Conor.
On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > >
> > > > As for the current status of the patch, there are two points that can
> > > > be optimized:
> > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> >
> > That would produce a non-portable kernel. But I'm not opposed to that
> > at all, let me check how we handle other extensions. Maybe @Conor
> > Dooley has some feedback here?
>
> To be honest, not really sure what to give feedback on. Could you
> elaborate on exactly what the option is going to do? Given the
> portability concern, I guess you were proposing that the option would
> remove the preventative fences, rather than your current patch that
> removes them via an alternative?

No no, I won't do that, we need a generic kernel for distros so that's
not even a question. What Yunhui was asking about (to me) is: can we
introduce a Kconfig option to always remove the preventive fences,
bypassing the use of alternatives altogether?

To me, it won't make a difference in terms of performance. But if we
already offer such a possibility for other extensions, well I'll do
it. Otherwise, the question is: should we start doing that?

> I don't think we have any extension
> related options that work like that at the moment, and making that an
> option will just mean that distros that look to cater for multiple
> platforms won't be able to turn it on.
>
> Thanks,
> Conor.
On Tue, Jun 04, 2024 at 01:44:15PM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley <conor@kernel.org> wrote:
> >
> > On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > > >
> > > > > As for the current status of the patch, there are two points that can
> > > > > be optimized:
> > > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> > >
> > > That would produce a non-portable kernel. But I'm not opposed to that
> > > at all, let me check how we handle other extensions. Maybe @Conor
> > > Dooley has some feedback here?
> >
> > To be honest, not really sure what to give feedback on. Could you
> > elaborate on exactly what the option is going to do? Given the
> > portability concern, I guess you were proposing that the option would
> > remove the preventative fences, rather than your current patch that
> > removes them via an alternative?
>
> No no, I won't do that, we need a generic kernel for distros so that's
> not even a question. What Yunhui was asking about (to me) is: can we
> introduce a Kconfig option to always remove the preventive fences,
> bypassing the use of alternatives altogether?
>
> To me, it won't make a difference in terms of performance. But if we
> already offer such a possibility for other extensions, well I'll do
> it. Otherwise, the question is: should we start doing that?

We don't do that for other extensions yet, because currently all the
extensions we have options for are additive. There's like 3 alternative
patchsites, and they are all just one nop? I don't see the point of
having a Kconfig knob for that.
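For readers following the Kconfig-versus-alternatives exchange above, here is a rough user-space sketch of the trade-off. It is only an illustration under stated assumptions: `cpu_has_svvptc`, `fake_sfence_vma()` and the `SKETCH_ASSUME_SVVPTC` macro are hypothetical names, not kernel APIs. The real patch uses `ALTERNATIVE()` to patch the instruction at boot rather than testing a flag.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the boot-time ISA probe; not a kernel API. */
static bool cpu_has_svvptc;

/* Counts how many times the (simulated) sfence.vma would have run. */
static unsigned int fences_emitted;

static void fake_sfence_vma(void)
{
	fences_emitted++;
}

/*
 * Runtime selection: one binary for every platform. The fence is skipped
 * only when the hardware reported Svvptc at boot.
 */
static void fault_fixup_runtime(void)
{
	if (!cpu_has_svvptc)
		fake_sfence_vma();
}

/*
 * Build-time selection: defining the hypothetical SKETCH_ASSUME_SVVPTC
 * removes the fence from the binary, even on hardware that needs it.
 * That is the portability concern raised in the thread.
 */
static void fault_fixup_buildtime(void)
{
#ifndef SKETCH_ASSUME_SVVPTC
	fake_sfence_vma();
#endif
}
```

With the macro left undefined, `fault_fixup_buildtime()` always fences, while `fault_fixup_runtime()` follows whatever the probe reported.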
diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h index a129dac4521d..b0d631701757 100644 --- a/arch/riscv/include/asm/cacheflush.h +++ b/arch/riscv/include/asm/cacheflush.h @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page) flush_icache_mm(vma->vm_mm, 0) #ifdef CONFIG_64BIT -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end) +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1]; +extern char _end[]; +#define flush_cache_vmap flush_cache_vmap +static inline void flush_cache_vmap(unsigned long start, unsigned long end) +{ + if (is_vmalloc_or_module_addr((void *)start)) { + int i; + + /* + * We don't care if concurrently a cpu resets this value since + * the only place this can happen is in handle_exception() where + * an sfence.vma is emitted. + */ + for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i) + new_vmalloc[i] = -1ULL; + } +} #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end) #endif diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h index 5d473343634b..32631acdcdd4 100644 --- a/arch/riscv/include/asm/thread_info.h +++ b/arch/riscv/include/asm/thread_info.h @@ -60,6 +60,11 @@ struct thread_info { void *scs_base; void *scs_sp; #endif + /* + * Used in handle_exception() to save a0, a1 and a2 before knowing if we + * can access the kernel stack. 
+	 */
+	unsigned long a0, a1, a2;
 };
 
 #ifdef CONFIG_SHADOW_CALL_STACK
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..939ddc0e3c6e 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -35,6 +35,8 @@ void asm_offsets(void)
 	OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
 	OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
 	OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+
+	OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
 	OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
 	OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
 	OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -42,6 +44,9 @@ void asm_offsets(void)
 #ifdef CONFIG_SHADOW_CALL_STACK
 	OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
 #endif
+	OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
+	OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
+	OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
 
 	OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
 	OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 9d1a305d5508..c1ffaeaba7aa 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -19,6 +19,78 @@
 
 	.section .irqentry.text, "ax"
 
+.macro new_vmalloc_check
+	REG_S	a0, TASK_TI_A0(tp)
+	REG_S	a1, TASK_TI_A1(tp)
+	REG_S	a2, TASK_TI_A2(tp)
+
+	csrr	a0, CSR_CAUSE
+	/* Exclude IRQs */
+	blt	a0, zero, _new_vmalloc_restore_context
+	/* Only check new_vmalloc if we are in page/protection fault */
+	li	a1, EXC_LOAD_PAGE_FAULT
+	beq	a0, a1, _new_vmalloc_kernel_address
+	li	a1, EXC_STORE_PAGE_FAULT
+	beq	a0, a1, _new_vmalloc_kernel_address
+	li	a1, EXC_INST_PAGE_FAULT
+	bne	a0, a1, _new_vmalloc_restore_context
+
+_new_vmalloc_kernel_address:
+	/* Is it a kernel address? */
+	csrr	a0, CSR_TVAL
+	bge	a0, zero, _new_vmalloc_restore_context
+
+	/* Check if a new vmalloc mapping appeared that could explain the trap */
+
+	/*
+	 * Computes:
+	 * a0 = &new_vmalloc[BIT_WORD(cpu)]
+	 * a1 = BIT_MASK(cpu)
+	 */
+	REG_L	a2, TASK_TI_CPU(tp)
+	/*
+	 * Compute the new_vmalloc element position:
+	 * (cpu / 64) * 8 = (cpu >> 6) << 3
+	 */
+	srli	a1, a2, 6
+	slli	a1, a1, 3
+	la	a0, new_vmalloc
+	add	a0, a0, a1
+	/*
+	 * Compute the bit position in the new_vmalloc element:
+	 * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
+	 *         = cpu - ((cpu >> 6) << 3) << 3
+	 */
+	slli	a1, a1, 3
+	sub	a1, a2, a1
+	/* Compute the "get mask": 1 << bit_pos */
+	li	a2, 1
+	sll	a1, a2, a1
+
+	/* Check the value of new_vmalloc for this cpu */
+	REG_L	a2, 0(a0)
+	and	a2, a2, a1
+	beq	a2, zero, _new_vmalloc_restore_context
+
+	/* Atomically reset the current cpu bit in new_vmalloc */
+	amoxor.w	a0, a1, (a0)
+
+	/* Only emit a sfence.vma if the uarch caches invalid entries */
+	ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
+
+	REG_L	a0, TASK_TI_A0(tp)
+	REG_L	a1, TASK_TI_A1(tp)
+	REG_L	a2, TASK_TI_A2(tp)
+	csrw	CSR_SCRATCH, x0
+	sret
+
+_new_vmalloc_restore_context:
+	REG_L	a0, TASK_TI_A0(tp)
+	REG_L	a1, TASK_TI_A1(tp)
+	REG_L	a2, TASK_TI_A2(tp)
+.endm
+
+
 SYM_CODE_START(handle_exception)
 	/*
 	 * If coming from userspace, preserve the user thread pointer and load
@@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
 
 .Lrestore_kernel_tpsp:
 	csrr	tp, CSR_SCRATCH
+
+	/*
+	 * The RISC-V kernel does not eagerly emit a sfence.vma after each
+	 * new vmalloc mapping, which may result in exceptions:
+	 * - if the uarch caches invalid entries, the new mapping would not be
+	 *   observed by the page table walker and an invalidation is needed.
+	 * - if the uarch does not cache invalid entries, a reordered access
+	 *   could "miss" the new mapping and traps: in that case, we only need
+	 *   to retry the access, no sfence.vma is required.
+	 */
+	new_vmalloc_check
+
 	REG_S	sp, TASK_TI_KERNEL_SP(tp)
 
 #ifdef CONFIG_VMAP_STACK
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index eafc4c2200f2..54c9fdeda11e 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -36,6 +36,8 @@
 
 #include "../kernel/head.h"
 
+u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+
 struct kernel_mapping kernel_map __ro_after_init;
 EXPORT_SYMBOL(kernel_map);
 #ifdef CONFIG_XIP_KERNEL
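
The word-offset and bit-mask arithmetic done by the srli/slli/sub/sll sequence in new_vmalloc_check may be easier to follow in C. The following is only a sketch of that arithmetic, assuming 64-bit bitmap words; the helper names new_vmalloc_word_offset() and new_vmalloc_bit_mask() are hypothetical and do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): byte offset, from the start of
 * the bitmap, of the 64-bit word that holds the bit for this cpu.
 * Mirrors "srli a1, a2, 6; slli a1, a1, 3", i.e. (cpu / 64) * 8.
 */
static inline size_t new_vmalloc_word_offset(unsigned long cpu)
{
	return (cpu >> 6) << 3;
}

/*
 * Hypothetical helper (not in the patch): mask selecting the bit of this
 * cpu inside its word. Mirrors "slli a1, a1, 3; sub a1, a2, a1; sll":
 * shifting the byte offset left by 3 again gives (cpu / 64) * 64, and
 * subtracting that from cpu leaves cpu % 64.
 */
static inline uint64_t new_vmalloc_bit_mask(unsigned long cpu)
{
	unsigned long byte_off = (cpu >> 6) << 3;
	unsigned long bit_pos = cpu - (byte_off << 3);

	return (uint64_t)1 << bit_pos;
}
```

For example, cpu 130 lands in the third word (byte offset 16) at bit position 2.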
In 6.5, we removed the vmalloc fault path because that can't work (see
[1] [2]). Then in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit a sfence.vma
on all harts [3] but this solution is very costly since it relies on IPI.

And even there, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.

Those preventive sfence.vma needed to be emitted because:

- if the uarch caches invalid entries, the new mapping may not be
  observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
  could "miss" the new mapping and traps: in that case, we would actually
  only need to retry the access, no sfence.vma is required.

So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stacks
mappings lie in the vmalloc area, this handling must be done very early
when the trap is taken, at the very beginning of handle_exception: this
also rules out the vmalloc allocations in the fault path.

Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/cacheflush.h  | 18 +++++-
 arch/riscv/include/asm/thread_info.h |  5 ++
 arch/riscv/kernel/asm-offsets.c      |  5 ++
 arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
 arch/riscv/mm/init.c                 |  2 +
 5 files changed, 113 insertions(+), 1 deletion(-)
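
The scheme in the commit message (flag every cpu when a new vmalloc mapping appears, then let each cpu consume its own flag on a kernel page fault) can be modelled in user space. This is a rough sketch under assumptions, not the kernel code: all model_* names and MODEL_NR_CPUS are hypothetical, the mapping side is assumed to simply set every bit, C11 atomics stand in for the AMO instruction, and a counter stands in for the local sfence.vma.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS	128	/* assumption for this sketch only */
#define MODEL_WORDS	((MODEL_NR_CPUS + 63) / 64)

/* One bit per cpu: "a new vmalloc mapping may not be visible here yet". */
static _Atomic uint64_t model_new_vmalloc[MODEL_WORDS];

/* Mapping side (assumed behaviour): flag every cpu, send no IPI. */
static void model_mark_new_mapping(void)
{
	for (int i = 0; i < MODEL_WORDS; i++)
		atomic_store(&model_new_vmalloc[i], UINT64_MAX);
}

/*
 * Fault side, run at the very start of a kernel page fault on `cpu`.
 * Returns true when the faulting access should just be retried.
 * `fences` counts the local invalidations a uarch without Svvptc
 * would have to perform.
 */
static bool model_fault_should_retry(unsigned int cpu, bool has_svvptc,
				     unsigned int *fences)
{
	_Atomic uint64_t *word = &model_new_vmalloc[cpu / 64];
	uint64_t mask = (uint64_t)1 << (cpu % 64);

	if (!(atomic_load(word) & mask))
		return false;	/* nothing pending: regular fault handling */

	/*
	 * Clear only this cpu's bit. The patch uses an atomic xor here;
	 * an and-with-complement is used in this sketch instead.
	 */
	atomic_fetch_and(word, ~mask);

	if (!has_svvptc)
		(*fences)++;	/* stands in for a local sfence.vma */

	return true;
}
```

In this model a cpu pays for at most one retry (plus one local fence without Svvptc) per flagged mapping, and a second fault with the flag already cleared falls through to the regular handler.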