Message ID | 20231030133027.19542-1-alexghiti@rivosinc.com (mailing list archive)
---|---
Series | riscv: tlb flush improvements
> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> +		on_each_cpu_mask(cmask,
> +				 __ipi_flush_tlb_range_asid,
> +				 &ftd, 1);
>

Unrelated, but having fed on the stack might cause it to be unaligned to the cacheline, which in x86 we have seen introduces some overhead. Actually, it is best not to put it on the stack, if possible to reduce cache traffic.
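For context, the quoted hunk passes a stack-allocated argument block to the cross-CPU flush. A minimal sketch of that pattern, reconstructed from the quoted call (the struct layout and helper names follow the series as merged, but treat them as illustrative rather than authoritative):

```c
#include <linux/smp.h>
#include <linux/cpumask.h>

/* Argument block handed to the remote CPUs; in the quoted hunk it is a
 * local variable, so every CPU in cmask ends up reading the sender's
 * stack frame through &ftd.
 */
struct flush_tlb_range_data {
	unsigned long asid;
	unsigned long start;
	unsigned long size;
	unsigned long stride;
};

static void __ipi_flush_tlb_range_asid(void *info);	/* IPI handler, sketched below */

static void flush_tlb_range_sketch(const struct cpumask *cmask,
				   unsigned long asid, unsigned long start,
				   unsigned long size, unsigned long stride)
{
	struct flush_tlb_range_data ftd;	/* lives on this stack frame */

	ftd.asid = asid;
	ftd.start = start;
	ftd.size = size;
	ftd.stride = stride;
	/* wait == 1: the frame stays live until all handlers have run */
	on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, &ftd, 1);
}
```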
Hello:

This series was applied to riscv/linux.git (for-next)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Mon, 30 Oct 2023 14:30:24 +0100 you wrote:
> This series optimizes the tlb flushes on riscv which used to simply
> flush the whole tlb whatever the size of the range to flush or the size
> of the stride.
>
> Patch 3 introduces a threshold that is microarchitecture specific and
> will very likely be modified by vendors, not sure though which mechanism
> we'll use to do that (dt? alternatives? vendor initialization code?).
>
> [...]

Here is the summary with links:
  - [v6,1/4] riscv: Improve tlb_flush()
    https://git.kernel.org/riscv/c/c5e9b2c2ae82
  - [v6,2/4] riscv: Improve flush_tlb_range() for hugetlb pages
    https://git.kernel.org/riscv/c/c962a6e74639
  - [v6,3/4] riscv: Make __flush_tlb_range() loop over pte instead of flushing the whole tlb
    https://git.kernel.org/riscv/c/9d4e8d5fa7db
  - [v6,4/4] riscv: Improve flush_tlb_kernel_range()
    https://git.kernel.org/riscv/c/5e22bfd520ea

You are awesome, thank you!
On Mon, 30 Oct 2023 07:01:48 PDT (-0700), nadav.amit@gmail.com wrote:
>
>> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>>
>> +		on_each_cpu_mask(cmask,
>> +				 __ipi_flush_tlb_range_asid,
>> +				 &ftd, 1);
>>
>
> Unrelated, but having fed

Do you mean `ftd`?

If so I'm not all that convinced that's a problem: sure it's 4x`long`, so we pass it on the stack instead of registers, but otherwise we'd need another `on_each_cpu_mask()` callback to shim stuff through via registers.

> on the stack might cause it to be unaligned to
> the cacheline, which in x86 we have seen introduces some overhead.

We have 128-bit stack alignment on RISC-V, so the elements are at least aligned. Since they're just being loaded up as scalars for the next function call I'm not sure the alignment is all that exciting here.

> Actually, it is best not to put it on the stack, if possible to reduce
> cache traffic.

Sorry if I'm just missing something, but I'm not convinced this is a measurable performance problem.
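Palmer's point about needing another callback follows from the cross-call API: `on_each_cpu_mask()` forwards exactly one `void *` to the handler, so a multi-word argument has to travel through memory somewhere. A sketch of the receiving side, reconstructed from the series (the exact signature of the local flush helper is an assumption):

```c
/* IPI handler: the smp_call_func_t signature carries a single pointer,
 * so the four words are unpacked from the shared argument block.
 */
static void __ipi_flush_tlb_range_asid(void *info)
{
	struct flush_tlb_range_data *d = info;

	local_flush_tlb_range_asid(d->start, d->size, d->stride, d->asid);
}
```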
> On Nov 7, 2023, at 9:00 AM, Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Mon, 30 Oct 2023 07:01:48 PDT (-0700), nadav.amit@gmail.com wrote:
>>
>>> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>>>
>>> +		on_each_cpu_mask(cmask,
>>> +				 __ipi_flush_tlb_range_asid,
>>> +				 &ftd, 1);
>>
>> Unrelated, but having fed
>
> Do you mean `ftd`?
>
> If so I'm not all that convinced that's a problem: sure it's 4x`long`, so we pass it on the stack instead of registers, but otherwise we'd need another `on_each_cpu_mask()` callback to shim stuff through via registers.

I have no idea why you would need to move stuff through the registers.

>> Actually, it is best not to put it on the stack, if possible to reduce
>> cache traffic.
>
> Sorry if I'm just missing something, but I'm not convinced this is a measurable performance problem.

I am not going to try to convince you (I ran the numbers on x86 a long time ago). There is a cost to bouncing cache lines (because multiple cores access the stack) and to TLB misses on remote cores (which are mostly avoidable if ftd is global).

Having said that, the optimizations you added now and intend to add in the next steps are definitely more important for performance.
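One hypothetical shape of Nadav's suggestion, not taken from the series (`flush_tlb_data` and the function name are invented for illustration): keep the argument block in per-CPU kernel data rather than on the stack, so remote cores read a fixed kernel address, typically covered by the kernel's large linear mapping, instead of the sender's stack cache line.

```c
#include <linux/percpu.h>
#include <linux/preempt.h>

/* Hypothetical alternative: one argument block per CPU instead of one
 * per stack frame, so concurrent flushers on different CPUs don't race.
 */
static DEFINE_PER_CPU(struct flush_tlb_range_data, flush_tlb_data);

static void flush_tlb_range_global_sketch(const struct cpumask *cmask,
					  unsigned long asid,
					  unsigned long start,
					  unsigned long size,
					  unsigned long stride)
{
	struct flush_tlb_range_data *ftd;

	/* Stay on this CPU so our per-CPU slot is stable while remote
	 * cores read it; wait == 1 keeps it live until they all finish.
	 */
	preempt_disable();
	ftd = this_cpu_ptr(&flush_tlb_data);
	ftd->asid = asid;
	ftd->start = start;
	ftd->size = size;
	ftd->stride = stride;
	on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, ftd, 1);
	preempt_enable();
}
```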