Message ID | 20200329141258.31172-1-ardb@kernel.org (mailing list archive) |
---|---|
State | Mainlined |
Commit | e16e65a02913d29a7b27c4e3a415ceec967b0629 |
Series | [RFC] arm64: remove CONFIG_DEBUG_ALIGN_RODATA feature
On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> In particular, it permits the segments to be mapped using level 2
> block entries when using 4k pages, which is expected to result in less
> TLB pressure.
>
> However, the mappings for the bulk of the kernel will use level 2
> entries anyway, and the misaligned fringes are organized such that they
> can take advantage of the contiguous bit, and use far fewer level 3
> entries than would be needed otherwise.
>
> This makes the value of this feature dubious at best, and since it is not
> enabled in defconfig or in the distro configs, it does not appear to be
> in wide use either. So let's just remove it.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

No strong feelings either way, but getting rid of code is usually good,
so:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
>  arch/arm64/Kconfig.debug                  | 13 -------------
>  arch/arm64/include/asm/memory.h           | 12 +-----------
>  drivers/firmware/efi/libstub/arm64-stub.c |  8 +++-----
>  3 files changed, 4 insertions(+), 29 deletions(-)
>
> [...]
On Mon, 30 Mar 2020 at 13:29, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> > When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> > different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> > etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> > [...]
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> No strong feelings either way, but getting rid of code is usually good,
> so:
>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
>

Thanks Mark.

This is related to [0], which increases the PE/COFF section alignment to
64k so that a KASLR enabled kernel always lands at an address at which it
can execute without being moved around first. This is an improvement in
itself, but also provides 5 bits (log2(2M / 64k)) of wiggle room for the
virtual as well as the physical placement of the kernel.

CONFIG_DEBUG_ALIGN_RODATA kind of interferes with that, so I'd like to
get rid of it.

Catalin, Will: if you have no objections, I can include this in my series
for v5.8 and take it via the EFI tree.

Thanks,
Ard.

[0] https://lore.kernel.org/linux-efi/20200326165905.2240-1-ardb@kernel.org/
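(As a quick sanity check of the "5 bits" figure above, the following
stand-alone C sketch mirrors the mask/offset computation in the
arm64-stub.c hunk of this patch. The MIN_KIMG_ALIGN = 2 MiB and
EFI_KIMG_ALIGN = 64 KiB values are assumptions taken from this discussion
rather than copied from the tree, and the seed is an arbitrary example.)

/*
 * Stand-alone sketch (not kernel code) of the offset computation in
 * handle_kernel_image() after this patch. The constants below are
 * assumptions from the discussion: 2 MiB de-facto kernel alignment,
 * 64 KiB PE/COFF section alignment.
 */
#include <stdint.h>
#include <stdio.h>

#define SZ_64K          0x00010000UL
#define SZ_2M           0x00200000UL

#define MIN_KIMG_ALIGN  SZ_2M   /* assumed de-facto kernel alignment */
#define EFI_KIMG_ALIGN  SZ_64K  /* assumed PE/COFF section alignment */

int main(void)
{
        uint64_t phys_seed = 0xdeadbeefcafef00dULL; /* example KASLR seed */

        /* bits that may vary between the two alignments: 0x1f0000 */
        uint32_t mask = (MIN_KIMG_ALIGN - 1) & ~(EFI_KIMG_ALIGN - 1);
        uint32_t offset = (uint32_t)(phys_seed >> 32) & mask;

        /* five usable bits, i.e. log2(2M / 64K) */
        printf("mask   = %#x\n", (unsigned int)mask);
        printf("offset = %#x (a multiple of 64 KiB, below 2 MiB)\n",
               (unsigned int)offset);
        return 0;
}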
On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> [...]
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/Kconfig.debug                  | 13 -------------
>  arch/arm64/include/asm/memory.h           | 12 +-----------
>  drivers/firmware/efi/libstub/arm64-stub.c |  8 +++-----
>  3 files changed, 4 insertions(+), 29 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

But I would really like to go a step further and rip out the block mapping
support altogether so that we can fix non-coherent DMA aliases:

https://lore.kernel.org/lkml/20200224194446.690816-1-hch@lst.de

Will
On Mon, 30 Mar 2020 at 15:51, Will Deacon <will@kernel.org> wrote:
>
> On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> > When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> > different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> > etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> > [...]
>
> Acked-by: Will Deacon <will@kernel.org>
>
> But I would really like to go a step further and rip out the block mapping
> support altogether so that we can fix non-coherent DMA aliases:
>
> https://lore.kernel.org/lkml/20200224194446.690816-1-hch@lst.de
>

I'm not sure I follow - is this about mapping parts of the static
kernel Image for non-coherent DMA?
On 3/29/20 10:12 AM, Ard Biesheuvel wrote:
> When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> [...]
> This makes the value of this feature dubious at best, and since it is not
> enabled in defconfig or in the distro configs, it does not appear to be
> in wide use either. So let's just remove it.
>

This was supposed to avoid potential performance inconsistencies with some
memory being in pages vs other in block. It was always a minor concern so
if it's preventing other work:

Acked-by: Laura Abbott <labbott@kernel.org>

> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/Kconfig.debug                  | 13 -------------
>  arch/arm64/include/asm/memory.h           | 12 +-----------
>  drivers/firmware/efi/libstub/arm64-stub.c |  8 +++-----
>  3 files changed, 4 insertions(+), 29 deletions(-)
>
> [...]
On 2020-03-30 2:53 pm, Ard Biesheuvel wrote:
> On Mon, 30 Mar 2020 at 15:51, Will Deacon <will@kernel.org> wrote:
>>
>> On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
>>> When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
>>> different permissions (r-x for .text, r-- for .rodata, rw- for .data,
>>> etc) are rounded up to 2 MiB so they can be mapped more efficiently.
>>> [...]
>>
>> Acked-by: Will Deacon <will@kernel.org>
>>
>> But I would really like to go a step further and rip out the block mapping
>> support altogether so that we can fix non-coherent DMA aliases:
>>
>> https://lore.kernel.org/lkml/20200224194446.690816-1-hch@lst.de
>>
>
> I'm not sure I follow - is this about mapping parts of the static
> kernel Image for non-coherent DMA?

Yikes, I hope not! The concern there is about block entries in the linear
map; I'd assume kernel text/data means not-linear-map, and is thus a
different kettle of fish anyway.

Robin.
On Mon, Mar 30, 2020 at 03:53:04PM +0200, Ard Biesheuvel wrote:
> On Mon, 30 Mar 2020 at 15:51, Will Deacon <will@kernel.org> wrote:
> >
> > On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> > > When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> > > different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> > > etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> > > [...]
> >
> > Acked-by: Will Deacon <will@kernel.org>
> >
> > But I would really like to go a step further and rip out the block mapping
> > support altogether so that we can fix non-coherent DMA aliases:
> >
> > https://lore.kernel.org/lkml/20200224194446.690816-1-hch@lst.de
> >
>
> I'm not sure I follow - is this about mapping parts of the static
> kernel Image for non-coherent DMA?

Sorry, it's not directly related to your patch, just that if we're removing
options relating to kernel mappings then I'd be quite keen on effectively
forcing page-granularity on the linear map, as is currently done by default
thanks to RODATA_FULL_DEFAULT_ENABLED, so that we can nobble cacheable
aliases for non-coherent streaming DMA mappings by hooking into Christoph's
series above.

This series just reminded me of it because it's another
"off-by-default-behaviour-for-block-mappings-probably-because-of-performance-
but-never-actually-measured" type of thing which really just gets in the
way.

Will
On Mon, 30 Mar 2020 at 16:04, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Mar 30, 2020 at 03:53:04PM +0200, Ard Biesheuvel wrote:
> > On Mon, 30 Mar 2020 at 15:51, Will Deacon <will@kernel.org> wrote:
> > > [...]
> > >
> > > But I would really like to go a step further and rip out the block mapping
> > > support altogether so that we can fix non-coherent DMA aliases:
> > >
> > > https://lore.kernel.org/lkml/20200224194446.690816-1-hch@lst.de
> > >
> >
> > I'm not sure I follow - is this about mapping parts of the static
> > kernel Image for non-coherent DMA?
>
> Sorry, it's not directly related to your patch, just that if we're removing
> options relating to kernel mappings then I'd be quite keen on effectively
> forcing page-granularity on the linear map, as is currently done by default
> thanks to RODATA_FULL_DEFAULT_ENABLED, so that we can nobble cacheable
> aliases for non-coherent streaming DMA mappings by hooking into Christoph's
> series above.
>

Right. I don't remember seeing any complaints about
RODATA_FULL_DEFAULT_ENABLED, but maybe it's too early for that.

> This series just reminded me of it because it's another
> "off-by-default-behaviour-for-block-mappings-probably-because-of-performance-
> but-never-actually-measured" type of thing which really just gets in the
> way.
>

Well, even though I agree that the lack of actual numbers is a bit
disturbing here, I'd hate to penalize all systems even more than they
already are (due to ARCH_KMALLOC_MINALIGN == ARCH_DMA_MINALIGN) by
adding another workaround that is only needed on devices that have
non-coherent masters.
On Mon, Mar 30, 2020 at 04:22:24PM +0200, Ard Biesheuvel wrote:
> On Mon, 30 Mar 2020 at 16:04, Will Deacon <will@kernel.org> wrote:
> >
> > [...]
> >
> > Sorry, it's not directly related to your patch, just that if we're removing
> > options relating to kernel mappings then I'd be quite keen on effectively
> > forcing page-granularity on the linear map, as is currently done by default
> > thanks to RODATA_FULL_DEFAULT_ENABLED, so that we can nobble cacheable
> > aliases for non-coherent streaming DMA mappings by hooking into Christoph's
> > series above.
>
> Right. I don't remember seeing any complaints about
> RODATA_FULL_DEFAULT_ENABLED, but maybe it's too early for that.
>
> > This series just reminded me of it because it's another
> > "off-by-default-behaviour-for-block-mappings-probably-because-of-performance-
> > but-never-actually-measured" type of thing which really just gets in the
> > way.
>
> Well, even though I agree that the lack of actual numbers is a bit
> disturbing here, I'd hate to penalize all systems even more than they
> already are (due to ARCH_KMALLOC_MINALIGN == ARCH_DMA_MINALIGN) by
> adding another workaround that is only needed on devices that have
> non-coherent masters.

Fair enough, but I'd still like to see some numbers. If they're compelling,
then we could explore something like CONFIG_OF_DMA_DEFAULT_COHERENT, but
that doesn't really help the kconfig maze :(

Will
On Mon, 30 Mar 2020 at 16:28, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Mar 30, 2020 at 04:22:24PM +0200, Ard Biesheuvel wrote:
> > [...]
> >
> > Well, even though I agree that the lack of actual numbers is a bit
> > disturbing here, I'd hate to penalize all systems even more than they
> > already are (due to ARCH_KMALLOC_MINALIGN == ARCH_DMA_MINALIGN) by
> > adding another workaround that is only needed on devices that have
> > non-coherent masters.
>
> Fair enough, but I'd still like to see some numbers. If they're compelling,
> then we could explore something like CONFIG_OF_DMA_DEFAULT_COHERENT, but
> that doesn't really help the kconfig maze :(
>

Could we make this a runtime thing? E.g., remap the entire linear
region down to pages under stop_machine() the first time we probe a
device that uses non-coherent DMA? (/me ducks)
On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> different permissions (r-x for .text, r-- for .rodata, rw- for .data,
> etc) are rounded up to 2 MiB so they can be mapped more efficiently.
> [...]
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Happy to take this patch via the arm64 tree for 5.7 (no new
functionality), unless you want it to go with your other relocation
logic in the EFI stub patches.
On Thu, 2 Apr 2020 at 13:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Sun, Mar 29, 2020 at 04:12:58PM +0200, Ard Biesheuvel wrote:
> > When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
> > [...]
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> Happy to take this patch via the arm64 tree for 5.7 (no new
> functionality), unless you want it to go with your other relocation
> logic in the EFI stub patches.
>

If you don't mind taking it for v5.7, please go ahead.
On Mon, Mar 30, 2020 at 04:32:31PM +0200, Ard Biesheuvel wrote:
> On Mon, 30 Mar 2020 at 16:28, Will Deacon <will@kernel.org> wrote:
> > [...]
> > > > Sorry, it's not directly related to your patch, just that if we're removing
> > > > options relating to kernel mappings then I'd be quite keen on effectively
> > > > forcing page-granularity on the linear map, as is currently done by default
> > > > thanks to RODATA_FULL_DEFAULT_ENABLED, so that we can nobble cacheable
> > > > aliases for non-coherent streaming DMA mappings by hooking into Christoph's
> > > > series above.

Have we ever hit this issue in practice? At least from the CPU
perspective, we've assumed that a non-cacheable access would not hit in
the cache. Reading the ARM ARM rules, it doesn't seem to state this
explicitly but we can ask for clarification (I dug out an email from
2015, left unanswered).

Assuming that the CPU is behaving as we'd expect, are there other issues
with peripherals/SMMU?

> > Fair enough, but I'd still like to see some numbers. If they're compelling,
> > then we could explore something like CONFIG_OF_DMA_DEFAULT_COHERENT, but
> > that doesn't really help the kconfig maze :(

I'd prefer not to have a config option, we could easily break single
Image at some point.

> Could we make this a runtime thing? E.g., remap the entire linear
> region down to pages under stop_machine() the first time we probe a
> device that uses non-coherent DMA?

That could be pretty expensive at run-time. With the ARMv8.4-TTRem
feature, I wonder whether we could do this lazily when allocating
non-coherent DMA buffers.

(I still hope there isn't a problem at all with this mismatch ;)).
On Thu, Apr 02, 2020 at 12:30:33PM +0100, Catalin Marinas wrote:
> On Mon, Mar 30, 2020 at 04:32:31PM +0200, Ard Biesheuvel wrote:
> > Could we make this a runtime thing? E.g., remap the entire linear
> > region down to pages under stop_machine() the first time we probe a
> > device that uses non-coherent DMA?
>
> That could be pretty expensive at run-time. With the ARMv8.4-TTRem
> feature, I wonder whether we could do this lazily when allocating
> non-coherent DMA buffers.

It's worth noting that ARMv8.4-TTRem is optional, and the "level 1" and
"level 2" behaviours still allow non-fatal faults to be taken while nT is
set (or until you perform the TLB invalidation). We can only safely use it
where we could use a full BBM sequence today.

Effectively TTRem is an optimization, where the CPU *might* be able to use
entries during the break stage, but is not guaranteed to do so.

Thanks,
Mark.
On Thu, Apr 02, 2020 at 12:30:33PM +0100, Catalin Marinas wrote:
> On Mon, Mar 30, 2020 at 04:32:31PM +0200, Ard Biesheuvel wrote:
> > [...]
>
> Have we ever hit this issue in practice? At least from the CPU
> perspective, we've assumed that a non-cacheable access would not hit in
> the cache. Reading the ARM ARM rules, it doesn't seem to state this
> explicitly but we can ask for clarification (I dug out an email from
> 2015, left unanswered).
>
> Assuming that the CPU is behaving as we'd expect, are there other issues
> with peripherals/SMMU?

Any clarification would need to be architectural, I think, and not
specific to the CPU. Worth asking again though.

Will
On Thu, 2 Apr 2020 at 13:30, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Mon, Mar 30, 2020 at 04:32:31PM +0200, Ard Biesheuvel wrote:
> > [...]
>
> Have we ever hit this issue in practice? At least from the CPU
> perspective, we've assumed that a non-cacheable access would not hit in
> the cache. Reading the ARM ARM rules, it doesn't seem to state this
> explicitly but we can ask for clarification (I dug out an email from
> 2015, left unanswered).
>

There is some wording in D4.4.5 (Behavior of caches at reset) that
suggests that implementations may permit cache hits in regions that are
mapped Non-cacheable (although the paragraph in question talks about
global controls and not page table attributes).

> Assuming that the CPU is behaving as we'd expect, are there other issues
> with peripherals/SMMU?
>

There is the NoSnoop PCIe issue as well: PCIe masters that are DMA
coherent in general can generate transactions with non-cacheable
attributes. I guess this is mostly orthogonal, but I'm sure it would be
much easier to reason about correctness if it is guaranteed that no
mappings with mismatched attributes exist anywhere.

> > > Fair enough, but I'd still like to see some numbers. If they're compelling,
> > > then we could explore something like CONFIG_OF_DMA_DEFAULT_COHERENT, but
> > > that doesn't really help the kconfig maze :(
>
> I'd prefer not to have a config option, we could easily break single
> Image at some point.
>
> > Could we make this a runtime thing? E.g., remap the entire linear
> > region down to pages under stop_machine() the first time we probe a
> > device that uses non-coherent DMA?
>
> That could be pretty expensive at run-time. With the ARMv8.4-TTRem
> feature, I wonder whether we could do this lazily when allocating
> non-coherent DMA buffers.
>
> (I still hope there isn't a problem at all with this mismatch ;)).
>

Now that we have the pieces to easily remap the linear region down to
pages, and [apparently] some generic infrastructure to manage the
linear aliases, the only downside is the alleged performance hit
resulting from increased TLB pressure. This is obviously highly
micro-architecture dependent, but with Xgene1 and ThunderX1 out of the
picture, I wonder if the tradeoffs are different now. Maybe by now, we
should just suck it up. (Note that we had no complaints afaik regarding
the fact that we map the linear map down to pages by default now.)
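(To put rough numbers on the TLB-pressure trade-off discussed above, here
is a small stand-alone C sketch; the 16 GiB RAM figure is an arbitrary
example, not a measurement from this thread.)

/*
 * Rough illustration of why block mappings in the linear map are
 * attractive: last-level translation table entries needed to map an
 * example amount of RAM with a 4 KiB granule, using 2 MiB block
 * entries versus 4 KiB page entries.
 */
#include <stdio.h>

int main(void)
{
        unsigned long long ram   = 16ULL << 30;   /* example: 16 GiB */
        unsigned long long block = 2ULL << 20;    /* 2 MiB level 2 block */
        unsigned long long page  = 4ULL << 10;    /* 4 KiB level 3 page */

        printf("2 MiB blocks: %llu entries\n", ram / block);  /* 8192 */
        printf("4 KiB pages : %llu entries\n", ram / page);   /* 4194304 */
        /*
         * With the contiguous bit (16 adjacent level 3 entries sharing
         * one TLB entry on a 4 KiB granule), the effective TLB footprint
         * of the page-mapped case shrinks by 16x.
         */
        return 0;
}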
On Fri, Apr 03, 2020 at 10:58:51AM +0200, Ard Biesheuvel wrote:
> On Thu, 2 Apr 2020 at 13:30, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Mar 30, 2020 at 04:32:31PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 30 Mar 2020 at 16:28, Will Deacon <will@kernel.org> wrote:
> > > > Fair enough, but I'd still like to see some numbers. If they're compelling,
> > > > then we could explore something like CONFIG_OF_DMA_DEFAULT_COHERENT, but
> > > > that doesn't really help the kconfig maze :(
> >
> > I'd prefer not to have a config option, we could easily break single
> > Image at some point.
> >
> > > Could we make this a runtime thing? E.g., remap the entire linear
> > > region down to pages under stop_machine() the first time we probe a
> > > device that uses non-coherent DMA?
> >
> > That could be pretty expensive at run-time. With the ARMv8.4-TTRem
> > feature, I wonder whether we could do this lazily when allocating
> > non-coherent DMA buffers.
> >
> > (I still hope there isn't a problem at all with this mismatch ;)).
>
> Now that we have the pieces to easily remap the linear region down to
> pages, and [apparently] some generic infrastructure to manage the
> linear aliases, the only downside is the alleged performance hit
> resulting from increased TLB pressure. This is obviously highly
> micro-architecture dependent, but with Xgene1 and ThunderX1 out of the
> picture, I wonder if the tradeoffs are different now. Maybe by now, we
> should just suck it up (Note that we had no complaints afaik regarding
> the fact that we map the linear map down to pages by default now)

I'd be in favour of that fwiw.

Catalin -- did you get anything back from the architects about the cache
hit behaviour?

Will
On Tue, May 05, 2020 at 11:44:06AM +0100, Will Deacon wrote:
> Catalin -- did you get anything back from the architects about the cache
> hit behaviour?

Any read from a non-cacheable alias would be coherent with writes using
the same non-cacheable attributes, irrespective of other cacheable aliases
(of course, subject to the cache lines having been previously
cleaned/invalidated to avoid dirty line evictions). So as long as the
hardware works as per the ARM ARM (B2.8), we don't need to unmap the
non-cacheable DMA buffers from the linear map.
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index 1c906d932d6b..a1efa246c9ed 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -52,19 +52,6 @@ config DEBUG_WX
 
 	  If in doubt, say "Y".
 
-config DEBUG_ALIGN_RODATA
-	depends on STRICT_KERNEL_RWX
-	bool "Align linker sections up to SECTION_SIZE"
-	help
-	  If this option is enabled, sections that may potentially be marked as
-	  read only or non-executable will be aligned up to the section size of
-	  the kernel. This prevents sections from being split into pages and
-	  avoids a potential TLB penalty. The downside is an increase in
-	  alignment and potentially wasted space. Turn on this option if
-	  performance is more important than memory pressure.
-
-	  If in doubt, say N.
-
 config DEBUG_EFI
 	depends on EFI && DEBUG_INFO
 	bool "UEFI debugging"
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 4d94676e5a8b..3b34f7bde2f2 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -119,22 +119,12 @@
 
 /*
  * Alignment of kernel segments (e.g. .text, .data).
- */
-#if defined(CONFIG_DEBUG_ALIGN_RODATA)
-/*
- *  4 KB granule:   1 level 2 entry
- * 16 KB granule: 128 level 3 entries, with contiguous bit
- * 64 KB granule:  32 level 3 entries, with contiguous bit
- */
-#define SEGMENT_ALIGN		SZ_2M
-#else
-/*
+ *
  *  4 KB granule:  16 level 3 entries, with contiguous bit
  * 16 KB granule:   4 level 3 entries, without contiguous bit
  * 64 KB granule:   1 level 3 entry
  */
 #define SEGMENT_ALIGN		SZ_64K
-#endif
 
 /*
  * Memory types available.
diff --git a/drivers/firmware/efi/libstub/arm64-stub.c b/drivers/firmware/efi/libstub/arm64-stub.c
index db0c1a9c1699..fc9f8ab533a7 100644
--- a/drivers/firmware/efi/libstub/arm64-stub.c
+++ b/drivers/firmware/efi/libstub/arm64-stub.c
@@ -75,14 +75,12 @@ efi_status_t handle_kernel_image(unsigned long *image_addr,
 
 	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && phys_seed != 0) {
 		/*
-		 * If CONFIG_DEBUG_ALIGN_RODATA is not set, produce a
-		 * displacement in the interval [0, MIN_KIMG_ALIGN) that
-		 * doesn't violate this kernel's de-facto alignment
+		 * Produce a displacement in the interval [0, MIN_KIMG_ALIGN)
+		 * that doesn't violate this kernel's de-facto alignment
 		 * constraints.
 		 */
 		u32 mask = (MIN_KIMG_ALIGN - 1) & ~(EFI_KIMG_ALIGN - 1);
-		u32 offset = !IS_ENABLED(CONFIG_DEBUG_ALIGN_RODATA) ?
-			     (phys_seed >> 32) & mask : TEXT_OFFSET;
+		u32 offset = (phys_seed >> 32) & mask;
 
 		/*
 		 * With CONFIG_RANDOMIZE_TEXT_OFFSET=y, TEXT_OFFSET may not
When CONFIG_DEBUG_ALIGN_RODATA is enabled, kernel segments mapped with
different permissions (r-x for .text, r-- for .rodata, rw- for .data,
etc) are rounded up to 2 MiB so they can be mapped more efficiently.
In particular, it permits the segments to be mapped using level 2
block entries when using 4k pages, which is expected to result in less
TLB pressure.

However, the mappings for the bulk of the kernel will use level 2
entries anyway, and the misaligned fringes are organized such that they
can take advantage of the contiguous bit, and use far fewer level 3
entries than would be needed otherwise.

This makes the value of this feature dubious at best, and since it is not
enabled in defconfig or in the distro configs, it does not appear to be
in wide use either. So let's just remove it.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/Kconfig.debug                  | 13 -------------
 arch/arm64/include/asm/memory.h           | 12 +-----------
 drivers/firmware/efi/libstub/arm64-stub.c |  8 +++-----
 3 files changed, 4 insertions(+), 29 deletions(-)
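(For reference, the arithmetic behind the comment retained in memory.h can
be checked with the stand-alone C sketch below; the per-granule
contiguous-range sizes are as implied by the comments added and removed by
this patch: 16 level 3 entries for the 4 KB granule, 128 for 16 KB, and 32
for 64 KB.)

/*
 * Sketch of the arithmetic behind the SEGMENT_ALIGN comment kept in
 * memory.h: how many level 3 entries a 64 KiB segment fringe needs for
 * each granule size, and whether they can use the contiguous bit.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long fringe = 64 * 1024;  /* SEGMENT_ALIGN = SZ_64K */
        const struct {
                const char *name;
                unsigned long page;  /* granule size */
                unsigned int cont;   /* level 3 entries per contiguous range */
        } g[] = {
                { " 4 KB granule",  4 * 1024,  16 },
                { "16 KB granule", 16 * 1024, 128 },
                { "64 KB granule", 64 * 1024,  32 },
        };

        for (unsigned int i = 0; i < 3; i++) {
                unsigned long entries = fringe / g[i].page;

                printf("%s: %2lu level 3 entries, %s contiguous bit\n",
                       g[i].name, entries,
                       entries >= g[i].cont ? "with" : "without");
        }
        return 0;
}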