| Message ID | 20200605222257.44882-1-saravanak@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [v1] arm64/module: Optimize module load time by optimizing PLT counting |
On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> When loading a module, module_frob_arch_sections() tries to figure out
> the number of PLTs that'll be needed to handle all the RELAs. While
> doing this, it tries to dedupe PLT allocations for multiple
> R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> R_AARCH64_JUMP26 relocations too.
>
> To make checks for duplicates easier/faster, it sorts the relocation
> list by type, symbol and addend. That way, to check for a duplicate
> relocation, it just needs to compare with the previous entry.
>
> However, sorting the entire relocation array is unnecessary and
> expensive (O(n log n)) because there are a lot of other relocation types
> that don't need deduping or can't be deduped.
>
> So this commit partitions the array into entries that need deduping and
> those that don't. And then sorts just the part that needs deduping. And
> when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> if it's disabled.
>
> This gives significant reduction in module load time for modules with
> large number of relocations with no measurable impact on modules with a
> small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> enabled, the load time for one module went down from 268ms to 100ms.
> Another module went down from 143ms to 83ms.

Whilst I can see that's a significant relative saving, what proportion of
actual boot time are we talking about here? It would be interesting to
know if there are bigger potential savings elsewhere.

> This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> disabled because it looks like PLTs are not allocated for
> R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> ---
>  arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> index 65b08a74aec6..bf5118b3b828 100644
> --- a/arch/arm64/kernel/module-plts.c
> +++ b/arch/arm64/kernel/module-plts.c
> @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
>       return ret;
>  }
>
> +static bool rela_needs_dedup(Elf64_Rela *rela)
> +{
> +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> +}

Does this handle A53 erratum 843419 correctly? I'm worried that we skip
the ADRP PLTs there.

> +
> +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> +{
> +     int i = 0, j = numrels - 1;
> +     Elf64_Rela t;
> +
> +     while (i < j) {
> +             while (rela_needs_dedup(rela + i) && i < j)
> +                     i++;
> +             while (!rela_needs_dedup(rela + j) && i < j)
> +                     j--;
> +             if (i < j) {
> +                     t = *(rela + j);
> +                     *(rela + j) = *(rela + i);
> +                     *(rela + i) = t;
> +             }
> +     }

This is very hard to read and I think some of the 'i < j' comparisons are
redundant. Would it make more sense to assign a temporary rather than
post-inc/decrement and recheck?

Will
On Tue, Jun 16, 2020 at 2:40 PM Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations too.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, the load time for one module went down from 268ms to 100ms.
> > Another module went down from 143ms to 83ms.
>
> Whilst I can see that's a significant relative saving, what proportion of
> actual boot time are we talking about here? It would be interesting to
> know if there are bigger potential savings elsewhere.

100ms is pretty big in terms of boot time from a phone OEM perspective. It
adds up. And for these two modules I profiled, it adds up to 228 ms. That's
quite a lot. Also, if you look at just the kernel init time and all the
module load time, it comes to around 2.2 seconds on the hardware I'm
testing on. That's a 10% reduction in the "kernel init" time. Also, this
2.2 seconds is without async probing. If we do that, this becomes an even
larger % of the kernel init time.

> > This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> > disabled because it looks like PLTs are not allocated for
> > R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >  arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..bf5118b3b828 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> >       return ret;
> >  }
> >
> > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > +{
> > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
>
> Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> the ADRP PLTs there.

So this isn't really skipping any PLTs or RELAs. It's just picking the
RELAs that can benefit from deduping using sort.

Even if I delete the entire sort() call, all the modules load and the
device seems to be working fine (video streaming over WiFi works). The
catch there is that the module takes up more memory (because the PLTs are
duplicated often) and you might potentially lose some caching benefits for
PLTs that have a significant amount of repetition (say, printk or something
like that).

You'll see that the duplicate_rel() call isn't even called in any code
related to that erratum. So, I think this is safe.

> > +
> > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > +{
> > +     int i = 0, j = numrels - 1;
> > +     Elf64_Rela t;
> > +
> > +     while (i < j) {
> > +             while (rela_needs_dedup(rela + i) && i < j)
> > +                     i++;
> > +             while (!rela_needs_dedup(rela + j) && i < j)
> > +                     j--;
> > +             if (i < j) {
> > +                     t = *(rela + j);
> > +                     *(rela + j) = *(rela + i);
> > +                     *(rela + i) = t;
> > +             }
> > +     }
>
> This is very hard to read and I think some of the 'i < j' comparisons are
> redundant.

I have no strong opinion on how this function is written. It's trivial
partitioning code. I'll be happy to copy-paste any cleanup anyone wants to
suggest. I think all the i < j checks are necessary for my current
implementation.

> Would it make more sense to assign a temporary rather than
> post-inc/decrement and recheck?

I'm not sure I get what you are trying to say.

-Saravana
On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations too.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, the load time for one module went down from 268ms to 100ms.
> > Another module went down from 143ms to 83ms.
>
> Whilst I can see that's a significant relative saving, what proportion of
> actual boot time are we talking about here? It would be interesting to
> know if there are bigger potential savings elsewhere.
>

Also, 'some module' vs 'some other module' doesn't really say
anything. Please explain which modules and their sizes.

> > This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> > disabled because it looks like PLTs are not allocated for
> > R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >  arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..bf5118b3b828 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> >       return ret;
> >  }
> >
> > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > +{
> > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
>

Would it help to check the section index here as well? Call/jump
instructions within a section are never sent through a PLT entry.

> Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> the ADRP PLTs there.
>

ADRP PLTs cannot be deduplicated, as they incorporate a relative jump
back to the caller.

> > +
> > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > +{
> > +     int i = 0, j = numrels - 1;
> > +     Elf64_Rela t;
> > +
> > +     while (i < j) {
> > +             while (rela_needs_dedup(rela + i) && i < j)
> > +                     i++;
> > +             while (!rela_needs_dedup(rela + j) && i < j)
> > +                     j--;
> > +             if (i < j) {
> > +                     t = *(rela + j);
> > +                     *(rela + j) = *(rela + i);
> > +                     *(rela + i) = t;
> > +             }
> > +     }
>
> This is very hard to read and I think some of the 'i < j' comparisons are
> redundant. Would it make more sense to assign a temporary rather than
> post-inc/decrement and recheck?
>

Agreed.

Also, what's wrong with [] array indexing?
Hi all,

On Wed, Jun 17, 2020 at 10:17:33AM +0200, Ard Biesheuvel wrote:
> On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@kernel.org> wrote:
> > On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > > This gives significant reduction in module load time for modules with
> > > large number of relocations with no measurable impact on modules with a
> > > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > > enabled, the load time for one module went down from 268ms to 100ms.
> > > Another module went down from 143ms to 83ms.
> >
> > Whilst I can see that's a significant relative saving, what proportion of
> > actual boot time are we talking about here? It would be interesting to
> > know if there are bigger potential savings elsewhere.
> >
>
> Also, 'some module' vs 'some other module' doesn't really say
> anything. Please explain which modules and their sizes.

I suspect these are all out-of-tree modules, but yes, some metadata such as
sizes, nr of relocs etc. would be good to have in the commit message.

> > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > > index 65b08a74aec6..bf5118b3b828 100644
> > > --- a/arch/arm64/kernel/module-plts.c
> > > +++ b/arch/arm64/kernel/module-plts.c
> > > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > >       return ret;
> > >  }
> > >
> > > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > > +{
> > > +     return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > > +            || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > > +}
> >
> Would it help to check the section index here as well? Call/jump
> instructions within a section are never sent through a PLT entry.

(I tried hacking this in below)

> > Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> > the ADRP PLTs there.
> >
>
> ADRP PLTs cannot be deduplicated, as they incorporate a relative jump
> back to the caller.

Duh yes, thanks. We can't trash the link register here.

> > > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > > +{
> > > +     int i = 0, j = numrels - 1;
> > > +     Elf64_Rela t;
> > > +
> > > +     while (i < j) {
> > > +             while (rela_needs_dedup(rela + i) && i < j)
> > > +                     i++;
> > > +             while (!rela_needs_dedup(rela + j) && i < j)
> > > +                     j--;
> > > +             if (i < j) {
> > > +                     t = *(rela + j);
> > > +                     *(rela + j) = *(rela + i);
> > > +                     *(rela + i) = t;
> > > +             }
> > > +     }
> >
> > This is very hard to read and I think some of the 'i < j' comparisons are
> > redundant. Would it make more sense to assign a temporary rather than
> > post-inc/decrement and recheck?
> >
>
> Agreed.
>
> Also, what's wrong with [] array indexing?

Saravana, since our stylistic objections are reasonably vague, I tried
to clean this up so you can get an idea of how I'd prefer it to look (can't
speak for Ard). I haven't tried running this, but please feel free to adapt
it. Replacement diff below.

Will

--->8

diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index 65b08a74aec6..204290314054 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -253,6 +253,39 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
        return ret;
 }

+static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
+                                  Elf64_Word dstidx)
+{
+
+       Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
+
+       if (s->st_shndx == dstidx)
+               return false;
+
+       return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
+              ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
+}
+
+static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
+                                      int numrels, Elf64_Word dstidx)
+{
+       int i = 0, j = numrels - 1;
+
+       if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+               return 0;
+
+       while (i < j) {
+               if (branch_rela_needs_plt(syms, &rela[i], dstidx))
+                       i++;
+               else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
+                       swap(rela[i], rela[j]);
+               else
+                       j--;
+       }
+
+       return i;
+}
+
 int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
                               char *secstrings, struct module *mod)
 {
@@ -290,7 +323,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,

        for (i = 0; i < ehdr->e_shnum; i++) {
                Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
-               int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+               int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
                Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;

                if (sechdrs[i].sh_type != SHT_RELA)
@@ -300,8 +333,14 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
                if (!(dstsec->sh_flags & SHF_EXECINSTR))
                        continue;

-               /* sort by type, symbol index and addend */
-               sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+               /*
+                * sort branch relocations requiring a PLT by type, symbol index
+                * and addend
+                */
+               nents = partition_branch_plt_relas(syms, rels, numrels,
+                                                  sechdrs[i].sh_info);
+               if (nents)
+                       sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL);

                if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
                        core_plts += count_plts(syms, rels, numrels,
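A quick way to see what the partition in the diff above does is a small
standalone userspace mock. Everything below — the simplified needs-PLT test,
the toy relocation array and the expected output — is made up purely for
illustration; it mirrors the loop structure of the (untested) sketch, not the
real module loader, and omits the symbol-table/section-index check.

```c
/* Toy userspace harness mirroring the partition loop above (illustration only). */
#include <elf.h>
#include <stdbool.h>
#include <stdio.h>

#define SWAP(a, b) do { Elf64_Rela __t = (a); (a) = (b); (b) = __t; } while (0)

/* Simplified stand-in for branch_rela_needs_plt(): relocation type check only. */
static bool needs_plt(const Elf64_Rela *r)
{
	return ELF64_R_TYPE(r->r_info) == R_AARCH64_JUMP26 ||
	       ELF64_R_TYPE(r->r_info) == R_AARCH64_CALL26;
}

/* Two-index partition: relas that need a PLT are moved to the front. */
static int partition(Elf64_Rela *rela, int numrels)
{
	int i = 0, j = numrels - 1;

	while (i < j) {
		if (needs_plt(&rela[i]))
			i++;
		else if (needs_plt(&rela[j]))
			SWAP(rela[i], rela[j]);
		else
			j--;
	}
	return i;
}

int main(void)
{
	/* Made-up relocations: two branch relocs mixed in with two others. */
	Elf64_Rela relas[] = {
		{ .r_info = ELF64_R_INFO(1, R_AARCH64_ABS64) },
		{ .r_info = ELF64_R_INFO(2, R_AARCH64_CALL26) },
		{ .r_info = ELF64_R_INFO(3, R_AARCH64_ADR_PREL_PG_HI21) },
		{ .r_info = ELF64_R_INFO(2, R_AARCH64_JUMP26) },
	};
	int n = partition(relas, 4);

	/* The two branch relocations end up grouped at the start; prints 2. */
	printf("%d relocations selected for sorting/dedup\n", n);
	return 0;
}
```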
On Wed, Jun 17, 2020 at 7:05 AM Will Deacon <will@kernel.org> wrote: > > Hi all, > > On Wed, Jun 17, 2020 at 10:17:33AM +0200, Ard Biesheuvel wrote: > > On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@kernel.org> wrote: > > > On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote: > > > > This gives significant reduction in module load time for modules with > > > > large number of relocations with no measurable impact on modules with a > > > > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE > > > > enabled, the load time for one module went down from 268ms to 100ms. > > > > Another module went down from 143ms to 83ms. > > > > > > Whilst I can see that's a significant relative saving, what proportion of > > > actual boot time are we talking about here? It would be interesting to > > > know if there are bigger potential savings elsewhere. > > > > > > > Also, 'some module' vs 'some other module' doesn't really say > > anything. Please explain which modules and their sizes. > > I suspect these are all out-of-tree modules, but yes, some metadata such as > sizes, nr or relocs etc would be good to have in the commit message. > > > > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c > > > > index 65b08a74aec6..bf5118b3b828 100644 > > > > --- a/arch/arm64/kernel/module-plts.c > > > > +++ b/arch/arm64/kernel/module-plts.c > > > > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num, > > > > return ret; > > > > } > > > > > > > > +static bool rela_needs_dedup(Elf64_Rela *rela) > > > > +{ > > > > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 > > > > + || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26; > > > > +} > > > > > > > Would it help to check the section index here as well? Call/jump > > instructions within a section are never sent through a PLT entry. > > (I tried hacking this in below) > > > > Does this handle A53 erratum 843419 correctly? I'm worried that we skip > > > the ADRP PLTs there. > > > > > > > ADRP PLTs cannot be deduplicated, as they incorporate a relative jump > > back to the caller. > > Duh yes, thanks. We can't trash the link register here. > > > > > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */ > > > > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels) > > > > +{ > > > > + int i = 0, j = numrels - 1; > > > > + Elf64_Rela t; > > > > + > > > > + while (i < j) { > > > > + while (rela_needs_dedup(rela + i) && i < j) > > > > + i++; > > > > + while (!rela_needs_dedup(rela + j) && i < j) > > > > + j--; > > > > + if (i < j) { > > > > + t = *(rela + j); > > > > + *(rela + j) = *(rela + i); > > > > + *(rela + i) = t; > > > > + } > > > > + } > > > > > > This is very hard to read and I think some of the 'i < j' comparisons are > > > redundant. Would it make more sense to assign a temporary rather than > > > post-inc/decrement and recheck? > > > > > > > Agreed. > > > > Also, what's wrong with [] array indexing? > > Saravana, since our stylistic objections are reasonably vague, I tried > to clean this up so you can get an idea of how I'd prefer it to look (can't > speak for Ard). I haven't tried running this, but please feel free to adapt > it. Replacement diff below. 
> > Will > > --->8 > > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c > index 65b08a74aec6..204290314054 100644 > --- a/arch/arm64/kernel/module-plts.c > +++ b/arch/arm64/kernel/module-plts.c > @@ -253,6 +253,39 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num, > return ret; > } > > +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela, > + Elf64_Word dstidx) > +{ > + > + Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info); > + > + if (s->st_shndx == dstidx) > + return false; > + > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 || > + ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26; > +} > + > +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela, > + int numrels, Elf64_Word dstidx) > +{ > + int i = 0, j = numrels - 1; > + > + if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE)) > + return 0; > + > + while (i < j) { > + if (branch_rela_needs_plt(syms, &rela[i], dstidx)) > + i++; > + else if (branch_rela_needs_plt(syms, &rela[j], dstidx)) > + swap(rela[i], rela[j]); > + else > + j--; > + } > + > + return i; > +} > + > int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs, > char *secstrings, struct module *mod) > { > @@ -290,7 +323,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs, > > for (i = 0; i < ehdr->e_shnum; i++) { > Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset; > - int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela); > + int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela); > Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info; > > if (sechdrs[i].sh_type != SHT_RELA) > @@ -300,8 +333,14 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs, > if (!(dstsec->sh_flags & SHF_EXECINSTR)) > continue; > > - /* sort by type, symbol index and addend */ > - sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL); > + /* > + * sort branch relocations requiring a PLT by type, symbol index > + * and addend > + */ > + nents = partition_branch_plt_relas(syms, rels, numrels, > + sechdrs[i].sh_info); > + if (nents) > + sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL); > > if (!str_has_prefix(secstrings + dstsec->sh_name, ".init")) > core_plts += count_plts(syms, rels, numrels, Thanks Will & Ard. I'll incorporate your feedback and send a v2 within a few days. -Saravana
diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index 65b08a74aec6..bf5118b3b828 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
        return ret;
 }

+static bool rela_needs_dedup(Elf64_Rela *rela)
+{
+       return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
+              || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
+}
+
+/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
+static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
+{
+       int i = 0, j = numrels - 1;
+       Elf64_Rela t;
+
+       while (i < j) {
+               while (rela_needs_dedup(rela + i) && i < j)
+                       i++;
+               while (!rela_needs_dedup(rela + j) && i < j)
+                       j--;
+               if (i < j) {
+                       t = *(rela + j);
+                       *(rela + j) = *(rela + i);
+                       *(rela + i) = t;
+               }
+       }
+       /* If the entire array needs dedup, make sure i == numrels */
+       if (rela_needs_dedup(rela + i))
+               i++;
+
+       return i;
+}
+
 int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
                               char *secstrings, struct module *mod)
 {
@@ -291,6 +321,7 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
        for (i = 0; i < ehdr->e_shnum; i++) {
                Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
                int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+               int num_dedup_rels = 0;
                Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;

                if (sechdrs[i].sh_type != SHT_RELA)
@@ -300,8 +331,12 @@ int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
                if (!(dstsec->sh_flags & SHF_EXECINSTR))
                        continue;

+               if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+                       num_dedup_rels = partition_dedup_relas(rels, numrels);
                /* sort by type, symbol index and addend */
-               sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+               if (num_dedup_rels)
+                       sort(rels, num_dedup_rels, sizeof(Elf64_Rela),
+                            cmp_rela, NULL);

                if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
                        core_plts += count_plts(syms, rels, numrels,
When loading a module, module_frob_arch_sections() tries to figure out the
number of PLTs that'll be needed to handle all the RELAs. While doing this,
it tries to dedupe PLT allocations for multiple R_AARCH64_CALL26 relocations
to the same symbol. It does the same for R_AARCH64_JUMP26 relocations too.

To make checks for duplicates easier/faster, it sorts the relocation list by
type, symbol and addend. That way, to check for a duplicate relocation, it
just needs to compare with the previous entry.

However, sorting the entire relocation array is unnecessary and expensive
(O(n log n)) because there are a lot of other relocation types that don't
need deduping or can't be deduped.

So this commit partitions the array into entries that need deduping and those
that don't, and then sorts just the part that needs deduping. When
CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely, because
PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26 in that
case.

This gives a significant reduction in module load time for modules with a
large number of relocations, with no measurable impact on modules with a
small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
enabled, the load time for one module went down from 268ms to 100ms, and for
another module from 143ms to 83ms.

Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Saravana Kannan <saravanak@google.com>
---
 arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)
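For readers unfamiliar with the dedup scheme the commit message relies on:
once the branch relocations are sorted by type, symbol index and addend,
spotting a duplicate only requires comparing a relocation with its immediate
predecessor. The helper below is an illustrative sketch of that check; the
function name and exact shape are assumptions, not the kernel's actual
duplicate_rel()/count_plts() code.

```c
/*
 * Illustrative sketch only (not the kernel implementation): after sorting,
 * a CALL26/JUMP26 relocation needs a new PLT slot only if it differs from
 * its predecessor in type, symbol index or addend.
 */
static bool is_dup_of_previous(const Elf64_Rela *rela, int i)
{
	if (i == 0)
		return false;	/* the first entry can never be a duplicate */

	return ELF64_R_TYPE(rela[i].r_info) == ELF64_R_TYPE(rela[i - 1].r_info) &&
	       ELF64_R_SYM(rela[i].r_info)  == ELF64_R_SYM(rela[i - 1].r_info) &&
	       rela[i].r_addend             == rela[i - 1].r_addend;
}
```

With the array partitioned as in the patch, this adjacent-entry check is only
guaranteed to catch duplicates within the sorted front portion; anything left
unsorted simply gets its own PLT, which costs a little memory but not
correctness.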