Message ID | 20240815150339.1564769-4-kris.van.hees@oracle.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Generate address range data for built-in modules | expand |
On Fri, Aug 16, 2024 at 12:04 AM Kris Van Hees <kris.van.hees@oracle.com> wrote: > > The modules.builtin.ranges offset range data for builtin modules is > generated at compile time based on the list of built-in modules and > the vmlinux.map and vmlinux.o.map linker maps. This data can be used > to determine whether a symbol at a particular address belongs to > module code that was configured to be compiled into the kernel proper > as a built-in module (rather than as a standalone module). > > This patch adds a script that uses the generated modules.builtin.ranges > data to annotate the symbols in the System.map with module names if > their address falls within a range that belongs to one or more built-in > modules. > > It then processes the vmlinux.map (and if needed, vmlinux.o.map) to > verify the annotation: > > - For each top-level section: > - For each object in the section: > - Determine whether the object is part of a built-in module > (using modules.builtin and the .*.cmd file used to compile > the object as suggested in [0]) > - For each symbol in that object, verify that the built-in > module association (or lack thereof) matches the annotation > given to the symbol. > > Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> > Reviewed-by: Nick Alcock <nick.alcock@oracle.com> > Reviewed-by: Alan Maguire <alan.maguire@oracle.com> > --- > Changes since v5: > - Added optional 6th argument to specify kernel build directory. > - Report error and exit if .*.o.cmd files cannot be read. > > Changes since v4: > - New patch in the series > --- > scripts/verify_builtin_ranges.awk | 365 ++++++++++++++++++++++++++++++ > 1 file changed, 365 insertions(+) > create mode 100755 scripts/verify_builtin_ranges.awk > > diff --git a/scripts/verify_builtin_ranges.awk b/scripts/verify_builtin_ranges.awk > new file mode 100755 > index 000000000000..b82cf0a0fbeb > --- /dev/null > +++ b/scripts/verify_builtin_ranges.awk > @@ -0,0 +1,365 @@ > +#!/usr/bin/gawk -f > +# SPDX-License-Identifier: GPL-2.0 > +# verify_builtin_ranges.awk: Verify address range data for builtin modules > +# Written by Kris Van Hees <kris.van.hees@oracle.com> > +# > +# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \ > +# modules.builtin vmlinux.map vmlinux.o.map \ > +# [ <build-dir> ] > +# > + > +# Return the module name(s) (if any) associated with the given object. > +# > +# If we have seen this object before, return information from the cache. > +# Otherwise, retrieve it from the corresponding .cmd file. > +# > +function get_module_info(fn, mod, obj, mfn, s) { > + if (fn in omod) > + return omod[fn]; > + > + if (match(fn, /\/[^/]+$/) == 0) > + return ""; > + > + obj = fn; > + mod = ""; > + mfn = ""; > + fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd"; > + if (getline s <fn == 1) { > + if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) { > + mfn = substr(s, RSTART + 16, RLENGTH - 16); > + gsub(/['"]/, "", mfn); > + > + mod = mfn; > + gsub(/([^/ ]*\/)+/, "", mod); > + gsub(/-/, "_", mod); > + } > + } else { > + print "ERROR: Failed to read: " fn "\n\n" \ > + " Invalid kernel build directory (" kdir ")\n" \ > + " or its content does not match " ARGV[1] >"/dev/stderr"; > + close(fn); > + total = 0; > + exit(1); > + } > + close(fn); > + > + # A single module (common case) also reflects objects that are not part > + # of a module. Some of those objects have names that are also a module > + # name (e.g. core). We check the associated module file name, and if > + # they do not match, the object is not part of a module. > + if (mod !~ / /) { > + if (!(mod in mods)) > + return ""; > + if (mods[mod] != mfn) > + return ""; > + } > + > + # At this point, mod is a single (valid) module name, or a list of > + # module names (that do not need validation). > + omod[obj] = mod; > + close(fn); > + > + return mod; > +} > This code is copy-paste from scripts/generate_builtin_ranges.awk So, my comments in 2/4 can apply to this patch, too. Instead of adding a separate script, we could add a "verify mode" option. scripts/generate_builtin_ranges.awk --verify ... But, I do not know how much cleaner it will become. I am not good at reviewing AWK code, but this is how you go. If this script were written in Python, it would be easy and readable to split logically-related code chunks into functions, as follows: def parse_module_builtin(): ... def parse_vmlinux_map_lld(): ... def parse_vmlinux_map_bfd(): ... def parse_vmlinux_o_map(): ... -- Best Regards Masahiro Yamada
On Sun, Aug 18, 2024 at 03:40:36PM +0900, Masahiro Yamada wrote: > On Fri, Aug 16, 2024 at 12:04???AM Kris Van Hees <kris.van.hees@oracle.com> wrote: > > > > The modules.builtin.ranges offset range data for builtin modules is > > generated at compile time based on the list of built-in modules and > > the vmlinux.map and vmlinux.o.map linker maps. This data can be used > > to determine whether a symbol at a particular address belongs to > > module code that was configured to be compiled into the kernel proper > > as a built-in module (rather than as a standalone module). > > > > This patch adds a script that uses the generated modules.builtin.ranges > > data to annotate the symbols in the System.map with module names if > > their address falls within a range that belongs to one or more built-in > > modules. > > > > It then processes the vmlinux.map (and if needed, vmlinux.o.map) to > > verify the annotation: > > > > - For each top-level section: > > - For each object in the section: > > - Determine whether the object is part of a built-in module > > (using modules.builtin and the .*.cmd file used to compile > > the object as suggested in [0]) > > - For each symbol in that object, verify that the built-in > > module association (or lack thereof) matches the annotation > > given to the symbol. > > > > Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> > > Reviewed-by: Nick Alcock <nick.alcock@oracle.com> > > Reviewed-by: Alan Maguire <alan.maguire@oracle.com> > > --- > > Changes since v5: > > - Added optional 6th argument to specify kernel build directory. > > - Report error and exit if .*.o.cmd files cannot be read. > > > > Changes since v4: > > - New patch in the series > > --- > > scripts/verify_builtin_ranges.awk | 365 ++++++++++++++++++++++++++++++ > > 1 file changed, 365 insertions(+) > > create mode 100755 scripts/verify_builtin_ranges.awk > > > > diff --git a/scripts/verify_builtin_ranges.awk b/scripts/verify_builtin_ranges.awk > > new file mode 100755 > > index 000000000000..b82cf0a0fbeb > > --- /dev/null > > +++ b/scripts/verify_builtin_ranges.awk > > @@ -0,0 +1,365 @@ > > +#!/usr/bin/gawk -f > > +# SPDX-License-Identifier: GPL-2.0 > > +# verify_builtin_ranges.awk: Verify address range data for builtin modules > > +# Written by Kris Van Hees <kris.van.hees@oracle.com> > > +# > > +# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \ > > +# modules.builtin vmlinux.map vmlinux.o.map \ > > +# [ <build-dir> ] > > +# > > + > > +# Return the module name(s) (if any) associated with the given object. > > +# > > +# If we have seen this object before, return information from the cache. > > +# Otherwise, retrieve it from the corresponding .cmd file. > > +# > > +function get_module_info(fn, mod, obj, mfn, s) { > > + if (fn in omod) > > + return omod[fn]; > > + > > + if (match(fn, /\/[^/]+$/) == 0) > > + return ""; > > + > > + obj = fn; > > + mod = ""; > > + mfn = ""; > > + fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd"; > > + if (getline s <fn == 1) { > > + if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) { > > + mfn = substr(s, RSTART + 16, RLENGTH - 16); > > + gsub(/['"]/, "", mfn); > > + > > + mod = mfn; > > + gsub(/([^/ ]*\/)+/, "", mod); > > + gsub(/-/, "_", mod); > > + } > > + } else { > > + print "ERROR: Failed to read: " fn "\n\n" \ > > + " Invalid kernel build directory (" kdir ")\n" \ > > + " or its content does not match " ARGV[1] >"/dev/stderr"; > > + close(fn); > > + total = 0; > > + exit(1); > > + } > > + close(fn); > > + > > + # A single module (common case) also reflects objects that are not part > > + # of a module. Some of those objects have names that are also a module > > + # name (e.g. core). We check the associated module file name, and if > > + # they do not match, the object is not part of a module. > > + if (mod !~ / /) { > > + if (!(mod in mods)) > > + return ""; > > + if (mods[mod] != mfn) > > + return ""; > > + } > > + > > + # At this point, mod is a single (valid) module name, or a list of > > + # module names (that do not need validation). > > + omod[obj] = mod; > > + close(fn); > > + > > + return mod; > > +} > > > > > > This code is copy-paste from scripts/generate_builtin_ranges.awk > So, my comments in 2/4 can apply to this patch, too. Yes, and I will apply the same changes to the verifier script. > Instead of adding a separate script, > we could add a "verify mode" option. > > > scripts/generate_builtin_ranges.awk --verify ... > > > But, I do not know how much cleaner it will become. > > I am not good at reviewing AWK code, but this > is how you go. I think that adding the verifier functionality into the generator script would make things more complex. The nature of AWK is that it works best in terms of a liner walk of the content of the files it is given as input. Since the generator and verifier scripts have different inputs (and especially since the first couple of files that they need to process first differ), the script would become more complex. The fact that different actions are taken on the same input records between the generator and verifier also complicates matters. And keeping them separate makes it easier to optimize the scripts. > If this script were written in Python, > it would be easy and readable to > split logically-related code chunks into functions, > as follows: > > > def parse_module_builtin(): > ... > > > def parse_vmlinux_map_lld(): > ... > > > def parse_vmlinux_map_bfd(): > ... > > > def parse_vmlinux_o_map(): > ... I am much better with AWK than with Python, so I went with AWK. This is a task it is remarkably well suited for given that a simple linear read through the files accomplishes almost everything that is needed. Thank you for bearing with me on doing this with AWK. Kris
diff --git a/scripts/verify_builtin_ranges.awk b/scripts/verify_builtin_ranges.awk new file mode 100755 index 000000000000..b82cf0a0fbeb --- /dev/null +++ b/scripts/verify_builtin_ranges.awk @@ -0,0 +1,365 @@ +#!/usr/bin/gawk -f +# SPDX-License-Identifier: GPL-2.0 +# verify_builtin_ranges.awk: Verify address range data for builtin modules +# Written by Kris Van Hees <kris.van.hees@oracle.com> +# +# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \ +# modules.builtin vmlinux.map vmlinux.o.map \ +# [ <build-dir> ] +# + +# Return the module name(s) (if any) associated with the given object. +# +# If we have seen this object before, return information from the cache. +# Otherwise, retrieve it from the corresponding .cmd file. +# +function get_module_info(fn, mod, obj, mfn, s) { + if (fn in omod) + return omod[fn]; + + if (match(fn, /\/[^/]+$/) == 0) + return ""; + + obj = fn; + mod = ""; + mfn = ""; + fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd"; + if (getline s <fn == 1) { + if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) { + mfn = substr(s, RSTART + 16, RLENGTH - 16); + gsub(/['"]/, "", mfn); + + mod = mfn; + gsub(/([^/ ]*\/)+/, "", mod); + gsub(/-/, "_", mod); + } + } else { + print "ERROR: Failed to read: " fn "\n\n" \ + " Invalid kernel build directory (" kdir ")\n" \ + " or its content does not match " ARGV[1] >"/dev/stderr"; + close(fn); + total = 0; + exit(1); + } + close(fn); + + # A single module (common case) also reflects objects that are not part + # of a module. Some of those objects have names that are also a module + # name (e.g. core). We check the associated module file name, and if + # they do not match, the object is not part of a module. + if (mod !~ / /) { + if (!(mod in mods)) + return ""; + if (mods[mod] != mfn) + return ""; + } + + # At this point, mod is a single (valid) module name, or a list of + # module names (that do not need validation). + omod[obj] = mod; + close(fn); + + return mod; +} + +# Return a representative integer value for a given hexadecimal address. +# +# Since all kernel addresses fall within the same memory region, we can safely +# strip off the first 6 hex digits before performing the hex-to-dec conversion, +# thereby avoiding integer overflows. +# +function addr2val(val) { + sub(/^0x/, "", val); + if (length(val) == 16) + val = substr(val, 5); + return strtonum("0x" val); +} + +# Determine the kernel build directory to use (default is .). +# +BEGIN { + if (ARGC > 6) { + kdir = ARGV[ARGC - 1]; + ARGV[ARGC - 1] = ""; + } else + kdir = "."; +} + +# (1) Load the built-in module address range data. +# +ARGIND == 1 { + ranges[FNR] = $0; + rcnt++; + next; +} + +# (2) Annotate System.map symbols with module names. +# +ARGIND == 2 { + addr = addr2val($1); + name = $3; + + while (addr >= mod_eaddr) { + if (sect_symb) { + if (sect_symb != name) + next; + + sect_base = addr - sect_off; + if (dbg) + printf "[%s] BASE (%s) %016x - %016x = %016x\n", sect_name, sect_symb, addr, sect_off, sect_base >"/dev/stderr"; + sect_symb = 0; + } + + if (++ridx > rcnt) + break; + + $0 = ranges[ridx]; + sub(/-/, " "); + if ($4 != "=") { + sub(/-/, " "); + mod_saddr = strtonum("0x" $2) + sect_base; + mod_eaddr = strtonum("0x" $3) + sect_base; + $1 = $2 = $3 = ""; + sub(/^ +/, ""); + mod_name = $0; + + if (dbg) + printf "[%s] %s from %016x to %016x\n", sect_name, mod_name, mod_saddr, mod_eaddr >"/dev/stderr"; + } else { + sect_name = $1; + sect_off = strtonum("0x" $2); + sect_symb = $5; + } + } + + idx = addr"-"name; + if (addr >= mod_saddr && addr < mod_eaddr) + sym2mod[idx] = mod_name; + + next; +} + +# Once we are done annotating the System.map, we no longer need the ranges data. +# +FNR == 1 && ARGIND == 3 { + delete ranges; +} + +# (3) Build a lookup map of built-in module names. +# +# Lines from modules.builtin will be like: +# kernel/crypto/lzo-rle.ko +# and we derive the built-in module name from this as "lzo_rle" and associate +# it with object name "crypto/lzo-rle". +# +ARGIND == 3 { + sub(/kernel\//, ""); # strip off "kernel/" prefix + sub(/\.ko$/, ""); # strip off .ko suffix + + mod = $1; + sub(/([^/]*\/)+/, "", mod); # mod = basename($1) + gsub(/-/, "_", mod); # Convert - to _ + + mods[mod] = $1; + next; +} + +# (4) Get a list of symbols (per object). +# +# Symbols by object are read from vmlinux.map, with fallback to vmlinux.o.map +# if vmlinux is found to have inked in vmlinux.o. +# + +# If we were able to get the data we need from vmlinux.map, there is no need to +# process vmlinux.o.map. +# +FNR == 1 && ARGIND == 5 && total > 0 { + if (dbg) + printf "Note: %s is not needed.\n", FILENAME >"/dev/stderr"; + exit; +} + +# First determine whether we are dealing with a GNU ld or LLVM lld linker map. +# +ARGIND >= 4 && FNR == 1 && NF == 7 && $1 == "VMA" && $7 == "Symbol" { + map_is_lld = 1; + next; +} + +# (LLD) Convert a section record fronm lld format to ld format. +# +ARGIND >= 4 && map_is_lld && NF == 5 && /[0-9] [^ ]/ { + $0 = $5 " 0x"$1 " 0x"$3 " load address 0x"$2; +} + +# (LLD) Convert an object record from lld format to ld format. +# +ARGIND >= 4 && map_is_lld && NF == 5 && $5 ~ /:\(\./ { + gsub(/\)/, ""); + sub(/:\(/, " "); + sub(/ vmlinux\.a\(/, " "); + $0 = " "$6 " 0x"$1 " 0x"$3 " " $5; +} + +# (LLD) Convert a symbol record from lld format to ld format. +# +ARGIND >= 4 && map_is_lld && NF == 5 && $5 ~ /^[A-Za-z_][A-Za-z0-9_]*$/ { + $0 = " 0x" $1 " " $5; +} + +# (LLD) We do not need any other ldd linker map records. +# +ARGIND >= 4 && map_is_lld && /^[0-9a-f]{16} / { + next; +} + +# Handle section records with long section names (spilling onto a 2nd line). +# +ARGIND >= 4 && !map_is_lld && NF == 1 && /^[^ ]/ { + s = $0; + getline; + $0 = s " " $0; +} + +# Next section - previous one is done. +# +ARGIND >= 4 && /^[^ ]/ { + sect = 0; +} + +# Get the (top level) section name. +# +ARGIND >= 4 && /^[^ ]/ && $2 ~ /^0x/ && $3 ~ /^0x/ { + # Empty section or per-CPU section - ignore. + if (NF < 3 || $1 ~ /\.percpu/) { + sect = 0; + next; + } + + sect = $1; + + next; +} + +# If we are not currently in a section we care about, ignore records. +# +!sect { + next; +} + +# Handle object records with long section names (spilling onto a 2nd line). +# +ARGIND >= 4 && /^ [^ \*]/ && NF == 1 { + # If the section name is long, the remainder of the entry is found on + # the next line. + s = $0; + getline; + $0 = s " " $0; +} + +# If the object is vmlinux.o, we need to consult vmlinux.o.map for per-object +# symbol information +# +ARGIND == 4 && /^ [^ ]/ && NF == 4 { + idx = sect":"$1; + if (!(idx in sect_addend)) { + sect_addend[idx] = addr2val($2); + if (dbg) + printf "ADDEND %s = %016x\n", idx, sect_addend[idx] >"/dev/stderr"; + } + if ($4 == "vmlinux.o") { + need_o_map = 1; + next; + } +} + +# If data from vmlinux.o.map is needed, we only process section and object +# records from vmlinux.map to determine which section we need to pay attention +# to in vmlinux.o.map. So skip everything else from vmlinux.map. +# +ARGIND == 4 && need_o_map { + next; +} + +# Get module information for the current object. +# +ARGIND >= 4 && /^ [^ ]/ && NF == 4 { + msect = $1; + mod_name = get_module_info($4); + mod_eaddr = addr2val($2) + addr2val($3); + + next; +} + +# Process a symbol record. +# +# Evaluate the module information obtained from vmlinux.map (or vmlinux.o.map) +# as follows: +# - For all symbols in a given object: +# - If the symbol is annotated with the same module name(s) that the object +# belongs to, count it as a match. +# - Otherwise: +# - If the symbol is known to have duplicates of which at least one is +# in a built-in module, disregard it. +# - If the symbol us not annotated with any module name(s) AND the +# object belongs to built-in modules, count it as missing. +# - Otherwise, count it as a mismatch. +# +ARGIND >= 4 && /^ / && NF == 2 && $1 ~ /^0x/ { + idx = sect":"msect; + if (!(idx in sect_addend)) + next; + + addr = addr2val($1); + + # Handle the rare but annoying case where a 0-size symbol is placed at + # the byte *after* the module range. Based on vmlinux.map it will be + # considered part of the current object, but it falls just beyond the + # module address range. Unfortunately, its address could be at the + # start of another built-in module, so the only safe thing to do is to + # ignore it. + if (mod_name && addr == mod_eaddr) + next; + + # If we are processing vmlinux.o.map, we need to apply the base address + # of the section to the relative address on the record. + # + if (ARGIND == 5) + addr += sect_addend[idx]; + + idx = addr"-"$2; + mod = ""; + if (idx in sym2mod) { + mod = sym2mod[idx]; + if (sym2mod[idx] == mod_name) { + mod_matches++; + matches++; + } else if (mod_name == "") { + print $2 " in " sym2mod[idx] " (should NOT be)"; + mismatches++; + } else { + print $2 " in " sym2mod[idx] " (should be " mod_name ")"; + mismatches++; + } + } else if (mod_name != "") { + print $2 " should be in " mod_name; + missing++; + } else + matches++; + + total++; + + next; +} + +# Issue the comparison report. +# +END { + if (total) { + printf "Verification of %s:\n", ARGV[1]; + printf " Correct matches: %6d (%d%% of total)\n", matches, 100 * matches / total; + printf " Module matches: %6d (%d%% of matches)\n", mod_matches, 100 * mod_matches / matches; + printf " Mismatches: %6d (%d%% of total)\n", mismatches, 100 * mismatches / total; + printf " Missing: %6d (%d%% of total)\n", missing, 100 * missing / total; + } +}