Message ID | 20220826184911.168442-1-stephen.s.brennan@oracle.com (mailing list archive) |
---|---|
Series | Add support for generating BTF for all variables |
On Fri, Aug 26, 2022 at 11:54 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote: > > Hello everyone, > > BTF offers some exciting new possibilities beyond its original intent; > one of these is making the kernel more self-describing for debug tools. > Kallsyms contains symbol table data, and ORC (for x86_64) contains > information to help unwind stacks. Now, BTF can provide type information > for functions and variables. Taken together, this data is enough to > power the basic (read-only) functions of a postmortem or live debugger, > without falling back on the heavier debugging information formats like > DWARF. What's more, all of these data sources are contained within the > kernel image itself, and are thus available on live systems and within > crash dumps, without consulting any external debug information files. > > However, currently BTF generation emits information only for percpu > variables. This patch series removes that limitation, allowing > generating BTF for all variables, thus providing complete type > information for debuggers. > > Of course, generating additional BTF means that more data must be stored > in the kernel image, and that may not be okay for everyone. Thus, the > new behavior must be explicitly enabled by a flag. > > Testing > ------- > > To verify this change and illustrate the additional space required, I > built v5.19-rc7 on x86_defconfig, with the following additionally > enabled: > > enable DEBUG_INFO_DWARF4 > enable BPF_SYSCALL > enable DEBUG_INFO_BTF > > I then ran pahole to generate BTF from the built vmlinux in three > configurations, and recorded the size of the BTF for each: > > 1) using the current master branch > size: 5505315 bytes > 2) using this patched version, without enabling --encode_all_btf_vars > size: 5505315 bytes > 3) using this patched version, with --encode_all_btf_vars enabled > size: 6811291 bytes > > A total increase of 1.25 MiB, or a 23.7% increase. 
This is definitely > notable, but not unreasonable for many use cases such as desktop or > server applications. I also verified that the data generated by cases 1 > and 2 are byte-for-byte identical: that is, there are no changes to the > generated BTF unless --encode_all_btf_vars is enabled. > > I also verified that the output variables make sense. I created an > application which parses the output BTF and dumps the > declarations (BTF_KIND_VAR and BTF_KIND_FUNC), and then diffed its > output between configurations 2 and 3. I'm happy to provide a link to > that diff (it's of course too big to include in the email). > > End-to-end test > --------------- > > To show this is not just theory, I've created an end-to-end test which > combines BTF generated via this patch series, along with a kernel patch > necessary to expose the kallsyms data [1], and a branch of the drgn > debugger[2] which implements kallsyms and BTF parsing. Core dumps > generated on the resulting kernel can be loaded by the drgn debugger, > and it can read out variables from the dump with full type > information without needing to consult a DWARF debuginfo file. > > Future Work > ----------- > > If this proves acceptable, I'd like to follow up with a kernel patch to > add a configuration option (default=n) for generating BTF with all > variables, which distributions could choose to enable or not. > > There was previous discussion[3] about leveraging split BTF or building > additional kernel modules to contain the extra variables. I believe with > this patch series, it is possible to do that. However, I'd argue that > simpler is better here: the advantage for using BTF is having it all > available in the kernel/module image. Storing extra BTF on the > filesystem would break that advantage, and at that point, you'd be > better off using a debuginfo format like CTF, which is lightweight and > expected to be found on the filesystem. 
With all or nothing approach the distros would have a hard choice to make whether to enable that kconfig, increase BTF and consume extra memory without any obvious reason or just don't do it. Majority probably is not going to enable it. So the feature will become a single vendor only and with inevitable bit-rot. Whereas with split BTF and extra kernel module approach we can enable BTF with all global vars by default. The extra module will be shipped by all distros and tools like bpftrace might start using it. > > [1]: https://lore.kernel.org/lkml/20220517000508.777145-3-stephen.s.brennan@oracle.com/T/ > (The above series is already in the 6.0 RC's) > [2]: https://github.com/brenns10/drgn/tree/kallsyms_plus_btf > [3]: https://lore.kernel.org/bpf/586a6288-704a-f7a7-b256-e18a675927df@oracle.com/ > > Stephen Brennan (7): > dutil: return ELF section name when looked up by index > btf_encoder: Rename percpu structures to variables > btf_encoder: cache all ELF section info > btf_encoder: make the variable array dynamic > btf_encoder: record ELF section for collected variables > btf_encoder: collect all variables > btf_encoder: allow encoding all variables > > btf_encoder.c | 196 +++++++++++++++++++++++++++------------------ > btf_encoder.h | 8 +- > dutil.c | 10 ++- > dutil.h | 2 +- > man-pages/pahole.1 | 6 +- > pahole.c | 31 +++++-- > 6 files changed, 165 insertions(+), 88 deletions(-) > > -- > 2.34.1 >
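The cover letter quoted above mentions an application that parses the generated BTF and lists the BTF_KIND_VAR and BTF_KIND_FUNC declarations; that tool is not part of this thread. As a rough illustration only, the sketch below walks the type section of a raw little-endian BTF blob and counts types per kind. The function name `count_btf_kinds` and the table `_EXTRA_BYTES` are made up for this example, and the per-kind sizes follow the kernel's BTF type encoding as I understand it (kinds up to ENUM64); it is not the author's tool.

```python
import struct
from collections import Counter

BTF_MAGIC = 0xEB9F
BTF_KIND_FUNC = 12
BTF_KIND_VAR = 14

# Hypothetical helper table: bytes of kind-specific data that follow the
# 12-byte struct btf_type, as a function of vlen.
_EXTRA_BYTES = {
    1: lambda vlen: 4,           # INT: one u32 of encoding info
    2: lambda vlen: 0,           # PTR
    3: lambda vlen: 12,          # ARRAY: struct btf_array
    4: lambda vlen: 12 * vlen,   # STRUCT: vlen members
    5: lambda vlen: 12 * vlen,   # UNION: vlen members
    6: lambda vlen: 8 * vlen,    # ENUM: vlen enumerators
    7: lambda vlen: 0,           # FWD
    8: lambda vlen: 0,           # TYPEDEF
    9: lambda vlen: 0,           # VOLATILE
    10: lambda vlen: 0,          # CONST
    11: lambda vlen: 0,          # RESTRICT
    12: lambda vlen: 0,          # FUNC
    13: lambda vlen: 8 * vlen,   # FUNC_PROTO: vlen parameters
    14: lambda vlen: 4,          # VAR: linkage
    15: lambda vlen: 12 * vlen,  # DATASEC: vlen section entries
    16: lambda vlen: 0,          # FLOAT
    17: lambda vlen: 4,          # DECL_TAG
    18: lambda vlen: 0,          # TYPE_TAG
    19: lambda vlen: 12 * vlen,  # ENUM64
}


def count_btf_kinds(data: bytes) -> Counter:
    """Count BTF types by kind in a raw little-endian BTF blob."""
    (magic, _version, _flags, hdr_len,
     type_off, type_len, _str_off, _str_len) = struct.unpack_from("<HBBIIIII", data, 0)
    if magic != BTF_MAGIC:
        raise ValueError("not a little-endian BTF blob")

    pos = hdr_len + type_off          # section offsets are relative to the header end
    end = pos + type_len
    kinds = Counter()
    while pos < end:
        _name_off, info, _size_or_type = struct.unpack_from("<III", data, pos)
        kind = (info >> 24) & 0x1F    # kind lives in bits 24..28
        vlen = info & 0xFFFF          # vlen lives in bits 0..15
        if kind not in _EXTRA_BYTES:
            raise ValueError(f"unknown BTF kind {kind} at offset {pos}")
        kinds[kind] += 1
        pos += 12 + _EXTRA_BYTES[kind](vlen)
    if pos != end:
        raise ValueError("type section is truncated or malformed")
    return kinds
```

On a live system, something like `count_btf_kinds(open("/sys/kernel/btf/vmlinux", "rb").read())[BTF_KIND_VAR]` would give a quick variable count to compare before and after enabling --encode_all_btf_vars, assuming the file is readable and the kernel is little-endian.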
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Fri, Aug 26, 2022 at 11:54 AM Stephen Brennan > <stephen.s.brennan@oracle.com> wrote: [...] >> Future Work >> ----------- >> >> If this proves acceptable, I'd like to follow-up with a kernel patch to >> add a configuration option (default=n) for generating BTF with all >> variables, which distributions could choose to enable or not. >> >> There was previous discussion[3] about leveraging split BTF or building >> additional kernel modules to contain the extra variables. I believe with >> this patch series, it is possible to do that. However, I'd argue that >> simpler is better here: the advantage for using BTF is having it all >> available in the kernel/module image. Storing extra BTF on the >> filesystem would break that advantage, and at that point, you'd be >> better off using a debuginfo format like CTF, which is lightweight and >> expected to be found on the filesystem. > > With all or nothing approach the distros would have a hard choice > to make whether to enable that kconfig, increase BTF and consume > extra memory without any obvious reason or just don't do it. > Majority probably is not going to enable it. > So the feature will become a single vendor only and with > inevitable bit-rot. I'd intend to support it even if just a single distribution enabled it. But I do see your concern. > Whereas with split BTF and extra kernel module approach > we can enable BTF with all global vars by default. > The extra module will be shipped by all distros and tools > like bpftrace might start using it. Split BTF is currently limited to a single base BTF file. We'd need more patches for pahole to support multiple --btf_base files: e.g. vmlinux.btf and vmlinux-variables.btf. There's also the question of modules: presumably we wouldn't try to have "$MODULE" and "$MODULE-btf-extra" modules due to the added complexity. I doubt the space savings would be worth it. 
I can look into these concerns, but if possible I would like to proceed with this series, as it is a separate concern from the exact mechanism by which we include extra BTF into the kernel. Thanks, Stephen
On Wed, Sep 7, 2022 at 12:07 PM Stephen Brennan <stephen.s.brennan@oracle.com> wrote: > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > > On Fri, Aug 26, 2022 at 11:54 AM Stephen Brennan > > <stephen.s.brennan@oracle.com> wrote: > [...] > >> Future Work > >> ----------- > >> > >> If this proves acceptable, I'd like to follow-up with a kernel patch to > >> add a configuration option (default=n) for generating BTF with all > >> variables, which distributions could choose to enable or not. > >> > >> There was previous discussion[3] about leveraging split BTF or building > >> additional kernel modules to contain the extra variables. I believe with > >> this patch series, it is possible to do that. However, I'd argue that > >> simpler is better here: the advantage for using BTF is having it all > >> available in the kernel/module image. Storing extra BTF on the > >> filesystem would break that advantage, and at that point, you'd be > >> better off using a debuginfo format like CTF, which is lightweight and > >> expected to be found on the filesystem. > > > > With all or nothing approach the distros would have a hard choice > > to make whether to enable that kconfig, increase BTF and consume > > extra memory without any obvious reason or just don't do it. > > Majority probably is not going to enable it. > > So the feature will become a single vendor only and with > > inevitable bit-rot. > > I'd intend to support it even if just a single distribution enabled it. > But I do see your concern. This thread was dormant for 8 days. That's a poor example of "intend to support". > > Whereas with split BTF and extra kernel module approach > > we can enable BTF with all global vars by default. > > The extra module will be shipped by all distros and tools > > like bpftrace might start using it. > > Split BTF is currently limited to a single base BTF file. We'd need more > patches for pahole to support multiple --btf_base files: e.g. 
> vmlinux.btf and vmlinux-variables.btf. There's also the question of > modules: presumably we wouldn't try to have "$MODULE" and > "$MODULE-btf-extra" modules due to the added complexity. I doubt the > space savings would be worth it. > > I can look into these concerns, but if possible I would like to proceed > with this series, as it is a separate concern from the exact mechanism > by which we include extra BTF into the kernel. Not an option. Sorry.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Wed, Sep 7, 2022 at 12:07 PM Stephen Brennan > <stephen.s.brennan@oracle.com> wrote: >> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: >> > On Fri, Aug 26, 2022 at 11:54 AM Stephen Brennan >> > <stephen.s.brennan@oracle.com> wrote: >> [...] >> >> Future Work >> >> ----------- >> >> >> >> If this proves acceptable, I'd like to follow-up with a kernel patch to >> >> add a configuration option (default=n) for generating BTF with all >> >> variables, which distributions could choose to enable or not. >> >> >> >> There was previous discussion[3] about leveraging split BTF or building >> >> additional kernel modules to contain the extra variables. I believe with >> >> this patch series, it is possible to do that. However, I'd argue that >> >> simpler is better here: the advantage for using BTF is having it all >> >> available in the kernel/module image. Storing extra BTF on the >> >> filesystem would break that advantage, and at that point, you'd be >> >> better off using a debuginfo format like CTF, which is lightweight and >> >> expected to be found on the filesystem. >> > >> > With all or nothing approach the distros would have a hard choice >> > to make whether to enable that kconfig, increase BTF and consume >> > extra memory without any obvious reason or just don't do it. >> > Majority probably is not going to enable it. >> > So the feature will become a single vendor only and with >> > inevitable bit-rot. >> >> I'd intend to support it even if just a single distribution enabled it. >> But I do see your concern. > > This thread was dormant for 8 days. > That's a poor example of "intend to support". You're right, I definitely could have replied sooner. I'm sorry for that. >> > Whereas with split BTF and extra kernel module approach >> > we can enable BTF with all global vars by default. >> > The extra module will be shipped by all distros and tools >> > like bpftrace might start using it. 
>>
>> Split BTF is currently limited to a single base BTF file. We'd need more
>> patches for pahole to support multiple --btf_base files: e.g.
>> vmlinux.btf and vmlinux-variables.btf. There's also the question of
>> modules: presumably we wouldn't try to have "$MODULE" and
>> "$MODULE-btf-extra" modules due to the added complexity. I doubt the
>> space savings would be worth it.
>>
>> I can look into these concerns, but if possible I would like to proceed
>> with this series, as it is a separate concern from the exact mechanism
>> by which we include extra BTF into the kernel.
>
> Not an option. Sorry.

Ok, so let me describe what I understand to be the proposed design as of
the previous thread, and see if it satisfies your concerns. We can work
from there to make sure we've got a consensus design before going
further.

Option #1
---------

* A new module, "vmlinux-btf-extra" (or something roughly like that) is
  added, which contains BTF only. It is generated with
  --encode_all_btf_vars and uses --btf_base=path/to/vmlinux_btf so that
  it contains only BTF variables. The vmlinux BTF would be generated
  same as always (without the --encode_all_btf_vars).

* In the previous thread, it was proposed [1] that modules could
  include variables in their BTF in order to reduce the complexity of
  the change. Modules would have their BTF generated using
  --encode_all_btf_vars and --btf_base=path/to/vmlinux_btf. The
  resulting hierarchy would look like this:

    vmlinux_btf               [ functions and percpu vars only ]
    |- vmlinux-btf-extra      [ all other vars for vmlinux ]
    |- $MODULE                [ functions and all vars ]
    ...

This option is desirable because it means that we only need 2-level
split BTF, and so we don't actually need to make changes to pahole for
multiple --btf_base files. There are two downsides I see:

(a) While we save space on vmlinux BTF, each module will have a bit of
    extra data for variable types. On my laptop (5.15 based) I have 9.8
    MB of BTF, and if you deduct vmlinux, you're still left with 4.7 MB.
    If we assume the same overhead of 23.7%, that would be 1.1 MB of
    extra module BTF for my particular use case.

    $ ls -l /sys/kernel/btf | awk '{sum += $5} END {print(sum)}'
    9876871
    $ ls -l /sys/kernel/btf/vmlinux
    -r--r--r-- 1 root root 5174406 Sep 7 14:20 /sys/kernel/btf/vmlinux

(b) It's possible for "vmlinux-btf-extras" and "$MODULE" to contain
    duplicate type definitions, wasting additional space. However, as
    far as I understand it, this was already a possibility, e.g.
    $MODULE1 and $MODULE2 could already contain duplicate types. So I
    think this downside is no more.

Option #2
---------

* The vmlinux-btf-extra module is still added as in Option #1.

* Further, each module would have its own "$MODULE-btf-extra" module to
  add in extra BTF. These would be built with a --btf_base=$MODULE.ko
  and of course that BTF is based on vmlinux, so we would have:

    vmlinux_btf               [ functions and percpu vars only ]
    |- vmlinux-btf-extras     [ all other vars for vmlinux ]
    |- $MODULE                [ functions and percpu vars only ]
       |- $MODULE-btf-extra   [ all other vars for $MODULE ]

This is much more complex: pahole must be extended to support a
hierarchy of --btf_base files. The kernel itself may not need to
understand multi-level BTF since there's no requirement that it actually
understand $MODULE-btf-extra, so long as it exposes it via
/sys/kernel/btf/$MODULE-btf-extra. I'd also like to see some sort of
mechanism to allow an administrator to say "please always load
$MODULE-btf-extras alongside $MODULE", but I think that would be a
userspace problem.

This resolves issue (a) from Option #1, of course at implementation
cost.

Regardless of Option #1 or #2, I'd propose that we implement this as a
tristate, similar to what Alan proposed [2]. When set to "m" we use the
solutions described above, and when set to "y", we don't bother with it,
instead using --encode_all_btf_vars for all generation.

If we go with Option #1, no changes to this series should be necessary.
If we go with Option #2, I'll need to extend pahole to support at least
two BTF base files. Please let me know your thoughts.

Stephen

[1]: https://lore.kernel.org/bpf/CAEf4BzZmJKqXaJMBxhKqFNXzjO=eN5gk2xQVnmQVdK2xd3HQ=g@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/alpine.LRH.2.23.451.2205032254390.10133@MyRouter/
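The extrapolation in downside (a) of the message above can be reproduced with a few lines of arithmetic. This is only a sketch of that estimate: the byte counts are the ones quoted from /sys/kernel/btf in the message, the 23.7% figure is the vmlinux overhead from the cover letter applied to module BTF as an assumption, and the function name is invented for the example.

```python
def estimate_module_overhead(total_btf: int, vmlinux_btf: int, overhead: float):
    """Return (module BTF bytes, estimated extra bytes) for a given overhead fraction."""
    module_btf = total_btf - vmlinux_btf   # everything under /sys/kernel/btf except vmlinux
    extra = module_btf * overhead          # assumes modules grow like vmlinux did
    return module_btf, extra


# Figures quoted in the message (bytes).
module_btf, extra = estimate_module_overhead(9876871, 5174406, 0.237)
print(f"module BTF: {module_btf / 1e6:.1f} MB, estimated extra: {extra / 1e6:.1f} MB")
```

The estimate is only as good as the assumption that module BTF grows at the same rate as vmlinux BTF.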
On Wed, Sep 7, 2022 at 2:54 PM Stephen Brennan <stephen.s.brennan@oracle.com> wrote: > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > > On Wed, Sep 7, 2022 at 12:07 PM Stephen Brennan > > <stephen.s.brennan@oracle.com> wrote: > >> > >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > >> > On Fri, Aug 26, 2022 at 11:54 AM Stephen Brennan > >> > <stephen.s.brennan@oracle.com> wrote: > >> [...] > >> >> Future Work > >> >> ----------- > >> >> > >> >> If this proves acceptable, I'd like to follow-up with a kernel patch to > >> >> add a configuration option (default=n) for generating BTF with all > >> >> variables, which distributions could choose to enable or not. > >> >> > >> >> There was previous discussion[3] about leveraging split BTF or building > >> >> additional kernel modules to contain the extra variables. I believe with > >> >> this patch series, it is possible to do that. However, I'd argue that > >> >> simpler is better here: the advantage for using BTF is having it all > >> >> available in the kernel/module image. Storing extra BTF on the > >> >> filesystem would break that advantage, and at that point, you'd be > >> >> better off using a debuginfo format like CTF, which is lightweight and > >> >> expected to be found on the filesystem. > >> > > >> > With all or nothing approach the distros would have a hard choice > >> > to make whether to enable that kconfig, increase BTF and consume > >> > extra memory without any obvious reason or just don't do it. > >> > Majority probably is not going to enable it. > >> > So the feature will become a single vendor only and with > >> > inevitable bit-rot. > >> > >> I'd intend to support it even if just a single distribution enabled it. > >> But I do see your concern. > > > > This thread was dormant for 8 days. > > That's a poor example of "intend to support". > > You're right, I definitely could have replied sooner. I'm sorry for that. 
> > >> > Whereas with split BTF and extra kernel module approach > >> > we can enable BTF with all global vars by default. > >> > The extra module will be shipped by all distros and tools > >> > like bpftrace might start using it. > >> > >> Split BTF is currently limited to a single base BTF file. We'd need more > >> patches for pahole to support multiple --btf_base files: e.g. > >> vmlinux.btf and vmlinux-variables.btf. There's also the question of > >> modules: presumably we wouldn't try to have "$MODULE" and > >> "$MODULE-btf-extra" modules due to the added complexity. I doubt the > >> space savings would be worth it. > >> > >> I can look into these concerns, but if possible I would like to proceed > >> with this series, as it is a separate concern from the exact mechanism > >> by which we include extra BTF into the kernel. > > > > Not an option. Sorry. > > Ok, so let me describe what I understand to be the proposed design as of > the previous thread, and see if it satisfies your concerns. We can work > from there to make sure we've got a consensus design before going > further. I was hoping Andrii and others will provide their opinion. Here are my .02 > Option #1 > --------- > > * A new module, "vmlinux-btf-extra" (or something roughly like that) is > added, which contains BTF only. It is generated with > --encode_all_btf_vars and uses --btf_base=path/to/vmlinux_btf so that > it contains only BTF variables. The vmlinux BTF would be generated > same as always (without the --encode_all_btf_vars). > > * In the previous thread, it was proposed [1] that modules could > include variables in their BTF in order to reduce the complexity of > the change. Modules would have their BTF generated using > --encode_all_btf_vars and --btf_base=path/to/vmlinux_btf. The > resulting hierarchy would look like this: > > vmlinux_btf [ functions and percpu vars only ] > |- vmlinux-btf-extra [ all other vars for vmlinux ] > |- $MODULE [ functions and all vars ] > ... 
> > This option is desirable because it means that we only need 2-level > split BTF, and so we don't actually need to make changes to pahole for > multiple --btf_base files. There are two downsides I see: > > (a) While we save space on vmlinux BTF, each module will have a bit of > extra data for variable types. On my laptop (5.15 based) I have 9.8 > MB of BTF, and if you deduct vmlinux, you're still left with 4.7 MB. > If we assume the same overhead of 23.7%, that would be 1.1 MB of > extra module BTF for my particular use case. > > $ ls -l /sys/kernel/btf | awk '{sum += $5} END {print(sum)}' > 9876871 > $ ls -l /sys/kernel/btf/vmlinux > -r--r--r-- 1 root root 5174406 Sep 7 14:20 /sys/kernel/btf/vmlinux > > (b) It's possible for "vmlinux-btf-extras" and "$MODULE" to contain > duplicate type definitions, wasting additional space. However, as > far as I understand it, this was already a possibility, e.g. > $MODULE1 and $MODULE2 could already contain duplicate types. So I > think this downside is no more. Both concerns are valid, but I'm a bit puzzled with (a). At least in the networking drivers the number of global vars is very small. I expected other drivers to be similar. So having "functions and all vars" in ko-s should not add that much overhead. Maybe you're seeing this overhead because pahole is adding all declared vars and not only the vars that are actually present? That would explain the discrepancy. (b) with a bunch of duplicates is a sign that something is off as well. > > > Option #2 > --------- > > * The vmlinux-btf-extra module is still added as in Option #1. > > * Further, each module would have its own "$MODULE-btf-extra" module to > add in extra BTF. 
These would be built with a --btf_base=$MODULE.ko > and of course that BTF is based on vmlinux, so we would have: > > vmlinux_btf [ functions and percpu vars only ] > |- vmlinux-btf-extras [ all other vars for vmlinux ] > |- $MODULE [ functions and percpu vars only ] > |- $MODULE-btf-extra [ all other vars for $MODULE ] > > This is much more complex, pahole must be extended to support a > hierarchy of --btf_base files. The kernel itself may not need to > understand multi-level BTF since there's no requirement that it actually > understand $MODULE-btf-extra, so long as it exposes it via > /sys/kernel/btf/$MODULE-btf-extra. I'd also like to see some sort of > mechanism to allow an administrator to say "please always load > $MODULE-btf-extras alongside $MODULE", but I think that would be a > userspace problem. > > This resolves issue (a) from option #1, of course at implementation > cost. > > Regardless of Option #1 or #2, I'd propose that we implement this as a > tristate, similar to what Alan proposed [2]. When set to "m" we use the > solutions described above, and when set to "y", we don't bother with it, > instead using --encode_all_btf_vars for all generation. > > If we go with Option #1, no changes to this series should be necessary. > If we go with Option #2, I'll need to extend pahole to support at least > two BTF base files. Please let me know your thoughts. Completely agree that two level btf-extra needs quite a bit more work. Before we proceed with option 2 let's figure out the reason for extra space in option 1.
>> (a) While we save space on vmlinux BTF, each module will have a bit of >> extra data for variable types. On my laptop (5.15 based) I have 9.8 >> MB of BTF, and if you deduct vmlinux, you're still left with 4.7 MB. >> If we assume the same overhead of 23.7%, that would be 1.1 MB of >> extra module BTF for my particular use case. >> >> $ ls -l /sys/kernel/btf | awk '{sum += $5} END {print(sum)}' >> 9876871 >> $ ls -l /sys/kernel/btf/vmlinux >> -r--r--r-- 1 root root 5174406 Sep 7 14:20 /sys/kernel/btf/vmlinux >> >> (b) It's possible for "vmlinux-btf-extras" and "$MODULE" to contain >> duplicate type definitions, wasting additional space. However, as >> far as I understand it, this was already a possibility, e.g. >> $MODULE1 and $MODULE2 could already contain duplicate types. So I >> think this downside is no more. > > Both concerns are valid, but I'm a bit puzzled with (a). > At least in the networking drivers the number of global vars is very small. > I expected other drivers to be similar. > So having "functions and all vars" in ko-s should not add > that much overhead. > > Maybe you're seeing this overhead because pahole is adding > all declared vars and not only the vars that are actually present? > That would explain the discrepancy. > (b) with a bunch of duplicates is a sign that something is off as well. Sorry, I didn't actually have an analysis for module BTF, I was just extrapolating the result I had seen for vmlinux. I went ahead and did a proper test, generating BTF for a distribution kernel from Oracle Linux (kernel-uek-5.15.0-1.43.4.1.el9uek.x86_64) - something that I easily had on hand and could regenerate the BTF for quickly. 
Basically, the steps were:

    pahole -J vmlinux --btf_encode_detached=vmlinux.btf
    pahole -J vmlinux --btf_encode_detached=vmlinux.btf.all \
        --encode_all_btf_vars

    # For each module
    pahole -J $MODULE --btf_encode_detached=$MODULE.btf \
        --btf_base=vmlinux.btf
    pahole -J $MODULE --btf_encode_detached=$MODULE.btf.all \
        --btf_base=vmlinux.btf --encode_all_btf_vars

    # what if we based the module BTF on the "vmlinux.btf.all" instead?
    pahole -J $MODULE --btf_encode_detached=$MODULE.btf.all.all \
        --btf_base=vmlinux.btf.all --encode_all_btf_vars

And then using ls/awk to sum up the bytes of each BTF file. Results are:

vmlinux:

    -rw-r-----. 1 opc opc 4904193 Sep 9 18:58 vmlinux.btf
    -rw-r-----. 1 opc opc 6534684 Sep 9 18:58 vmlinux.btf.all

In this case there's a 33% increase in BTF size.

modules:

    $ ls -l *.btf | awk '{sum += $5} END {print(sum)}'
    43979532
    $ ls -l *.btf.all | awk '{sum += $5} END {print(sum)}'
    44757792
    $ ls -l *.btf.all.all | awk '{sum += $5} END {print(sum)}'
    44696639

So the "*.btf.all.all" modules were just an experiment to see if the
extra data inside "vmlinux.btf.all" could reduce some duplication in
module BTF. The answer was yes, but not enough to make up for the
increase in the vmlinux BTF size.

The "*.btf.all" modules are the ones we would actually expect to use in
Option #1, where we have a vmlinux-btf-extras and the rest of the
modules include their globals in their BTF sections directly, and are
based off of the vmlinux BTF. This test shows on average, that the
module BTF size would grow by 1.6% with Option #1. Of course the exact
memory size that accounts for will vary by workload, depending on how
many modules are loaded. But I'd imagine, assuming you have around 5MB
of module BTF *actually loaded*, then the overhead would be around 85k
bytes. I don't know about how you feel, but I think that sounds
acceptable, it's just 22 pages at 4k size :)

Let me know how it sounds to you.
Thanks, Stephen >> >> >> Option #2 >> --------- >> >> * The vmlinux-btf-extra module is still added as in Option #1. >> >> * Further, each module would have its own "$MODULE-btf-extra" module to >> add in extra BTF. These would be built with a --btf_base=$MODULE.ko >> and of course that BTF is based on vmlinux, so we would have: >> >> vmlinux_btf [ functions and percpu vars only ] >> |- vmlinux-btf-extras [ all other vars for vmlinux ] >> |- $MODULE [ functions and percpu vars only ] >> |- $MODULE-btf-extra [ all other vars for $MODULE ] >> >> This is much more complex, pahole must be extended to support a >> hierarchy of --btf_base files. The kernel itself may not need to >> understand multi-level BTF since there's no requirement that it actually >> understand $MODULE-btf-extra, so long as it exposes it via >> /sys/kernel/btf/$MODULE-btf-extra. I'd also like to see some sort of >> mechanism to allow an administrator to say "please always load >> $MODULE-btf-extras alongside $MODULE", but I think that would be a >> userspace problem. >> >> This resolves issue (a) from option #1, of course at implementation >> cost. >> >> Regardless of Option #1 or #2, I'd propose that we implement this as a >> tristate, similar to what Alan proposed [2]. When set to "m" we use the >> solutions described above, and when set to "y", we don't bother with it, >> instead using --encode_all_btf_vars for all generation. >> >> If we go with Option #1, no changes to this series should be necessary. >> If we go with Option #2, I'll need to extend pahole to support at least >> two BTF base files. Please let me know your thoughts. > > Completely agree that two level btf-extra needs quite a bit more work. > Before we proceed with option 2 let's figure out > the reason for extra space in option 1.
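The percentages in the message above follow directly from the byte counts it lists. The sketch below simply recomputes them; `pct_increase` is a name chosen for this example, and the inputs are the ls/awk sums quoted in the message. Recomputed this way, vmlinux BTF grows by about 33.2%, the `*.btf.all` set by about 1.8%, and the `*.btf.all.all` set by about 1.6%, so the 1.6% quoted in the text appears to line up with the latter set.

```python
def pct_increase(before: int, after: int) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Byte counts quoted in the message.
VMLINUX_BTF = 4904193
VMLINUX_BTF_ALL = 6534684
MODULES_BTF = 43979532           # sum of *.btf
MODULES_BTF_ALL = 44757792       # sum of *.btf.all (based on vmlinux.btf)
MODULES_BTF_ALL_ALL = 44696639   # sum of *.btf.all.all (based on vmlinux.btf.all)

for label, before, after in [
    ("vmlinux", VMLINUX_BTF, VMLINUX_BTF_ALL),
    ("modules (*.btf.all)", MODULES_BTF, MODULES_BTF_ALL),
    ("modules (*.btf.all.all)", MODULES_BTF, MODULES_BTF_ALL_ALL),
]:
    print(f"{label}: +{after - before} bytes, {pct_increase(before, after):.2f}%")
```

Either way the module-side growth stays under 2%, which is the point the message relies on.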
On Fri, Sep 9, 2022 at 12:31 PM Stephen Brennan <stephen.s.brennan@oracle.com> wrote: > > >> (a) While we save space on vmlinux BTF, each module will have a bit of > >> extra data for variable types. On my laptop (5.15 based) I have 9.8 > >> MB of BTF, and if you deduct vmlinux, you're still left with 4.7 MB. > >> If we assume the same overhead of 23.7%, that would be 1.1 MB of > >> extra module BTF for my particular use case. > >> > >> $ ls -l /sys/kernel/btf | awk '{sum += $5} END {print(sum)}' > >> 9876871 > >> $ ls -l /sys/kernel/btf/vmlinux > >> -r--r--r-- 1 root root 5174406 Sep 7 14:20 /sys/kernel/btf/vmlinux > >> > >> (b) It's possible for "vmlinux-btf-extras" and "$MODULE" to contain > >> duplicate type definitions, wasting additional space. However, as > >> far as I understand it, this was already a possibility, e.g. > >> $MODULE1 and $MODULE2 could already contain duplicate types. So I > >> think this downside is no more. > > > > Both concerns are valid, but I'm a bit puzzled with (a). > > At least in the networking drivers the number of global vars is very small. > > I expected other drivers to be similar. > > So having "functions and all vars" in ko-s should not add > > that much overhead. > > > > Maybe you're seeing this overhead because pahole is adding > > all declared vars and not only the vars that are actually present? > > That would explain the discrepancy. > > (b) with a bunch of duplicates is a sign that something is off as well. > > Sorry, I didn't actually have an analysis for module BTF, I was just > extrapolating the result I had seen for vmlinux. I went ahead and did a > proper test, generating BTF for a distribution kernel from Oracle Linux > (kernel-uek-5.15.0-1.43.4.1.el9uek.x86_64) - something that I easily had > on hand and could regenerate the BTF for quickly. 
> > Basically, the steps were: > > pahole -J vmlinux --btf_encode_detached=vmlinux.btf > pahole -J vmlinux --btf_encode_detached=vmlinux.btf.all \ > --encode_all_btf_vars > > # For each module > pahole -J $MODULE --btf_encode_detached=$MODULE.btf \ > --btf_base=vmlinux.btf > pahole -J $MODULE --btf_encode_detached=$MODULE.btf.all \ > --btf_base=vmlinux.btf --encode_all_btf_vars > > # what if we based the module BTF on the "vmlinux.btf.all" instead? > pahole -J $MODULE --btf_encode_detached=$MODULE.btf.all.all \ > --btf_base=vmlinux.btf.all --encode_all_btf_vars > > And then using ls/awk to sum up the bytes of each BTF file. Results are: > > vmlinux: > > -rw-r-----. 1 opc opc 4904193 Sep 9 18:58 vmlinux.btf > -rw-r-----. 1 opc opc 6534684 Sep 9 18:58 vmlinux.btf.all > > In this case there's a 33% increase in BTF size. > > modules: > > $ ls -l *.btf | awk '{sum += $5} END {print(sum)}' > 43979532 > $ ls -l *.btf.all | awk '{sum += $5} END {print(sum)}' > 44757792 > $ ls -l *.btf.all.all | awk '{sum += $5} END {print(sum)}' > 44696639 > > So the "*.btf.all.all" modules were just an experiment to see if the > extra data inside "vmlinux.btf.all" could reduce some duplication in > module BTF. The answer was yes, but not enough to make up for the > increase in the vmlinux BTF size. > > The "*.btf.all" modules are the ones we would actually expect to use in > Option #1, where we have a vmlinux-btf-extras and the rest of the > modules include their globals in their BTF sections directly, and are > based off of the vmlinux BTF. This test shows on average, that the > module BTF size would grow by 1.6% with Option #1. Of course the exact > memory size that accounts for will vary by workload, depending on how > many modules are loaded. But I'd imagine, assuming you have around 5MB > of module BTF *actually loaded*, then the overhead would be around 85k > bytes. 
I don't know about how you feel, but I think that sounds > acceptable, it's just 22 pages at 4k size :) > > Let me know how it sounds to you. > > Thanks, > Stephen > > >> > >> > >> Option #2 > >> --------- > >> > >> * The vmlinux-btf-extra module is still added as in Option #1. > >> > >> * Further, each module would have its own "$MODULE-btf-extra" module to > >> add in extra BTF. These would be built with a --btf_base=$MODULE.ko > >> and of course that BTF is based on vmlinux, so we would have: > >> > >> vmlinux_btf [ functions and percpu vars only ] > >> |- vmlinux-btf-extras [ all other vars for vmlinux ] > >> |- $MODULE [ functions and percpu vars only ] > >> |- $MODULE-btf-extra [ all other vars for $MODULE ] > >> > >> This is much more complex, pahole must be extended to support a > >> hierarchy of --btf_base files. The kernel itself may not need to > >> understand multi-level BTF since there's no requirement that it actually > >> understand $MODULE-btf-extra, so long as it exposes it via > >> /sys/kernel/btf/$MODULE-btf-extra. I'd also like to see some sort of > >> mechanism to allow an administrator to say "please always load > >> $MODULE-btf-extras alongside $MODULE", but I think that would be a > >> userspace problem. > >> > >> This resolves issue (a) from option #1, of course at implementation > >> cost. > >> > >> Regardless of Option #1 or #2, I'd propose that we implement this as a > >> tristate, similar to what Alan proposed [2]. When set to "m" we use the > >> solutions described above, and when set to "y", we don't bother with it, > >> instead using --encode_all_btf_vars for all generation. > >> > >> If we go with Option #1, no changes to this series should be necessary. > >> If we go with Option #2, I'll need to extend pahole to support at least > >> two BTF base files. Please let me know your thoughts. > > > > Completely agree that two level btf-extra needs quite a bit more work. 
> > Before we proceed with option 2 let's figure out > > the reason for extra space in option 1. I don't think an extra module for each module just for keeping those all-var-BTFs is acceptable, so Option #2 doesn't even seem like an option. But given a very small increase in size of BTF for modules when including variables, I think Option #1 is quite reasonable.