Message ID | 20240504003006.3303334-3-andrii@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | ioctl()-based API to query VMAs from /proc/<pid>/maps | expand |
On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > /proc/<pid>/maps file is extremely useful in practice for various tasks > involving figuring out process memory layout, what files are backing any > given memory range, etc. One important class of applications that > absolutely rely on this are profilers/stack symbolizers. They would > normally capture stack trace containing absolute memory addresses of > some functions, and would then use /proc/<pid>/maps file to file > corresponding backing ELF files, file offsets within them, and then > continue from there to get yet more information (ELF symbols, DWARF > information) to get human-readable symbolic information. > > As such, there are both performance and correctness requirement > involved. This address to VMA information translation has to be done as > efficiently as possible, but also not miss any VMA (especially in the > case of loading/unloading shared libraries). > > Unfortunately, for all the /proc/<pid>/maps file universality and > usefulness, it doesn't fit the above 100%. Is this a new change or has it always been this way? > First, it's text based, which makes its programmatic use from > applications and libraries unnecessarily cumbersome and slow due to the > need to do text parsing to get necessary pieces of information. slow in what way? How has it never been noticed before as a problem? And exact numbers are appreciated please, yes open/read/close seems slower than open/ioctl/close, but is it really overall an issue in the real world for anything? Text apis are good as everyone can handle them, ioctls are harder for obvious reasons. > Second, it's main purpose is to emit all VMAs sequentially, but in > practice captured addresses would fall only into a small subset of all > process' VMAs, mainly containing executable text. Yet, library would > need to parse most or all of the contents to find needed VMAs, as there > is no way to skip VMAs that are of no use. Efficient library can do the > linear pass and it is still relatively efficient, but it's definitely an > overhead that can be avoided, if there was a way to do more targeted > querying of the relevant VMA information. I don't understand, is this a bug in the current files? If so, why not just fix that up? And again "efficient" need to be quantified. > Another problem when writing generic stack trace symbolization library > is an unfortunate performance-vs-correctness tradeoff that needs to be > made. What requirement has caused a "generic stack trace symbolization library" to be needed at all? What is the problem you are trying to solve that is not already solved by existing tools? > Library has to make a decision to either cache parsed contents of > /proc/<pid>/maps for service future requests (if application requests to > symbolize another set of addresses, captured at some later time, which > is typical for periodic/continuous profiling cases) to avoid higher > costs of needed to re-parse this file or caching the contents in memory > to speed up future requests. In the former case, more memory is used for > the cache and there is a risk of getting stale data if application > loaded/unloaded shared libraries, or otherwise changed its set of VMAs > through additiona mmap() calls (and other means of altering memory > address space). In the latter case, it's the performance hit that comes > from re-opening the file and re-reading/re-parsing its contents all over > again. Again, "performance hit" needs to be justified, it shouldn't be much overall. > This patch aims to solve this problem by providing a new API built on > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary > interface, avoiding the cost and awkwardness of textual representation > for programmatic use. Some people find text easier to handle for programmatic use :) > It's designed to be extensible and > forward/backward compatible by including user-specified field size and > using copy_struct_from_user() approach. But, most importantly, it allows > to do point queries for specific single address, specified by user. And > this is done efficiently using VMA iterator. Ok, maybe this is the main issue, you only want one at a time? > User has a choice to pick either getting VMA that covers provided > address or -ENOENT if none is found (exact, least surprising, case). Or, > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can > get either VMA that covers the address (if there is one), or the closest > next VMA (i.e., VMA with the smallest vm_start > addr). The later allows > more efficient use, but, given it could be a surprising behavior, > requires an explicit opt-in. > > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes > sense given it's querying the same set of VMA data. All the permissions > checks performed on /proc/<pid>/maps opening fit here as well. > ioctl-based implementation is fetching remembered mm_struct reference, > but otherwise doesn't interfere with seq_file-based implementation of > /proc/<pid>/maps textual interface, and so could be used together or > independently without paying any price for that. > > There is one extra thing that /proc/<pid>/maps doesn't currently > provide, and that's an ability to fetch ELF build ID, if present. User > has control over whether this piece of information is requested or not > by either setting build_id_size field to zero or non-zero maximum buffer > size they provided through build_id_addr field (which encodes user > pointer as __u64 field). > > The need to get ELF build ID reliably is an important aspect when > dealing with profiling and stack trace symbolization, and > /proc/<pid>/maps textual representation doesn't help with this, > requiring applications to open underlying ELF binary through > /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra > permissions implications due giving a full access to the binary from > (potentially) another process, while all application is interested in is > build ID. Giving an ability to request just build ID doesn't introduce > any additional security concerns, on top of what /proc/<pid>/maps is > already concerned with, simplifying the overall logic. > > Kernel already implements build ID fetching, which is used from BPF > subsystem. We are reusing this code here, but plan a follow up changes > to make it work better under more relaxed assumption (compared to what > existing code assumes) of being called from user process context, in > which page faults are allowed. BPF-specific implementation currently > bails out if necessary part of ELF file is not paged in, all due to > extra BPF-specific restrictions (like the need to fetch build ID in > restrictive contexts such as NMI handler). > > Note also, that fetching VMA name (e.g., backing file path, or special > hard-coded or user-provided names) is optional just like build ID. If > user sets vma_name_size to zero, kernel code won't attempt to retrieve > it, saving resources. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Where is the userspace code that uses this new api you have created? > --- > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/fs.h | 32 ++++++++ > 2 files changed, 197 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 8e503a1635b7..cb7b1ff1a144 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -22,6 +22,7 @@ > #include <linux/pkeys.h> > #include <linux/minmax.h> > #include <linux/overflow.h> > +#include <linux/buildid.h> > > #include <asm/elf.h> > #include <asm/tlb.h> > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > return do_maps_open(inode, file, &proc_pid_maps_op); > } > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > +{ > + struct procfs_procmap_query karg; > + struct vma_iterator iter; > + struct vm_area_struct *vma; > + struct mm_struct *mm; > + const char *name = NULL; > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > + __u64 usize; > + int err; > + > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > + return -EFAULT; > + if (usize > PAGE_SIZE) Nice, where did you document that? And how is that portable given that PAGE_SIZE can be different on different systems? and why aren't you checking the actual structure size instead? You can easily run off the end here without knowing it. > + return -E2BIG; > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > + return -EINVAL; Ok, so you have two checks? How can the first one ever fail? > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > + if (err) > + return err; > + > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > + return -EINVAL; > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > + return -EINVAL; > + if (!!karg.build_id_size != !!karg.build_id_addr) > + return -EINVAL; So you want values to be set, right? > + > + mm = priv->mm; > + if (!mm || !mmget_not_zero(mm)) > + return -ESRCH; What is this error for? Where is this documentned? > + if (mmap_read_lock_killable(mm)) { > + mmput(mm); > + return -EINTR; > + } > + > + vma_iter_init(&iter, mm, karg.query_addr); > + vma = vma_next(&iter); > + if (!vma) { > + err = -ENOENT; > + goto out; > + } > + /* user wants covering VMA, not the closest next one */ > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > + vma->vm_start > karg.query_addr) { > + err = -ENOENT; > + goto out; > + } > + > + karg.vma_start = vma->vm_start; > + karg.vma_end = vma->vm_end; > + > + if (vma->vm_file) { > + const struct inode *inode = file_user_inode(vma->vm_file); > + > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > + karg.dev_minor = MINOR(inode->i_sb->s_dev); So the major/minor is that of the file superblock? Why? > + karg.inode = inode->i_ino; What is userspace going to do with this? > + } else { > + karg.vma_offset = 0; > + karg.dev_major = 0; > + karg.dev_minor = 0; > + karg.inode = 0; Why not set everything to 0 up above at the beginning so you never miss anything, and you don't miss any holes accidentally in the future. > + } > + > + karg.vma_flags = 0; > + if (vma->vm_flags & VM_READ) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > + if (vma->vm_flags & VM_WRITE) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > + if (vma->vm_flags & VM_EXEC) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > + if (vma->vm_flags & VM_MAYSHARE) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > + > + if (karg.build_id_size) { > + __u32 build_id_sz = BUILD_ID_SIZE_MAX; > + > + err = build_id_parse(vma, build_id_buf, &build_id_sz); > + if (!err) { > + if (karg.build_id_size < build_id_sz) { > + err = -ENAMETOOLONG; > + goto out; > + } > + karg.build_id_size = build_id_sz; > + } > + } > + > + if (karg.vma_name_size) { > + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); > + const struct path *path; > + const char *name_fmt; > + size_t name_sz = 0; > + > + get_vma_name(vma, &path, &name, &name_fmt); > + > + if (path || name_fmt || name) { > + name_buf = kmalloc(name_buf_sz, GFP_KERNEL); > + if (!name_buf) { > + err = -ENOMEM; > + goto out; > + } > + } > + if (path) { > + name = d_path(path, name_buf, name_buf_sz); > + if (IS_ERR(name)) { > + err = PTR_ERR(name); > + goto out; > + } > + name_sz = name_buf + name_buf_sz - name; > + } else if (name || name_fmt) { > + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name); > + name = name_buf; > + } > + if (name_sz > name_buf_sz) { > + err = -ENAMETOOLONG; > + goto out; > + } > + karg.vma_name_size = name_sz; > + } > + > + /* unlock and put mm_struct before copying data to user */ > + mmap_read_unlock(mm); > + mmput(mm); > + > + if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr, > + name, karg.vma_name_size)) { > + kfree(name_buf); > + return -EFAULT; > + } > + kfree(name_buf); > + > + if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr, > + build_id_buf, karg.build_id_size)) > + return -EFAULT; > + > + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) > + return -EFAULT; > + > + return 0; > + > +out: > + mmap_read_unlock(mm); > + mmput(mm); > + kfree(name_buf); > + return err; > +} > + > +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > +{ > + struct seq_file *seq = file->private_data; > + struct proc_maps_private *priv = seq->private; > + > + switch (cmd) { > + case PROCFS_PROCMAP_QUERY: > + return do_procmap_query(priv, (void __user *)arg); > + default: > + return -ENOIOCTLCMD; > + } > +} > + > const struct file_operations proc_pid_maps_operations = { > .open = pid_maps_open, > .read = seq_read, > .llseek = seq_lseek, > .release = proc_map_release, > + .unlocked_ioctl = procfs_procmap_ioctl, > + .compat_ioctl = procfs_procmap_ioctl, > }; > > /* > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index 45e4e64fd664..fe8924a8d916 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -393,4 +393,36 @@ struct pm_scan_arg { > __u64 return_mask; > }; > > +/* /proc/<pid>/maps ioctl */ > +#define PROCFS_IOCTL_MAGIC 0x9f Don't you need to document this in the proper place? > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > + > +enum procmap_query_flags { > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > +}; > + > +enum procmap_vma_flags { > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > + PROCFS_PROCMAP_VMA_SHARED = 0x08, Are these bits? If so, please use the bit macro for it to make it obvious. > +}; > + > +struct procfs_procmap_query { > + __u64 size; > + __u64 query_flags; /* in */ Does this map to the procmap_vma_flags enum? if so, please say so. > + __u64 query_addr; /* in */ > + __u64 vma_start; /* out */ > + __u64 vma_end; /* out */ > + __u64 vma_flags; /* out */ > + __u64 vma_offset; /* out */ > + __u64 inode; /* out */ What is the inode for, you have an inode for the file already, why give it another one? > + __u32 dev_major; /* out */ > + __u32 dev_minor; /* out */ What is major/minor for? > + __u32 vma_name_size; /* in/out */ > + __u32 build_id_size; /* in/out */ > + __u64 vma_name_addr; /* in */ > + __u64 build_id_addr; /* in */ Why not document this all using kerneldoc above the structure? anyway, I don't like ioctls, but there is a place for them, you just have to actually justify the use for them and not say "not efficient enough" as that normally isn't an issue overall. thanks, greg k-h
On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > /proc/<pid>/maps file is extremely useful in practice for various tasks > > involving figuring out process memory layout, what files are backing any > > given memory range, etc. One important class of applications that > > absolutely rely on this are profilers/stack symbolizers. They would > > normally capture stack trace containing absolute memory addresses of > > some functions, and would then use /proc/<pid>/maps file to file > > corresponding backing ELF files, file offsets within them, and then > > continue from there to get yet more information (ELF symbols, DWARF > > information) to get human-readable symbolic information. > > > > As such, there are both performance and correctness requirement > > involved. This address to VMA information translation has to be done as > > efficiently as possible, but also not miss any VMA (especially in the > > case of loading/unloading shared libraries). > > > > Unfortunately, for all the /proc/<pid>/maps file universality and > > usefulness, it doesn't fit the above 100%. > > Is this a new change or has it always been this way? > Probably always has been this way. My first exposure to profiling and stack symbolization was about 7 years ago, and already then /proc/<pid>/maps was the only way to do this, and not a 100% fit even then. > > First, it's text based, which makes its programmatic use from > > applications and libraries unnecessarily cumbersome and slow due to the > > need to do text parsing to get necessary pieces of information. > > slow in what way? How has it never been noticed before as a problem? It's just inherently slower to parse text to fish out a bunch of integers (vma_start address, offset, inode+dev and file paths are typical pieces needed to "normalize" captured stack trace addresses). It's not too bad in terms of programming and performance for scanf-like APIs, but without scanf, you are dealing with splitting by whitespaces and tons of unnecessary string allocations. It was noticed, I think people using this for profiling/symbolization are not necessarily well versed in kernel development and they just get by with what kernel provides. > > And exact numbers are appreciated please, yes open/read/close seems > slower than open/ioctl/close, but is it really overall an issue in the > real world for anything? > > Text apis are good as everyone can handle them, ioctls are harder for > obvious reasons. Yes, and acknowledged the usefulness of text-based interface. But it's my (and other people I've talked with that had to deal with these textual interfaces) opinion that using binary interfaces are far superior when it comes to *programmatic* usage (i.e., from C/C++/Rust/whatever languages directly). Textual is great for bash scripts and human debugging, of course. > > > Second, it's main purpose is to emit all VMAs sequentially, but in > > practice captured addresses would fall only into a small subset of all > > process' VMAs, mainly containing executable text. Yet, library would > > need to parse most or all of the contents to find needed VMAs, as there > > is no way to skip VMAs that are of no use. Efficient library can do the > > linear pass and it is still relatively efficient, but it's definitely an > > overhead that can be avoided, if there was a way to do more targeted > > querying of the relevant VMA information. > > I don't understand, is this a bug in the current files? If so, why not > just fix that up? > It's not a bug, I think /proc/<pid>/maps was targeted to describe *entire* address space, but for profiling and symbolization needs we need to find only a small subset of relevant VMAs. There is nothing wrong with existing implementation, it's just not a 100% fit for the more specialized "let's find relevant VMAs for this set of addresses" problem. > And again "efficient" need to be quantified. You probably saw patch #5 where I solve exactly the same problem in two different ways. And the problem is typical for symbolization: you are given a bunch of addresses within some process, we need to find files they belong to and what file offset they are mapped to. This is then used to, for example, match them to ELF symbols representing functions. > > > Another problem when writing generic stack trace symbolization library > > is an unfortunate performance-vs-correctness tradeoff that needs to be > > made. > > What requirement has caused a "generic stack trace symbolization > library" to be needed at all? What is the problem you are trying to > solve that is not already solved by existing tools? Capturing stack trace is a very common part, especially for BPF-based tools and applications. E.g., bpftrace allows one to capture stack traces for some "interesting events" (whatever that is, some kernel function call, user function call, perf event, there is tons of flexibility). Stack traces answer "how did we get here", but it's just an array of addresses, which need to be translated to something that humans can make sense of. That's what the symbolization library is helping with. This process is multi-step, quite involved, hard to get right with a good balance of efficiency, correctness and fullness of information (there is always a choice of doing simplistic symbolization using just ELF symbols, or much more expensive but also fuller symbolization using DWARF information, which gives also file name + line number information, can symbolize inlined functions, etc). One such library is blazesym ([0], cc'ed Daniel, who's working on it), which is developed by Meta for both internal use in our fleet-wide profiler, and is also in the process of being integrated into bpftrace (to improve bpftrace's current somewhat limited symbolization approach based on BCC). There is also a non-Meta project (I believe Datadog) that is using it for its own needs. Symbolization is quite a common task, that's highly non-trivial. [0] https://github.com/libbpf/blazesym > > > Library has to make a decision to either cache parsed contents of > > /proc/<pid>/maps for service future requests (if application requests to > > symbolize another set of addresses, captured at some later time, which > > is typical for periodic/continuous profiling cases) to avoid higher > > costs of needed to re-parse this file or caching the contents in memory > > to speed up future requests. In the former case, more memory is used for > > the cache and there is a risk of getting stale data if application > > loaded/unloaded shared libraries, or otherwise changed its set of VMAs > > through additiona mmap() calls (and other means of altering memory > > address space). In the latter case, it's the performance hit that comes > > from re-opening the file and re-reading/re-parsing its contents all over > > again. > > Again, "performance hit" needs to be justified, it shouldn't be much > overall. I'm not sure how to answer whether it's much or not. Can you be a bit more specific on what you'd like to see? But I want to say that sensitivity to any overhead differs a lot depending on specifics. As general rule, we try to minimize any resource usage of the profiler/symbolizer itself on the host that is being profiled, to minimize the disruption of the production workload. So anything that can be done to optimize any part of the overall profiling process is a benefit. But while for big servers tolerance might be higher in terms of re-opening and re-parsing a bunch of text files, we also have use cases on much less powerful and very performance sensitive Oculus VR devices, for example. There, any extra piece of work is scrutinized, so having to parse text on those relatively weak devices does add up. Enough to spend effort to optimize text parsing in blazesym's Rust code (see [1] for recent improvements). [1] https://github.com/libbpf/blazesym/pull/643/commits/b89b91b42b994b135a0079bf04b2319c0054f745 > > > This patch aims to solve this problem by providing a new API built on > > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary > > interface, avoiding the cost and awkwardness of textual representation > > for programmatic use. > > Some people find text easier to handle for programmatic use :) I don't disagree, but pretty much everyone I discussed having to deal with text-based kernel APIs are pretty uniformly in favor of binary-based interfaces, if they are available. But note, I'm not proposing to deprecate or remove text-based /proc/<pid>/maps. And the main point of this work is not so much binary vs text, as more selecting "point-based" querying capability as opposed to the "iterate everything" approach of /proc/<pid>/maps. > > > It's designed to be extensible and > > forward/backward compatible by including user-specified field size and > > using copy_struct_from_user() approach. But, most importantly, it allows > > to do point queries for specific single address, specified by user. And > > this is done efficiently using VMA iterator. > > Ok, maybe this is the main issue, you only want one at a time? Yes. More or less, I need "a few" that cover a captured set of addresses. > > > User has a choice to pick either getting VMA that covers provided > > address or -ENOENT if none is found (exact, least surprising, case). Or, > > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can > > get either VMA that covers the address (if there is one), or the closest > > next VMA (i.e., VMA with the smallest vm_start > addr). The later allows > > more efficient use, but, given it could be a surprising behavior, > > requires an explicit opt-in. > > > > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes > > sense given it's querying the same set of VMA data. All the permissions > > checks performed on /proc/<pid>/maps opening fit here as well. > > ioctl-based implementation is fetching remembered mm_struct reference, > > but otherwise doesn't interfere with seq_file-based implementation of > > /proc/<pid>/maps textual interface, and so could be used together or > > independently without paying any price for that. > > > > There is one extra thing that /proc/<pid>/maps doesn't currently > > provide, and that's an ability to fetch ELF build ID, if present. User > > has control over whether this piece of information is requested or not > > by either setting build_id_size field to zero or non-zero maximum buffer > > size they provided through build_id_addr field (which encodes user > > pointer as __u64 field). > > > > The need to get ELF build ID reliably is an important aspect when > > dealing with profiling and stack trace symbolization, and > > /proc/<pid>/maps textual representation doesn't help with this, > > requiring applications to open underlying ELF binary through > > /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra > > permissions implications due giving a full access to the binary from > > (potentially) another process, while all application is interested in is > > build ID. Giving an ability to request just build ID doesn't introduce > > any additional security concerns, on top of what /proc/<pid>/maps is > > already concerned with, simplifying the overall logic. > > > > Kernel already implements build ID fetching, which is used from BPF > > subsystem. We are reusing this code here, but plan a follow up changes > > to make it work better under more relaxed assumption (compared to what > > existing code assumes) of being called from user process context, in > > which page faults are allowed. BPF-specific implementation currently > > bails out if necessary part of ELF file is not paged in, all due to > > extra BPF-specific restrictions (like the need to fetch build ID in > > restrictive contexts such as NMI handler). > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > hard-coded or user-provided names) is optional just like build ID. If > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > it, saving resources. > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > Where is the userspace code that uses this new api you have created? So I added a faithful comparison of existing /proc/<pid>/maps vs new ioctl() API to solve a common problem (as described above) in patch #5. The plan is to put it in mentioned blazesym library at the very least. I'm sure perf would benefit from this as well (cc'ed Arnaldo and linux-perf-user), as they need to do stack symbolization as well. It will be up to other similar projects to adopt this, but we'll definitely get this into blazesym as it is actually a problem for the abovementioned Oculus use case. We already had to make a tradeoff (see [2], this wasn't done just because we could, but it was requested by Oculus customers) to cache the contents of /proc/<pid>/maps and run the risk of missing some shared libraries that can be loaded later. It would be great to not have to do this tradeoff, which this new API would enable. [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > --- > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > include/uapi/linux/fs.h | 32 ++++++++ > > 2 files changed, 197 insertions(+) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 8e503a1635b7..cb7b1ff1a144 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -22,6 +22,7 @@ > > #include <linux/pkeys.h> > > #include <linux/minmax.h> > > #include <linux/overflow.h> > > +#include <linux/buildid.h> > > > > #include <asm/elf.h> > > #include <asm/tlb.h> > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > return do_maps_open(inode, file, &proc_pid_maps_op); > > } > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > +{ > > + struct procfs_procmap_query karg; > > + struct vma_iterator iter; > > + struct vm_area_struct *vma; > > + struct mm_struct *mm; > > + const char *name = NULL; > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > + __u64 usize; > > + int err; > > + > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > + return -EFAULT; > > + if (usize > PAGE_SIZE) > > Nice, where did you document that? And how is that portable given that > PAGE_SIZE can be different on different systems? I'm happy to document everything, can you please help by pointing where this documentation has to live? This is mostly fool-proofing, though, because the user has to pass sizeof(struct procfs_procmap_query), which I don't see ever getting close to even 4KB (not even saying about 64KB). This is just to prevent copy_struct_from_user() below to do too much zero-checking. > > and why aren't you checking the actual structure size instead? You can > easily run off the end here without knowing it. See copy_struct_from_user(), it does more checks. This is a helper designed specifically to deal with use cases like this where kernel struct size can change and user space might be newer or older. copy_struct_from_user() has a nice documentation describing all these nuances. > > > + return -E2BIG; > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > + return -EINVAL; > > Ok, so you have two checks? How can the first one ever fail? Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE won't fail, but this one will fail. The point of this check is that user has to specify at least first three fields of procfs_procmap_query (size, query_flags, and query_addr), because without those the query is meaningless. > > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); and this helper does more checks validating that the user either has a shorter struct (and then zero-fills the rest of kernel-side struct) or has longer (and then the longer part has to be zero filled). Do check copy_struct_from_user() documentation, it's great. > > + if (err) > > + return err; > > + > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > + return -EINVAL; > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > + return -EINVAL; > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > + return -EINVAL; > > So you want values to be set, right? Either both should be set, or neither. It's ok for both size/addr fields to be zero, in which case it indicates that the user doesn't want this part of information (which is usually a bit more expensive to get and might not be necessary for all the cases). > > > + > > + mm = priv->mm; > > + if (!mm || !mmget_not_zero(mm)) > > + return -ESRCH; > > What is this error for? Where is this documentned? I copied it from existing /proc/<pid>/maps checks. I presume it's guarding the case when mm might be already put. So if the process is gone, but we have /proc/<pid>/maps file open? > > > + if (mmap_read_lock_killable(mm)) { > > + mmput(mm); > > + return -EINTR; > > + } > > + > > + vma_iter_init(&iter, mm, karg.query_addr); > > + vma = vma_next(&iter); > > + if (!vma) { > > + err = -ENOENT; > > + goto out; > > + } > > + /* user wants covering VMA, not the closest next one */ > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > + vma->vm_start > karg.query_addr) { > > + err = -ENOENT; > > + goto out; > > + } > > + > > + karg.vma_start = vma->vm_start; > > + karg.vma_end = vma->vm_end; > > + > > + if (vma->vm_file) { > > + const struct inode *inode = file_user_inode(vma->vm_file); > > + > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > > + karg.dev_minor = MINOR(inode->i_sb->s_dev); > > So the major/minor is that of the file superblock? Why? Because inode number is unique only within given super block (and even then it's more complicated, e.g., btrfs subvolumes add more headaches, I believe). inode + dev maj/min is sometimes used for cache/reuse of per-binary information (e.g., pre-processed DWARF information, which is *very* expensive, so anything that allows to avoid doing this is helpful). > > > + karg.inode = inode->i_ino; > > What is userspace going to do with this? > See above. > > + } else { > > + karg.vma_offset = 0; > > + karg.dev_major = 0; > > + karg.dev_minor = 0; > > + karg.inode = 0; > > Why not set everything to 0 up above at the beginning so you never miss > anything, and you don't miss any holes accidentally in the future. > Stylistic preference, I find this more explicit, but I don't care much one way or another. > > + } > > + > > + karg.vma_flags = 0; > > + if (vma->vm_flags & VM_READ) > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > > + if (vma->vm_flags & VM_WRITE) > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > > + if (vma->vm_flags & VM_EXEC) > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > > + if (vma->vm_flags & VM_MAYSHARE) > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > > + [...] > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > index 45e4e64fd664..fe8924a8d916 100644 > > --- a/include/uapi/linux/fs.h > > +++ b/include/uapi/linux/fs.h > > @@ -393,4 +393,36 @@ struct pm_scan_arg { > > __u64 return_mask; > > }; > > > > +/* /proc/<pid>/maps ioctl */ > > +#define PROCFS_IOCTL_MAGIC 0x9f > > Don't you need to document this in the proper place? I probably do, but I'm asking for help in knowing where. procfs is not a typical area of kernel I'm working with, so any pointers are highly appreciated. > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > > + > > +enum procmap_query_flags { > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > > +}; > > + > > +enum procmap_vma_flags { > > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > > Are these bits? If so, please use the bit macro for it to make it > obvious. > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to add any extra #includes to this UAPI header, but I can figure out the necessary dependency and do BIT(), I just didn't feel like BIT() adds much here, tbh. > > +}; > > + > > +struct procfs_procmap_query { > > + __u64 size; > > + __u64 query_flags; /* in */ > > Does this map to the procmap_vma_flags enum? if so, please say so. no, procmap_query_flags, and yes, I will > > > + __u64 query_addr; /* in */ > > + __u64 vma_start; /* out */ > > + __u64 vma_end; /* out */ > > + __u64 vma_flags; /* out */ > > + __u64 vma_offset; /* out */ > > + __u64 inode; /* out */ > > What is the inode for, you have an inode for the file already, why give > it another one? This is inode of vma's backing file, same as /proc/<pid>/maps' file column. What inode of file do I already have here? You mean of /proc/<pid>/maps itself? It's useless for the intended purposes. > > > + __u32 dev_major; /* out */ > > + __u32 dev_minor; /* out */ > > What is major/minor for? This is the same information as emitted by /proc/<pid>/maps, identifies superblock of vma's backing file. As I mentioned above, it can be used for caching per-file (i.e., per-ELF binary) information (for example). > > > + __u32 vma_name_size; /* in/out */ > > + __u32 build_id_size; /* in/out */ > > + __u64 vma_name_addr; /* in */ > > + __u64 build_id_addr; /* in */ > > Why not document this all using kerneldoc above the structure? Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be figuring out the best place and approach, and so wanted to avoid documentation churn. Would something like what we have for pm_scan_arg and pagemap APIs work? I see it added a few simple descriptions for pm_scan_arg struct, and there is Documentation/admin-guide/mm/pagemap.rst. Should I add Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off, though)? Anyways, I'm hoping for pointers where all this should be documented. Thank you! > > anyway, I don't like ioctls, but there is a place for them, you just > have to actually justify the use for them and not say "not efficient > enough" as that normally isn't an issue overall. I've written a demo tool in patch #5 which performs real-world task: mapping addresses to their VMAs (specifically calculating file offset, finding vma_start + vma_end range to further access files from /proc/<pid>/map_files/<start>-<end>). I did the implementation faithfully, doing it in the most optimal way for both APIs. I showed that for "typical" (it's hard to specify what typical is, of course, too many variables) scenario (it was data collected on a real server running real service, 30 seconds of process-specific stack traces were captured, if I remember correctly). I showed that doing exactly the same amount of work is ~35x times slower with /proc/<pid>/maps. Take another process, another set of addresses, another anything, and the numbers will be different, but I think it gives the right idea. But I think we are overpivoting on text vs binary distinction here. It's the more targeted querying of VMAs that's beneficial here. This allows applications to not cache anything and just re-query when doing periodic or continuous profiling (where addresses are coming in not as one batch, as a sequence of batches extended in time). /proc/<pid>/maps, for all its usefulness, just can't provide this sort of ability, as it wasn't designed to do that and is targeting different use cases. And then, a new ability to request reliable (it's not 100% reliable today, I'm going to address that as a follow up) build ID is *crucial* for some scenarios. The mentioned Oculus use case, the need to fully access underlying ELF binary just to get build ID is frowned upon. And for a good reason. Profiler only needs build ID, which is no secret and not sensitive information. This new (and binary, yes) API allows to add this into an API without breaking any backwards compatibility. > > thanks, > > greg k-h
Hi Andrii, kernel test robot noticed the following build warnings: [auto build test WARNING on next-20240503] [also build test WARNING on v6.9-rc6] [cannot apply to bpf-next/master bpf/master perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools brauner-vfs/vfs.all linus/master acme/perf/core v6.9-rc6 v6.9-rc5 v6.9-rc4] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/fs-procfs-extract-logic-for-getting-VMA-name-constituents/20240504-083146 base: next-20240503 patch link: https://lore.kernel.org/r/20240504003006.3303334-3-andrii%40kernel.org patch subject: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240505/202405050750.5oyajnPF-lkp@intel.com/config) compiler: or1k-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240505/202405050750.5oyajnPF-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202405050750.5oyajnPF-lkp@intel.com/ All warnings (new ones prefixed by >>): fs/proc/task_mmu.c: In function 'do_procmap_query': >> fs/proc/task_mmu.c:505:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 505 | if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr, | ^ fs/proc/task_mmu.c:512:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 512 | if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr, | ^ vim +505 fs/proc/task_mmu.c 378 379 static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) 380 { 381 struct procfs_procmap_query karg; 382 struct vma_iterator iter; 383 struct vm_area_struct *vma; 384 struct mm_struct *mm; 385 const char *name = NULL; 386 char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; 387 __u64 usize; 388 int err; 389 390 if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) 391 return -EFAULT; 392 if (usize > PAGE_SIZE) 393 return -E2BIG; 394 if (usize < offsetofend(struct procfs_procmap_query, query_addr)) 395 return -EINVAL; 396 err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); 397 if (err) 398 return err; 399 400 if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) 401 return -EINVAL; 402 if (!!karg.vma_name_size != !!karg.vma_name_addr) 403 return -EINVAL; 404 if (!!karg.build_id_size != !!karg.build_id_addr) 405 return -EINVAL; 406 407 mm = priv->mm; 408 if (!mm || !mmget_not_zero(mm)) 409 return -ESRCH; 410 if (mmap_read_lock_killable(mm)) { 411 mmput(mm); 412 return -EINTR; 413 } 414 415 vma_iter_init(&iter, mm, karg.query_addr); 416 vma = vma_next(&iter); 417 if (!vma) { 418 err = -ENOENT; 419 goto out; 420 } 421 /* user wants covering VMA, not the closest next one */ 422 if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && 423 vma->vm_start > karg.query_addr) { 424 err = -ENOENT; 425 goto out; 426 } 427 428 karg.vma_start = vma->vm_start; 429 karg.vma_end = vma->vm_end; 430 431 if (vma->vm_file) { 432 const struct inode *inode = file_user_inode(vma->vm_file); 433 434 karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; 435 karg.dev_major = MAJOR(inode->i_sb->s_dev); 436 karg.dev_minor = MINOR(inode->i_sb->s_dev); 437 karg.inode = inode->i_ino; 438 } else { 439 karg.vma_offset = 0; 440 karg.dev_major = 0; 441 karg.dev_minor = 0; 442 karg.inode = 0; 443 } 444 445 karg.vma_flags = 0; 446 if (vma->vm_flags & VM_READ) 447 karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; 448 if (vma->vm_flags & VM_WRITE) 449 karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; 450 if (vma->vm_flags & VM_EXEC) 451 karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; 452 if (vma->vm_flags & VM_MAYSHARE) 453 karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; 454 455 if (karg.build_id_size) { 456 __u32 build_id_sz = BUILD_ID_SIZE_MAX; 457 458 err = build_id_parse(vma, build_id_buf, &build_id_sz); 459 if (!err) { 460 if (karg.build_id_size < build_id_sz) { 461 err = -ENAMETOOLONG; 462 goto out; 463 } 464 karg.build_id_size = build_id_sz; 465 } 466 } 467 468 if (karg.vma_name_size) { 469 size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); 470 const struct path *path; 471 const char *name_fmt; 472 size_t name_sz = 0; 473 474 get_vma_name(vma, &path, &name, &name_fmt); 475 476 if (path || name_fmt || name) { 477 name_buf = kmalloc(name_buf_sz, GFP_KERNEL); 478 if (!name_buf) { 479 err = -ENOMEM; 480 goto out; 481 } 482 } 483 if (path) { 484 name = d_path(path, name_buf, name_buf_sz); 485 if (IS_ERR(name)) { 486 err = PTR_ERR(name); 487 goto out; 488 } 489 name_sz = name_buf + name_buf_sz - name; 490 } else if (name || name_fmt) { 491 name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name); 492 name = name_buf; 493 } 494 if (name_sz > name_buf_sz) { 495 err = -ENAMETOOLONG; 496 goto out; 497 } 498 karg.vma_name_size = name_sz; 499 } 500 501 /* unlock and put mm_struct before copying data to user */ 502 mmap_read_unlock(mm); 503 mmput(mm); 504 > 505 if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr, 506 name, karg.vma_name_size)) { 507 kfree(name_buf); 508 return -EFAULT; 509 } 510 kfree(name_buf); 511 512 if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr, 513 build_id_buf, karg.build_id_size)) 514 return -EFAULT; 515 516 if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) 517 return -EFAULT; 518 519 return 0; 520 521 out: 522 mmap_read_unlock(mm); 523 mmput(mm); 524 kfree(name_buf); 525 return err; 526 } 527
On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > hard-coded or user-provided names) is optional just like build ID. If > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > it, saving resources. > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > Where is the userspace code that uses this new api you have created? > So I added a faithful comparison of existing /proc/<pid>/maps vs new > ioctl() API to solve a common problem (as described above) in patch > #5. The plan is to put it in mentioned blazesym library at the very > least. > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > linux-perf-user), as they need to do stack symbolization as well. At some point, when BPF iterators became a thing we thought about, IIRC Jiri did some experimentation, but I lost track, of using BPF to synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout as in uapi/linux/perf_event.h: /* * The MMAP2 records are an augmented version of MMAP, they add * maj, min, ino numbers to be used to uniquely identify each mapping * * struct { * struct perf_event_header header; * * u32 pid, tid; * u64 addr; * u64 len; * u64 pgoff; * union { * struct { * u32 maj; * u32 min; * u64 ino; * u64 ino_generation; * }; * struct { * u8 build_id_size; * u8 __reserved_1; * u16 __reserved_2; * u8 build_id[20]; * }; * }; * u32 prot, flags; * char filename[]; * struct sample_id sample_id; * }; */ PERF_RECORD_MMAP2 = 10, * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event As perf.data files can be used for many purposes we want them all, so we setup a meta data perf file descriptor to go on receiving the new mmaps while we read /proc/<pid>/maps, to reduce the chance of missing maps, do it in parallel, etc: ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' Usage: perf record [<options>] [<command>] or: perf record [<options>] -- <command> [<options>] --num-thread-synthesize <n> number of threads to run for event synthesis --synth <no|all|task|mmap|cgroup> Fine-tune event synthesis: default=all ⬢[acme@toolbox perf-tools-next]$ For this specific initial synthesis of everything the plan, as mentioned about Jiri's experiments, was to use a BPF iterator to just feed the perf ring buffer with those events, that way userspace would just receive the usual records it gets when a new mmap is put in place, the BPF iterator would just feed the preexisting mmaps, as instructed via the perf_event_attr for the perf_event_open syscall. For people not wanting BPF, i.e. disabling it altogether in perf or disabling just BPF skels, then we would fallback to the current method, or to the one being discussed here when it becomes available. One thing to have in mind is for this iterator not to generate duplicate records for non-pre-existing mmaps, i.e. we would need some generation number that would be bumped when asking for such pre-existing maps PERF_RECORD_MMAP2 dumps. > It will be up to other similar projects to adopt this, but we'll > definitely get this into blazesym as it is actually a problem for the At some point looking at plugging blazesym somehow with perf may be something to consider, indeed. - Arnaldo > abovementioned Oculus use case. We already had to make a tradeoff (see > [2], this wasn't done just because we could, but it was requested by > Oculus customers) to cache the contents of /proc/<pid>/maps and run > the risk of missing some shared libraries that can be loaded later. It > would be great to not have to do this tradeoff, which this new API > would enable. > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > > > > --- > > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > > include/uapi/linux/fs.h | 32 ++++++++ > > > 2 files changed, 197 insertions(+) > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index 8e503a1635b7..cb7b1ff1a144 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -22,6 +22,7 @@ > > > #include <linux/pkeys.h> > > > #include <linux/minmax.h> > > > #include <linux/overflow.h> > > > +#include <linux/buildid.h> > > > > > > #include <asm/elf.h> > > > #include <asm/tlb.h> > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > return do_maps_open(inode, file, &proc_pid_maps_op); > > > } > > > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > > +{ > > > + struct procfs_procmap_query karg; > > > + struct vma_iterator iter; > > > + struct vm_area_struct *vma; > > > + struct mm_struct *mm; > > > + const char *name = NULL; > > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > > + __u64 usize; > > > + int err; > > > + > > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > > + return -EFAULT; > > > + if (usize > PAGE_SIZE) > > > > Nice, where did you document that? And how is that portable given that > > PAGE_SIZE can be different on different systems? > > I'm happy to document everything, can you please help by pointing > where this documentation has to live? > > This is mostly fool-proofing, though, because the user has to pass > sizeof(struct procfs_procmap_query), which I don't see ever getting > close to even 4KB (not even saying about 64KB). This is just to > prevent copy_struct_from_user() below to do too much zero-checking. > > > > > and why aren't you checking the actual structure size instead? You can > > easily run off the end here without knowing it. > > See copy_struct_from_user(), it does more checks. This is a helper > designed specifically to deal with use cases like this where kernel > struct size can change and user space might be newer or older. > copy_struct_from_user() has a nice documentation describing all these > nuances. > > > > > > + return -E2BIG; > > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > > + return -EINVAL; > > > > Ok, so you have two checks? How can the first one ever fail? > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE > won't fail, but this one will fail. > > The point of this check is that user has to specify at least first > three fields of procfs_procmap_query (size, query_flags, and > query_addr), because without those the query is meaningless. > > > > > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > and this helper does more checks validating that the user either has a > shorter struct (and then zero-fills the rest of kernel-side struct) or > has longer (and then the longer part has to be zero filled). Do check > copy_struct_from_user() documentation, it's great. > > > > + if (err) > > > + return err; > > > + > > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > > + return -EINVAL; > > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > > + return -EINVAL; > > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > > + return -EINVAL; > > > > So you want values to be set, right? > > Either both should be set, or neither. It's ok for both size/addr > fields to be zero, in which case it indicates that the user doesn't > want this part of information (which is usually a bit more expensive > to get and might not be necessary for all the cases). > > > > > > + > > > + mm = priv->mm; > > > + if (!mm || !mmget_not_zero(mm)) > > > + return -ESRCH; > > > > What is this error for? Where is this documentned? > > I copied it from existing /proc/<pid>/maps checks. I presume it's > guarding the case when mm might be already put. So if the process is > gone, but we have /proc/<pid>/maps file open? > > > > > > + if (mmap_read_lock_killable(mm)) { > > > + mmput(mm); > > > + return -EINTR; > > > + } > > > + > > > + vma_iter_init(&iter, mm, karg.query_addr); > > > + vma = vma_next(&iter); > > > + if (!vma) { > > > + err = -ENOENT; > > > + goto out; > > > + } > > > + /* user wants covering VMA, not the closest next one */ > > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > > + vma->vm_start > karg.query_addr) { > > > + err = -ENOENT; > > > + goto out; > > > + } > > > + > > > + karg.vma_start = vma->vm_start; > > > + karg.vma_end = vma->vm_end; > > > + > > > + if (vma->vm_file) { > > > + const struct inode *inode = file_user_inode(vma->vm_file); > > > + > > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > > > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > > > + karg.dev_minor = MINOR(inode->i_sb->s_dev); > > > > So the major/minor is that of the file superblock? Why? > > Because inode number is unique only within given super block (and even > then it's more complicated, e.g., btrfs subvolumes add more headaches, > I believe). inode + dev maj/min is sometimes used for cache/reuse of > per-binary information (e.g., pre-processed DWARF information, which > is *very* expensive, so anything that allows to avoid doing this is > helpful). > > > > > > + karg.inode = inode->i_ino; > > > > What is userspace going to do with this? > > > > See above. > > > > + } else { > > > + karg.vma_offset = 0; > > > + karg.dev_major = 0; > > > + karg.dev_minor = 0; > > > + karg.inode = 0; > > > > Why not set everything to 0 up above at the beginning so you never miss > > anything, and you don't miss any holes accidentally in the future. > > > > Stylistic preference, I find this more explicit, but I don't care much > one way or another. > > > > + } > > > + > > > + karg.vma_flags = 0; > > > + if (vma->vm_flags & VM_READ) > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > > > + if (vma->vm_flags & VM_WRITE) > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > > > + if (vma->vm_flags & VM_EXEC) > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > > > + if (vma->vm_flags & VM_MAYSHARE) > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > > > + > > [...] > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > > index 45e4e64fd664..fe8924a8d916 100644 > > > --- a/include/uapi/linux/fs.h > > > +++ b/include/uapi/linux/fs.h > > > @@ -393,4 +393,36 @@ struct pm_scan_arg { > > > __u64 return_mask; > > > }; > > > > > > +/* /proc/<pid>/maps ioctl */ > > > +#define PROCFS_IOCTL_MAGIC 0x9f > > > > Don't you need to document this in the proper place? > > I probably do, but I'm asking for help in knowing where. procfs is not > a typical area of kernel I'm working with, so any pointers are highly > appreciated. > > > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > > > + > > > +enum procmap_query_flags { > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > > > +}; > > > + > > > +enum procmap_vma_flags { > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > > > > Are these bits? If so, please use the bit macro for it to make it > > obvious. > > > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to > add any extra #includes to this UAPI header, but I can figure out the > necessary dependency and do BIT(), I just didn't feel like BIT() adds > much here, tbh. > > > > +}; > > > + > > > +struct procfs_procmap_query { > > > + __u64 size; > > > + __u64 query_flags; /* in */ > > > > Does this map to the procmap_vma_flags enum? if so, please say so. > > no, procmap_query_flags, and yes, I will > > > > > > + __u64 query_addr; /* in */ > > > + __u64 vma_start; /* out */ > > > + __u64 vma_end; /* out */ > > > + __u64 vma_flags; /* out */ > > > + __u64 vma_offset; /* out */ > > > + __u64 inode; /* out */ > > > > What is the inode for, you have an inode for the file already, why give > > it another one? > > This is inode of vma's backing file, same as /proc/<pid>/maps' file > column. What inode of file do I already have here? You mean of > /proc/<pid>/maps itself? It's useless for the intended purposes. > > > > > > + __u32 dev_major; /* out */ > > > + __u32 dev_minor; /* out */ > > > > What is major/minor for? > > This is the same information as emitted by /proc/<pid>/maps, > identifies superblock of vma's backing file. As I mentioned above, it > can be used for caching per-file (i.e., per-ELF binary) information > (for example). > > > > > > + __u32 vma_name_size; /* in/out */ > > > + __u32 build_id_size; /* in/out */ > > > + __u64 vma_name_addr; /* in */ > > > + __u64 build_id_addr; /* in */ > > > > Why not document this all using kerneldoc above the structure? > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be > figuring out the best place and approach, and so wanted to avoid > documentation churn. > > Would something like what we have for pm_scan_arg and pagemap APIs > work? I see it added a few simple descriptions for pm_scan_arg struct, > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off, > though)? Anyways, I'm hoping for pointers where all this should be > documented. Thank you! > > > > > anyway, I don't like ioctls, but there is a place for them, you just > > have to actually justify the use for them and not say "not efficient > > enough" as that normally isn't an issue overall. > > I've written a demo tool in patch #5 which performs real-world task: > mapping addresses to their VMAs (specifically calculating file offset, > finding vma_start + vma_end range to further access files from > /proc/<pid>/map_files/<start>-<end>). I did the implementation > faithfully, doing it in the most optimal way for both APIs. I showed > that for "typical" (it's hard to specify what typical is, of course, > too many variables) scenario (it was data collected on a real server > running real service, 30 seconds of process-specific stack traces were > captured, if I remember correctly). I showed that doing exactly the > same amount of work is ~35x times slower with /proc/<pid>/maps. > > Take another process, another set of addresses, another anything, and > the numbers will be different, but I think it gives the right idea. > > But I think we are overpivoting on text vs binary distinction here. > It's the more targeted querying of VMAs that's beneficial here. This > allows applications to not cache anything and just re-query when doing > periodic or continuous profiling (where addresses are coming in not as > one batch, as a sequence of batches extended in time). > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort > of ability, as it wasn't designed to do that and is targeting > different use cases. > > And then, a new ability to request reliable (it's not 100% reliable > today, I'm going to address that as a follow up) build ID is *crucial* > for some scenarios. The mentioned Oculus use case, the need to fully > access underlying ELF binary just to get build ID is frowned upon. And > for a good reason. Profiler only needs build ID, which is no secret > and not sensitive information. This new (and binary, yes) API allows > to add this into an API without breaking any backwards compatibility. > > > > > thanks, > > > > greg k-h
Hello, On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > it, saving resources. > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > Where is the userspace code that uses this new api you have created? > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > ioctl() API to solve a common problem (as described above) in patch > > #5. The plan is to put it in mentioned blazesym library at the very > > least. > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > linux-perf-user), as they need to do stack symbolization as well. I think the general use case in perf is different. This ioctl API is great for live tracing of a single (or a small number of) process(es). And yes, perf tools have those tracing use cases too. But I think the major use case of perf tools is system-wide profiling. For system-wide profiling, you need to process samples of many different processes at a high frequency. Now perf record doesn't process them and just save it for offline processing (well, it does at the end to find out build-ID but it can be omitted). Doing it online is possible (like perf top) but it would add more overhead during the profiling. And we cannot move processing or symbolization to the end of profiling because some (short- lived) tasks can go away. Also it should support perf report (offline) on data from a different kernel or even a different machine. So it saves the memory map of processes and symbolizes the stack trace with it later. Of course it needs to be updated as the memory map changes and that's why it tracks mmap or similar syscalls with PERF_RECORD_MMAP[2] records. A problem with this approach is to get the initial state of all (or a target for non-system-wide mode) existing processes. We call it synthesizing, and read /proc/PID/maps to generate the mmap records. I think the below comment from Arnaldo talked about how we can improve the synthesizing (which is sequential access to proc maps) using BPF. Thanks, Namhyung > > At some point, when BPF iterators became a thing we thought about, IIRC > Jiri did some experimentation, but I lost track, of using BPF to > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout > as in uapi/linux/perf_event.h: > > /* > * The MMAP2 records are an augmented version of MMAP, they add > * maj, min, ino numbers to be used to uniquely identify each mapping > * > * struct { > * struct perf_event_header header; > * > * u32 pid, tid; > * u64 addr; > * u64 len; > * u64 pgoff; > * union { > * struct { > * u32 maj; > * u32 min; > * u64 ino; > * u64 ino_generation; > * }; > * struct { > * u8 build_id_size; > * u8 __reserved_1; > * u16 __reserved_2; > * u8 build_id[20]; > * }; > * }; > * u32 prot, flags; > * char filename[]; > * struct sample_id sample_id; > * }; > */ > PERF_RECORD_MMAP2 = 10, > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event > > As perf.data files can be used for many purposes we want them all, so we > setup a meta data perf file descriptor to go on receiving the new mmaps > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do > it in parallel, etc: > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' > > Usage: perf record [<options>] [<command>] > or: perf record [<options>] -- <command> [<options>] > > --num-thread-synthesize <n> > number of threads to run for event synthesis > --synth <no|all|task|mmap|cgroup> > Fine-tune event synthesis: default=all > > ⬢[acme@toolbox perf-tools-next]$ > > For this specific initial synthesis of everything the plan, as mentioned > about Jiri's experiments, was to use a BPF iterator to just feed the > perf ring buffer with those events, that way userspace would just > receive the usual records it gets when a new mmap is put in place, the > BPF iterator would just feed the preexisting mmaps, as instructed via > the perf_event_attr for the perf_event_open syscall. > > For people not wanting BPF, i.e. disabling it altogether in perf or > disabling just BPF skels, then we would fallback to the current method, > or to the one being discussed here when it becomes available. > > One thing to have in mind is for this iterator not to generate duplicate > records for non-pre-existing mmaps, i.e. we would need some generation > number that would be bumped when asking for such pre-existing maps > PERF_RECORD_MMAP2 dumps. > > > It will be up to other similar projects to adopt this, but we'll > > definitely get this into blazesym as it is actually a problem for the > > At some point looking at plugging blazesym somehow with perf may be > something to consider, indeed. > > - Arnaldo > > > abovementioned Oculus use case. We already had to make a tradeoff (see > > [2], this wasn't done just because we could, but it was requested by > > Oculus customers) to cache the contents of /proc/<pid>/maps and run > > the risk of missing some shared libraries that can be loaded later. It > > would be great to not have to do this tradeoff, which this new API > > would enable. > > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > > > > > > > --- > > > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > > > include/uapi/linux/fs.h | 32 ++++++++ > > > > 2 files changed, 197 insertions(+) > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index 8e503a1635b7..cb7b1ff1a144 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > > @@ -22,6 +22,7 @@ > > > > #include <linux/pkeys.h> > > > > #include <linux/minmax.h> > > > > #include <linux/overflow.h> > > > > +#include <linux/buildid.h> > > > > > > > > #include <asm/elf.h> > > > > #include <asm/tlb.h> > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > return do_maps_open(inode, file, &proc_pid_maps_op); > > > > } > > > > > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > > > +{ > > > > + struct procfs_procmap_query karg; > > > > + struct vma_iterator iter; > > > > + struct vm_area_struct *vma; > > > > + struct mm_struct *mm; > > > > + const char *name = NULL; > > > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > > > + __u64 usize; > > > > + int err; > > > > + > > > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > > > + return -EFAULT; > > > > + if (usize > PAGE_SIZE) > > > > > > Nice, where did you document that? And how is that portable given that > > > PAGE_SIZE can be different on different systems? > > > > I'm happy to document everything, can you please help by pointing > > where this documentation has to live? > > > > This is mostly fool-proofing, though, because the user has to pass > > sizeof(struct procfs_procmap_query), which I don't see ever getting > > close to even 4KB (not even saying about 64KB). This is just to > > prevent copy_struct_from_user() below to do too much zero-checking. > > > > > > > > and why aren't you checking the actual structure size instead? You can > > > easily run off the end here without knowing it. > > > > See copy_struct_from_user(), it does more checks. This is a helper > > designed specifically to deal with use cases like this where kernel > > struct size can change and user space might be newer or older. > > copy_struct_from_user() has a nice documentation describing all these > > nuances. > > > > > > > > > + return -E2BIG; > > > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > > > + return -EINVAL; > > > > > > Ok, so you have two checks? How can the first one ever fail? > > > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE > > won't fail, but this one will fail. > > > > The point of this check is that user has to specify at least first > > three fields of procfs_procmap_query (size, query_flags, and > > query_addr), because without those the query is meaningless. > > > > > > > > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > > > and this helper does more checks validating that the user either has a > > shorter struct (and then zero-fills the rest of kernel-side struct) or > > has longer (and then the longer part has to be zero filled). Do check > > copy_struct_from_user() documentation, it's great. > > > > > > + if (err) > > > > + return err; > > > > + > > > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > > > + return -EINVAL; > > > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > > > + return -EINVAL; > > > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > > > + return -EINVAL; > > > > > > So you want values to be set, right? > > > > Either both should be set, or neither. It's ok for both size/addr > > fields to be zero, in which case it indicates that the user doesn't > > want this part of information (which is usually a bit more expensive > > to get and might not be necessary for all the cases). > > > > > > > > > + > > > > + mm = priv->mm; > > > > + if (!mm || !mmget_not_zero(mm)) > > > > + return -ESRCH; > > > > > > What is this error for? Where is this documentned? > > > > I copied it from existing /proc/<pid>/maps checks. I presume it's > > guarding the case when mm might be already put. So if the process is > > gone, but we have /proc/<pid>/maps file open? > > > > > > > > > + if (mmap_read_lock_killable(mm)) { > > > > + mmput(mm); > > > > + return -EINTR; > > > > + } > > > > + > > > > + vma_iter_init(&iter, mm, karg.query_addr); > > > > + vma = vma_next(&iter); > > > > + if (!vma) { > > > > + err = -ENOENT; > > > > + goto out; > > > > + } > > > > + /* user wants covering VMA, not the closest next one */ > > > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > > > + vma->vm_start > karg.query_addr) { > > > > + err = -ENOENT; > > > > + goto out; > > > > + } > > > > + > > > > + karg.vma_start = vma->vm_start; > > > > + karg.vma_end = vma->vm_end; > > > > + > > > > + if (vma->vm_file) { > > > > + const struct inode *inode = file_user_inode(vma->vm_file); > > > > + > > > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > > > > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > > > > + karg.dev_minor = MINOR(inode->i_sb->s_dev); > > > > > > So the major/minor is that of the file superblock? Why? > > > > Because inode number is unique only within given super block (and even > > then it's more complicated, e.g., btrfs subvolumes add more headaches, > > I believe). inode + dev maj/min is sometimes used for cache/reuse of > > per-binary information (e.g., pre-processed DWARF information, which > > is *very* expensive, so anything that allows to avoid doing this is > > helpful). > > > > > > > > > + karg.inode = inode->i_ino; > > > > > > What is userspace going to do with this? > > > > > > > See above. > > > > > > + } else { > > > > + karg.vma_offset = 0; > > > > + karg.dev_major = 0; > > > > + karg.dev_minor = 0; > > > > + karg.inode = 0; > > > > > > Why not set everything to 0 up above at the beginning so you never miss > > > anything, and you don't miss any holes accidentally in the future. > > > > > > > Stylistic preference, I find this more explicit, but I don't care much > > one way or another. > > > > > > + } > > > > + > > > > + karg.vma_flags = 0; > > > > + if (vma->vm_flags & VM_READ) > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > > > > + if (vma->vm_flags & VM_WRITE) > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > > > > + if (vma->vm_flags & VM_EXEC) > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > > > > + if (vma->vm_flags & VM_MAYSHARE) > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > > > > + > > > > [...] > > > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > > > index 45e4e64fd664..fe8924a8d916 100644 > > > > --- a/include/uapi/linux/fs.h > > > > +++ b/include/uapi/linux/fs.h > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg { > > > > __u64 return_mask; > > > > }; > > > > > > > > +/* /proc/<pid>/maps ioctl */ > > > > +#define PROCFS_IOCTL_MAGIC 0x9f > > > > > > Don't you need to document this in the proper place? > > > > I probably do, but I'm asking for help in knowing where. procfs is not > > a typical area of kernel I'm working with, so any pointers are highly > > appreciated. > > > > > > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > > > > + > > > > +enum procmap_query_flags { > > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > > > > +}; > > > > + > > > > +enum procmap_vma_flags { > > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > > > > > > Are these bits? If so, please use the bit macro for it to make it > > > obvious. > > > > > > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to > > add any extra #includes to this UAPI header, but I can figure out the > > necessary dependency and do BIT(), I just didn't feel like BIT() adds > > much here, tbh. > > > > > > +}; > > > > + > > > > +struct procfs_procmap_query { > > > > + __u64 size; > > > > + __u64 query_flags; /* in */ > > > > > > Does this map to the procmap_vma_flags enum? if so, please say so. > > > > no, procmap_query_flags, and yes, I will > > > > > > > > > + __u64 query_addr; /* in */ > > > > + __u64 vma_start; /* out */ > > > > + __u64 vma_end; /* out */ > > > > + __u64 vma_flags; /* out */ > > > > + __u64 vma_offset; /* out */ > > > > + __u64 inode; /* out */ > > > > > > What is the inode for, you have an inode for the file already, why give > > > it another one? > > > > This is inode of vma's backing file, same as /proc/<pid>/maps' file > > column. What inode of file do I already have here? You mean of > > /proc/<pid>/maps itself? It's useless for the intended purposes. > > > > > > > > > + __u32 dev_major; /* out */ > > > > + __u32 dev_minor; /* out */ > > > > > > What is major/minor for? > > > > This is the same information as emitted by /proc/<pid>/maps, > > identifies superblock of vma's backing file. As I mentioned above, it > > can be used for caching per-file (i.e., per-ELF binary) information > > (for example). > > > > > > > > > + __u32 vma_name_size; /* in/out */ > > > > + __u32 build_id_size; /* in/out */ > > > > + __u64 vma_name_addr; /* in */ > > > > + __u64 build_id_addr; /* in */ > > > > > > Why not document this all using kerneldoc above the structure? > > > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be > > figuring out the best place and approach, and so wanted to avoid > > documentation churn. > > > > Would something like what we have for pm_scan_arg and pagemap APIs > > work? I see it added a few simple descriptions for pm_scan_arg struct, > > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add > > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off, > > though)? Anyways, I'm hoping for pointers where all this should be > > documented. Thank you! > > > > > > > > anyway, I don't like ioctls, but there is a place for them, you just > > > have to actually justify the use for them and not say "not efficient > > > enough" as that normally isn't an issue overall. > > > > I've written a demo tool in patch #5 which performs real-world task: > > mapping addresses to their VMAs (specifically calculating file offset, > > finding vma_start + vma_end range to further access files from > > /proc/<pid>/map_files/<start>-<end>). I did the implementation > > faithfully, doing it in the most optimal way for both APIs. I showed > > that for "typical" (it's hard to specify what typical is, of course, > > too many variables) scenario (it was data collected on a real server > > running real service, 30 seconds of process-specific stack traces were > > captured, if I remember correctly). I showed that doing exactly the > > same amount of work is ~35x times slower with /proc/<pid>/maps. > > > > Take another process, another set of addresses, another anything, and > > the numbers will be different, but I think it gives the right idea. > > > > But I think we are overpivoting on text vs binary distinction here. > > It's the more targeted querying of VMAs that's beneficial here. This > > allows applications to not cache anything and just re-query when doing > > periodic or continuous profiling (where addresses are coming in not as > > one batch, as a sequence of batches extended in time). > > > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort > > of ability, as it wasn't designed to do that and is targeting > > different use cases. > > > > And then, a new ability to request reliable (it's not 100% reliable > > today, I'm going to address that as a follow up) build ID is *crucial* > > for some scenarios. The mentioned Oculus use case, the need to fully > > access underlying ELF binary just to get build ID is frowned upon. And > > for a good reason. Profiler only needs build ID, which is no secret > > and not sensitive information. This new (and binary, yes) API allows > > to add this into an API without breaking any backwards compatibility. > > > > > > > > thanks, > > > > > > greg k-h >
On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > it, saving resources. > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > Where is the userspace code that uses this new api you have created? > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > ioctl() API to solve a common problem (as described above) in patch > > #5. The plan is to put it in mentioned blazesym library at the very > > least. > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > linux-perf-user), as they need to do stack symbolization as well. > > At some point, when BPF iterators became a thing we thought about, IIRC > Jiri did some experimentation, but I lost track, of using BPF to > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout > as in uapi/linux/perf_event.h: > > /* > * The MMAP2 records are an augmented version of MMAP, they add > * maj, min, ino numbers to be used to uniquely identify each mapping > * > * struct { > * struct perf_event_header header; > * > * u32 pid, tid; > * u64 addr; > * u64 len; > * u64 pgoff; > * union { > * struct { > * u32 maj; > * u32 min; > * u64 ino; > * u64 ino_generation; > * }; > * struct { > * u8 build_id_size; > * u8 __reserved_1; > * u16 __reserved_2; > * u8 build_id[20]; > * }; > * }; > * u32 prot, flags; > * char filename[]; > * struct sample_id sample_id; > * }; > */ > PERF_RECORD_MMAP2 = 10, > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event > > As perf.data files can be used for many purposes we want them all, so we ok, so because you want them all and you don't know which VMAs will be useful or not, it's a different problem. BPF iterators will be faster purely due to avoiding binary -> text -> binary conversion path, but other than that you'll still retrieve all VMAs. You can still do the same full VMA iteration with this new API, of course, but advantages are probably smaller as you'll be retrieving a full set of VMAs regardless (though it would be interesting to compare anyways). > setup a meta data perf file descriptor to go on receiving the new mmaps > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do > it in parallel, etc: > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' > > Usage: perf record [<options>] [<command>] > or: perf record [<options>] -- <command> [<options>] > > --num-thread-synthesize <n> > number of threads to run for event synthesis > --synth <no|all|task|mmap|cgroup> > Fine-tune event synthesis: default=all > > ⬢[acme@toolbox perf-tools-next]$ > > For this specific initial synthesis of everything the plan, as mentioned > about Jiri's experiments, was to use a BPF iterator to just feed the > perf ring buffer with those events, that way userspace would just > receive the usual records it gets when a new mmap is put in place, the > BPF iterator would just feed the preexisting mmaps, as instructed via > the perf_event_attr for the perf_event_open syscall. > > For people not wanting BPF, i.e. disabling it altogether in perf or > disabling just BPF skels, then we would fallback to the current method, > or to the one being discussed here when it becomes available. > > One thing to have in mind is for this iterator not to generate duplicate > records for non-pre-existing mmaps, i.e. we would need some generation > number that would be bumped when asking for such pre-existing maps > PERF_RECORD_MMAP2 dumps. Looking briefly at struct vm_area_struct, it doesn't seems like the kernel maintains any sort of generation (at least not at vm_area_struct level), so this would be nice to have, I'm sure, but isn't really related to adding this API. Once the kernel does have this "VMA generation" counter, it can be trivially added to this binary interface (which can't be said about /proc/<pid>/maps, unfortunately). > > > It will be up to other similar projects to adopt this, but we'll > > definitely get this into blazesym as it is actually a problem for the > > At some point looking at plugging blazesym somehow with perf may be > something to consider, indeed. In the above I meant direct use of this new API in perf code itself, but yes, blazesym is a generic library for symbolization that handles ELF/DWARF/GSYM (and I believe more formats), so it indeed might make sense to use it. > > - Arnaldo > > > abovementioned Oculus use case. We already had to make a tradeoff (see > > [2], this wasn't done just because we could, but it was requested by > > Oculus customers) to cache the contents of /proc/<pid>/maps and run > > the risk of missing some shared libraries that can be loaded later. It > > would be great to not have to do this tradeoff, which this new API > > would enable. > > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > [...]
On Mon, May 6, 2024 at 11:05 AM Namhyung Kim <namhyung@kernel.org> wrote: > > Hello, > > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > it, saving resources. > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > > Where is the userspace code that uses this new api you have created? > > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > > ioctl() API to solve a common problem (as described above) in patch > > > #5. The plan is to put it in mentioned blazesym library at the very > > > least. > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > linux-perf-user), as they need to do stack symbolization as well. > > I think the general use case in perf is different. This ioctl API is great > for live tracing of a single (or a small number of) process(es). And > yes, perf tools have those tracing use cases too. But I think the > major use case of perf tools is system-wide profiling. The intended use case is also a system-wide profiling, but I haven't heard that opening a file per process is a big bottleneck or a limitation, tbh. > > For system-wide profiling, you need to process samples of many > different processes at a high frequency. Now perf record doesn't > process them and just save it for offline processing (well, it does > at the end to find out build-ID but it can be omitted). > > Doing it online is possible (like perf top) but it would add more > overhead during the profiling. And we cannot move processing > or symbolization to the end of profiling because some (short- > lived) tasks can go away. We do have some setups where we install a BPF program that monitors process exit and mmap() events and emits (proactively) VMA information. It's not applicable everywhere, and in some setups (like Oculus case) we just accept that short-lived processes will be missed at the expense of less interruption, simpler and less privileged "agents" doing profiling and address resolution logic. So the problem space, as can be seen, is pretty vast and varied, and there is no single API that would serve all the needs perfectly. > > Also it should support perf report (offline) on data from a > different kernel or even a different machine. We fetch build ID (and resolve file offset) and offload actual symbolization to a dedicated fleet of servers, whenever possible. We don't yet do it for kernel stack traces, but we are moving in this direction (and there are their own problems with /proc/kallsyms being text-based, listing everything, and pretty big all in itself; but that's a separate topic). > > So it saves the memory map of processes and symbolizes > the stack trace with it later. Of course it needs to be updated > as the memory map changes and that's why it tracks mmap > or similar syscalls with PERF_RECORD_MMAP[2] records. > > A problem with this approach is to get the initial state of all > (or a target for non-system-wide mode) existing processes. > We call it synthesizing, and read /proc/PID/maps to generate > the mmap records. > > I think the below comment from Arnaldo talked about how > we can improve the synthesizing (which is sequential access > to proc maps) using BPF. Yep. We can also benchmark using this new ioctl() to fetch a full set of VMAs, it might still be good enough. > > Thanks, > Namhyung > [...]
On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote: > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > it, saving resources. > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > Where is the userspace code that uses this new api you have created? > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > > ioctl() API to solve a common problem (as described above) in patch > > > #5. The plan is to put it in mentioned blazesym library at the very > > > least. > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > linux-perf-user), as they need to do stack symbolization as well. > I think the general use case in perf is different. This ioctl API is great > for live tracing of a single (or a small number of) process(es). And > yes, perf tools have those tracing use cases too. But I think the > major use case of perf tools is system-wide profiling. > For system-wide profiling, you need to process samples of many > different processes at a high frequency. Now perf record doesn't > process them and just save it for offline processing (well, it does > at the end to find out build-ID but it can be omitted). Since: Author: Jiri Olsa <jolsa@kernel.org> Date: Mon Dec 14 11:54:49 2020 +0100 1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events") We don't need to to process the events to find the build ids. I haven't checked if we still do it to find out which DSOs had hits, but we shouldn't need to do it for build-ids (unless they were not in memory when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I haven't checked but IIRC is a possibility if that ELF part isn't in memory at the time we want to copy it). If we're still traversing it like that I guess we can have a knob and make it the default to not do that and instead create the perf.data build ID header table with all the build-ids we got from PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event processing at the end of a 'perf record' session. > Doing it online is possible (like perf top) but it would add more > overhead during the profiling. And we cannot move processing It comes in the PERF_RECORD_MMAP2, filled by the kernel. > or symbolization to the end of profiling because some (short- > lived) tasks can go away. right > Also it should support perf report (offline) on data from a > different kernel or even a different machine. right > So it saves the memory map of processes and symbolizes > the stack trace with it later. Of course it needs to be updated > as the memory map changes and that's why it tracks mmap > or similar syscalls with PERF_RECORD_MMAP[2] records. > A problem with this approach is to get the initial state of all > (or a target for non-system-wide mode) existing processes. > We call it synthesizing, and read /proc/PID/maps to generate > the mmap records. > I think the below comment from Arnaldo talked about how > we can improve the synthesizing (which is sequential access > to proc maps) using BPF. Yes, I wonder how far Jiri went, Jiri? - Arnaldo > Thanks, > Namhyung > > > > > > At some point, when BPF iterators became a thing we thought about, IIRC > > Jiri did some experimentation, but I lost track, of using BPF to > > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout > > as in uapi/linux/perf_event.h: > > > > /* > > * The MMAP2 records are an augmented version of MMAP, they add > > * maj, min, ino numbers to be used to uniquely identify each mapping > > * > > * struct { > > * struct perf_event_header header; > > * > > * u32 pid, tid; > > * u64 addr; > > * u64 len; > > * u64 pgoff; > > * union { > > * struct { > > * u32 maj; > > * u32 min; > > * u64 ino; > > * u64 ino_generation; > > * }; > > * struct { > > * u8 build_id_size; > > * u8 __reserved_1; > > * u16 __reserved_2; > > * u8 build_id[20]; > > * }; > > * }; > > * u32 prot, flags; > > * char filename[]; > > * struct sample_id sample_id; > > * }; > > */ > > PERF_RECORD_MMAP2 = 10, > > > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event > > > > As perf.data files can be used for many purposes we want them all, so we > > setup a meta data perf file descriptor to go on receiving the new mmaps > > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do > > it in parallel, etc: > > > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' > > > > Usage: perf record [<options>] [<command>] > > or: perf record [<options>] -- <command> [<options>] > > > > --num-thread-synthesize <n> > > number of threads to run for event synthesis > > --synth <no|all|task|mmap|cgroup> > > Fine-tune event synthesis: default=all > > > > ⬢[acme@toolbox perf-tools-next]$ > > > > For this specific initial synthesis of everything the plan, as mentioned > > about Jiri's experiments, was to use a BPF iterator to just feed the > > perf ring buffer with those events, that way userspace would just > > receive the usual records it gets when a new mmap is put in place, the > > BPF iterator would just feed the preexisting mmaps, as instructed via > > the perf_event_attr for the perf_event_open syscall. > > > > For people not wanting BPF, i.e. disabling it altogether in perf or > > disabling just BPF skels, then we would fallback to the current method, > > or to the one being discussed here when it becomes available. > > > > One thing to have in mind is for this iterator not to generate duplicate > > records for non-pre-existing mmaps, i.e. we would need some generation > > number that would be bumped when asking for such pre-existing maps > > PERF_RECORD_MMAP2 dumps. > > > > > It will be up to other similar projects to adopt this, but we'll > > > definitely get this into blazesym as it is actually a problem for the > > > > At some point looking at plugging blazesym somehow with perf may be > > something to consider, indeed. > > > > - Arnaldo > > > > > abovementioned Oculus use case. We already had to make a tradeoff (see > > > [2], this wasn't done just because we could, but it was requested by > > > Oculus customers) to cache the contents of /proc/<pid>/maps and run > > > the risk of missing some shared libraries that can be loaded later. It > > > would be great to not have to do this tradeoff, which this new API > > > would enable. > > > > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > > > > > > > > > > --- > > > > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > > > > include/uapi/linux/fs.h | 32 ++++++++ > > > > > 2 files changed, 197 insertions(+) > > > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > index 8e503a1635b7..cb7b1ff1a144 100644 > > > > > --- a/fs/proc/task_mmu.c > > > > > +++ b/fs/proc/task_mmu.c > > > > > @@ -22,6 +22,7 @@ > > > > > #include <linux/pkeys.h> > > > > > #include <linux/minmax.h> > > > > > #include <linux/overflow.h> > > > > > +#include <linux/buildid.h> > > > > > > > > > > #include <asm/elf.h> > > > > > #include <asm/tlb.h> > > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > > return do_maps_open(inode, file, &proc_pid_maps_op); > > > > > } > > > > > > > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > > > > +{ > > > > > + struct procfs_procmap_query karg; > > > > > + struct vma_iterator iter; > > > > > + struct vm_area_struct *vma; > > > > > + struct mm_struct *mm; > > > > > + const char *name = NULL; > > > > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > > > > + __u64 usize; > > > > > + int err; > > > > > + > > > > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > > > > + return -EFAULT; > > > > > + if (usize > PAGE_SIZE) > > > > > > > > Nice, where did you document that? And how is that portable given that > > > > PAGE_SIZE can be different on different systems? > > > > > > I'm happy to document everything, can you please help by pointing > > > where this documentation has to live? > > > > > > This is mostly fool-proofing, though, because the user has to pass > > > sizeof(struct procfs_procmap_query), which I don't see ever getting > > > close to even 4KB (not even saying about 64KB). This is just to > > > prevent copy_struct_from_user() below to do too much zero-checking. > > > > > > > > > > > and why aren't you checking the actual structure size instead? You can > > > > easily run off the end here without knowing it. > > > > > > See copy_struct_from_user(), it does more checks. This is a helper > > > designed specifically to deal with use cases like this where kernel > > > struct size can change and user space might be newer or older. > > > copy_struct_from_user() has a nice documentation describing all these > > > nuances. > > > > > > > > > > > > + return -E2BIG; > > > > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > > > > + return -EINVAL; > > > > > > > > Ok, so you have two checks? How can the first one ever fail? > > > > > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE > > > won't fail, but this one will fail. > > > > > > The point of this check is that user has to specify at least first > > > three fields of procfs_procmap_query (size, query_flags, and > > > query_addr), because without those the query is meaningless. > > > > > > > > > > > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > > > > > and this helper does more checks validating that the user either has a > > > shorter struct (and then zero-fills the rest of kernel-side struct) or > > > has longer (and then the longer part has to be zero filled). Do check > > > copy_struct_from_user() documentation, it's great. > > > > > > > > + if (err) > > > > > + return err; > > > > > + > > > > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > > > > + return -EINVAL; > > > > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > > > > + return -EINVAL; > > > > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > > > > + return -EINVAL; > > > > > > > > So you want values to be set, right? > > > > > > Either both should be set, or neither. It's ok for both size/addr > > > fields to be zero, in which case it indicates that the user doesn't > > > want this part of information (which is usually a bit more expensive > > > to get and might not be necessary for all the cases). > > > > > > > > > > > > + > > > > > + mm = priv->mm; > > > > > + if (!mm || !mmget_not_zero(mm)) > > > > > + return -ESRCH; > > > > > > > > What is this error for? Where is this documentned? > > > > > > I copied it from existing /proc/<pid>/maps checks. I presume it's > > > guarding the case when mm might be already put. So if the process is > > > gone, but we have /proc/<pid>/maps file open? > > > > > > > > > > > > + if (mmap_read_lock_killable(mm)) { > > > > > + mmput(mm); > > > > > + return -EINTR; > > > > > + } > > > > > + > > > > > + vma_iter_init(&iter, mm, karg.query_addr); > > > > > + vma = vma_next(&iter); > > > > > + if (!vma) { > > > > > + err = -ENOENT; > > > > > + goto out; > > > > > + } > > > > > + /* user wants covering VMA, not the closest next one */ > > > > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > > > > + vma->vm_start > karg.query_addr) { > > > > > + err = -ENOENT; > > > > > + goto out; > > > > > + } > > > > > + > > > > > + karg.vma_start = vma->vm_start; > > > > > + karg.vma_end = vma->vm_end; > > > > > + > > > > > + if (vma->vm_file) { > > > > > + const struct inode *inode = file_user_inode(vma->vm_file); > > > > > + > > > > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > > > > > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > > > > > + karg.dev_minor = MINOR(inode->i_sb->s_dev); > > > > > > > > So the major/minor is that of the file superblock? Why? > > > > > > Because inode number is unique only within given super block (and even > > > then it's more complicated, e.g., btrfs subvolumes add more headaches, > > > I believe). inode + dev maj/min is sometimes used for cache/reuse of > > > per-binary information (e.g., pre-processed DWARF information, which > > > is *very* expensive, so anything that allows to avoid doing this is > > > helpful). > > > > > > > > > > > > + karg.inode = inode->i_ino; > > > > > > > > What is userspace going to do with this? > > > > > > > > > > See above. > > > > > > > > + } else { > > > > > + karg.vma_offset = 0; > > > > > + karg.dev_major = 0; > > > > > + karg.dev_minor = 0; > > > > > + karg.inode = 0; > > > > > > > > Why not set everything to 0 up above at the beginning so you never miss > > > > anything, and you don't miss any holes accidentally in the future. > > > > > > > > > > Stylistic preference, I find this more explicit, but I don't care much > > > one way or another. > > > > > > > > + } > > > > > + > > > > > + karg.vma_flags = 0; > > > > > + if (vma->vm_flags & VM_READ) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > > > > > + if (vma->vm_flags & VM_WRITE) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > > > > > + if (vma->vm_flags & VM_EXEC) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > > > > > + if (vma->vm_flags & VM_MAYSHARE) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > > > > > + > > > > > > [...] > > > > > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > > > > index 45e4e64fd664..fe8924a8d916 100644 > > > > > --- a/include/uapi/linux/fs.h > > > > > +++ b/include/uapi/linux/fs.h > > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg { > > > > > __u64 return_mask; > > > > > }; > > > > > > > > > > +/* /proc/<pid>/maps ioctl */ > > > > > +#define PROCFS_IOCTL_MAGIC 0x9f > > > > > > > > Don't you need to document this in the proper place? > > > > > > I probably do, but I'm asking for help in knowing where. procfs is not > > > a typical area of kernel I'm working with, so any pointers are highly > > > appreciated. > > > > > > > > > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > > > > > + > > > > > +enum procmap_query_flags { > > > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > > > > > +}; > > > > > + > > > > > +enum procmap_vma_flags { > > > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > > > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > > > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > > > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > > > > > > > > Are these bits? If so, please use the bit macro for it to make it > > > > obvious. > > > > > > > > > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to > > > add any extra #includes to this UAPI header, but I can figure out the > > > necessary dependency and do BIT(), I just didn't feel like BIT() adds > > > much here, tbh. > > > > > > > > +}; > > > > > + > > > > > +struct procfs_procmap_query { > > > > > + __u64 size; > > > > > + __u64 query_flags; /* in */ > > > > > > > > Does this map to the procmap_vma_flags enum? if so, please say so. > > > > > > no, procmap_query_flags, and yes, I will > > > > > > > > > > > > + __u64 query_addr; /* in */ > > > > > + __u64 vma_start; /* out */ > > > > > + __u64 vma_end; /* out */ > > > > > + __u64 vma_flags; /* out */ > > > > > + __u64 vma_offset; /* out */ > > > > > + __u64 inode; /* out */ > > > > > > > > What is the inode for, you have an inode for the file already, why give > > > > it another one? > > > > > > This is inode of vma's backing file, same as /proc/<pid>/maps' file > > > column. What inode of file do I already have here? You mean of > > > /proc/<pid>/maps itself? It's useless for the intended purposes. > > > > > > > > > > > > + __u32 dev_major; /* out */ > > > > > + __u32 dev_minor; /* out */ > > > > > > > > What is major/minor for? > > > > > > This is the same information as emitted by /proc/<pid>/maps, > > > identifies superblock of vma's backing file. As I mentioned above, it > > > can be used for caching per-file (i.e., per-ELF binary) information > > > (for example). > > > > > > > > > > > > + __u32 vma_name_size; /* in/out */ > > > > > + __u32 build_id_size; /* in/out */ > > > > > + __u64 vma_name_addr; /* in */ > > > > > + __u64 build_id_addr; /* in */ > > > > > > > > Why not document this all using kerneldoc above the structure? > > > > > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be > > > figuring out the best place and approach, and so wanted to avoid > > > documentation churn. > > > > > > Would something like what we have for pm_scan_arg and pagemap APIs > > > work? I see it added a few simple descriptions for pm_scan_arg struct, > > > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add > > > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off, > > > though)? Anyways, I'm hoping for pointers where all this should be > > > documented. Thank you! > > > > > > > > > > > anyway, I don't like ioctls, but there is a place for them, you just > > > > have to actually justify the use for them and not say "not efficient > > > > enough" as that normally isn't an issue overall. > > > > > > I've written a demo tool in patch #5 which performs real-world task: > > > mapping addresses to their VMAs (specifically calculating file offset, > > > finding vma_start + vma_end range to further access files from > > > /proc/<pid>/map_files/<start>-<end>). I did the implementation > > > faithfully, doing it in the most optimal way for both APIs. I showed > > > that for "typical" (it's hard to specify what typical is, of course, > > > too many variables) scenario (it was data collected on a real server > > > running real service, 30 seconds of process-specific stack traces were > > > captured, if I remember correctly). I showed that doing exactly the > > > same amount of work is ~35x times slower with /proc/<pid>/maps. > > > > > > Take another process, another set of addresses, another anything, and > > > the numbers will be different, but I think it gives the right idea. > > > > > > But I think we are overpivoting on text vs binary distinction here. > > > It's the more targeted querying of VMAs that's beneficial here. This > > > allows applications to not cache anything and just re-query when doing > > > periodic or continuous profiling (where addresses are coming in not as > > > one batch, as a sequence of batches extended in time). > > > > > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort > > > of ability, as it wasn't designed to do that and is targeting > > > different use cases. > > > > > > And then, a new ability to request reliable (it's not 100% reliable > > > today, I'm going to address that as a follow up) build ID is *crucial* > > > for some scenarios. The mentioned Oculus use case, the need to fully > > > access underlying ELF binary just to get build ID is frowned upon. And > > > for a good reason. Profiler only needs build ID, which is no secret > > > and not sensitive information. This new (and binary, yes) API allows > > > to add this into an API without breaking any backwards compatibility. > > > > > > > > > > > thanks, > > > > > > > > greg k-h > >
On Mon, May 06, 2024 at 03:53:40PM -0300, Arnaldo Carvalho de Melo wrote: > On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote: > > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > > it, saving resources. > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > > Where is the userspace code that uses this new api you have created? > > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > > > ioctl() API to solve a common problem (as described above) in patch > > > > #5. The plan is to put it in mentioned blazesym library at the very > > > > least. > > > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > > linux-perf-user), as they need to do stack symbolization as well. > > > I think the general use case in perf is different. This ioctl API is great > > for live tracing of a single (or a small number of) process(es). And > > yes, perf tools have those tracing use cases too. But I think the > > major use case of perf tools is system-wide profiling. > > > For system-wide profiling, you need to process samples of many > > different processes at a high frequency. Now perf record doesn't > > process them and just save it for offline processing (well, it does > > at the end to find out build-ID but it can be omitted). > > Since: > > Author: Jiri Olsa <jolsa@kernel.org> > Date: Mon Dec 14 11:54:49 2020 +0100 > 1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events") > > We don't need to to process the events to find the build ids. I haven't > checked if we still do it to find out which DSOs had hits, but we > shouldn't need to do it for build-ids (unless they were not in memory > when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I > haven't checked but IIRC is a possibility if that ELF part isn't in > memory at the time we want to copy it). > If we're still traversing it like that I guess we can have a knob and > make it the default to not do that and instead create the perf.data > build ID header table with all the build-ids we got from > PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event > processing at the end of a 'perf record' session. But then we don't process the PERF_RECORD_MMAP2 in 'perf record', it just goes on directly to the perf.data file :-\ Humm, perhaps the sideband thread... - Arnaldo
On Mon, May 06, 2024 at 11:41:43AM -0700, Andrii Nakryiko wrote: > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > it, saving resources. > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > > Where is the userspace code that uses this new api you have created? > > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > > ioctl() API to solve a common problem (as described above) in patch > > > #5. The plan is to put it in mentioned blazesym library at the very > > > least. > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > linux-perf-user), as they need to do stack symbolization as well. > > > > At some point, when BPF iterators became a thing we thought about, IIRC > > Jiri did some experimentation, but I lost track, of using BPF to > > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout > > as in uapi/linux/perf_event.h: > > > > /* > > * The MMAP2 records are an augmented version of MMAP, they add > > * maj, min, ino numbers to be used to uniquely identify each mapping > > * > > * struct { > > * struct perf_event_header header; > > * > > * u32 pid, tid; > > * u64 addr; > > * u64 len; > > * u64 pgoff; > > * union { > > * struct { > > * u32 maj; > > * u32 min; > > * u64 ino; > > * u64 ino_generation; > > * }; > > * struct { > > * u8 build_id_size; > > * u8 __reserved_1; > > * u16 __reserved_2; > > * u8 build_id[20]; > > * }; > > * }; > > * u32 prot, flags; > > * char filename[]; > > * struct sample_id sample_id; > > * }; > > */ > > PERF_RECORD_MMAP2 = 10, > > > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event > > > > As perf.data files can be used for many purposes we want them all, so we > > ok, so because you want them all and you don't know which VMAs will be > useful or not, it's a different problem. BPF iterators will be faster > purely due to avoiding binary -> text -> binary conversion path, but > other than that you'll still retrieve all VMAs. But not using tons of syscalls to parse text data from /proc. > You can still do the same full VMA iteration with this new API, of > course, but advantages are probably smaller as you'll be retrieving a > full set of VMAs regardless (though it would be interesting to compare > anyways). sure, I can't see how it would be faster, but yeah, interesting to see what is the difference. > > setup a meta data perf file descriptor to go on receiving the new mmaps > > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do > > it in parallel, etc: > > > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' > > > > Usage: perf record [<options>] [<command>] > > or: perf record [<options>] -- <command> [<options>] > > > > --num-thread-synthesize <n> > > number of threads to run for event synthesis > > --synth <no|all|task|mmap|cgroup> > > Fine-tune event synthesis: default=all > > > > ⬢[acme@toolbox perf-tools-next]$ > > > > For this specific initial synthesis of everything the plan, as mentioned > > about Jiri's experiments, was to use a BPF iterator to just feed the > > perf ring buffer with those events, that way userspace would just > > receive the usual records it gets when a new mmap is put in place, the > > BPF iterator would just feed the preexisting mmaps, as instructed via > > the perf_event_attr for the perf_event_open syscall. > > > > For people not wanting BPF, i.e. disabling it altogether in perf or > > disabling just BPF skels, then we would fallback to the current method, > > or to the one being discussed here when it becomes available. > > > > One thing to have in mind is for this iterator not to generate duplicate > > records for non-pre-existing mmaps, i.e. we would need some generation > > number that would be bumped when asking for such pre-existing maps > > PERF_RECORD_MMAP2 dumps. > > Looking briefly at struct vm_area_struct, it doesn't seems like the > kernel maintains any sort of generation (at least not at > vm_area_struct level), so this would be nice to have, I'm sure, but Yeah, this would be something specific to the "retrieve me the list of VMAs" bulky thing, i.e. the kernel perf code (or the BPF that would generate the PERF_RECORD_MMAP2 records by using a BPF vma iterator) would bump the generation number and store it to the VMA in perf_event_mmap() so that the iterator doesn't consider it, as it is a new mmap that is being just sent to whoever is listening, and the perf tool that put in place the BPF program to iterate is listening. > isn't really related to adding this API. Once the kernel does have Well, perf wants to enumerate pre-existing mmaps _and_ after that finishes to know about new mmaps, so we need to know a way to avoid having the BPF program enumerating pre-existing maps sending PERF_RECORD_MMAP2 for maps perf already knows about via a regular PERF_RECORD_MMAP2 sent when a new mmap is put in place. So there is an overlap where perf (or any other tool wanting to enumerate all pre-existing maps and new ones) can receive info for the same map from the enumerator and from the existing mechanism generating PERF_RECORD_MMAP2 records. - Arnaldo > this "VMA generation" counter, it can be trivially added to this > binary interface (which can't be said about /proc/<pid>/maps, > unfortunately). > > > > > > It will be up to other similar projects to adopt this, but we'll > > > definitely get this into blazesym as it is actually a problem for the > > > > At some point looking at plugging blazesym somehow with perf may be > > something to consider, indeed. > > In the above I meant direct use of this new API in perf code itself, > but yes, blazesym is a generic library for symbolization that handles > ELF/DWARF/GSYM (and I believe more formats), so it indeed might make > sense to use it. > > > > > - Arnaldo > > > > > abovementioned Oculus use case. We already had to make a tradeoff (see > > > [2], this wasn't done just because we could, but it was requested by > > > Oculus customers) to cache the contents of /proc/<pid>/maps and run > > > the risk of missing some shared libraries that can be loaded later. It > > > would be great to not have to do this tradeoff, which this new API > > > would enable. > > > > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > > > [...]
On Mon, May 6, 2024 at 1:35 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Mon, May 06, 2024 at 11:41:43AM -0700, Andrii Nakryiko wrote: > > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > > > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > > it, saving resources. > > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > > > > Where is the userspace code that uses this new api you have created? > > > > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > > > ioctl() API to solve a common problem (as described above) in patch > > > > #5. The plan is to put it in mentioned blazesym library at the very > > > > least. > > > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > > linux-perf-user), as they need to do stack symbolization as well. > > > > > > At some point, when BPF iterators became a thing we thought about, IIRC > > > Jiri did some experimentation, but I lost track, of using BPF to > > > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout > > > as in uapi/linux/perf_event.h: > > > > > > /* > > > * The MMAP2 records are an augmented version of MMAP, they add > > > * maj, min, ino numbers to be used to uniquely identify each mapping > > > * > > > * struct { > > > * struct perf_event_header header; > > > * > > > * u32 pid, tid; > > > * u64 addr; > > > * u64 len; > > > * u64 pgoff; > > > * union { > > > * struct { > > > * u32 maj; > > > * u32 min; > > > * u64 ino; > > > * u64 ino_generation; > > > * }; > > > * struct { > > > * u8 build_id_size; > > > * u8 __reserved_1; > > > * u16 __reserved_2; > > > * u8 build_id[20]; > > > * }; > > > * }; > > > * u32 prot, flags; > > > * char filename[]; > > > * struct sample_id sample_id; > > > * }; > > > */ > > > PERF_RECORD_MMAP2 = 10, > > > > > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event > > > > > > As perf.data files can be used for many purposes we want them all, so we > > > > ok, so because you want them all and you don't know which VMAs will be > > useful or not, it's a different problem. BPF iterators will be faster > > purely due to avoiding binary -> text -> binary conversion path, but > > other than that you'll still retrieve all VMAs. > > But not using tons of syscalls to parse text data from /proc. In terms of syscall *count* you win with 4KB text reads, there are fewer syscalls because of this 4KB-based batching. But the cost of syscall + amount of user-space processing is a different matter. My benchmark in perf (see patch #5 discussion) suggests that even with more ioctl() syscalls, perf would win here. But I also realized that what you really need (I think, correct me if I'm wrong) is only file-backed VMAs, because all the other ones are not that useful for symbolization. So I'm adding a minimal change to my code to allow the user to specify another query flag to only return file-backed VMAs. I'm going to try it with perf code and see how that helps. I'll post results in patch #5 thread, once I have them. > > > You can still do the same full VMA iteration with this new API, of > > course, but advantages are probably smaller as you'll be retrieving a > > full set of VMAs regardless (though it would be interesting to compare > > anyways). > > sure, I can't see how it would be faster, but yeah, interesting to see > what is the difference. see patch #5 thread, seems like it's still a bit faster > > > > setup a meta data perf file descriptor to go on receiving the new mmaps > > > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do > > > it in parallel, etc: > > > > > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis' > > > > > > Usage: perf record [<options>] [<command>] > > > or: perf record [<options>] -- <command> [<options>] > > > > > > --num-thread-synthesize <n> > > > number of threads to run for event synthesis > > > --synth <no|all|task|mmap|cgroup> > > > Fine-tune event synthesis: default=all > > > > > > ⬢[acme@toolbox perf-tools-next]$ > > > > > > For this specific initial synthesis of everything the plan, as mentioned > > > about Jiri's experiments, was to use a BPF iterator to just feed the > > > perf ring buffer with those events, that way userspace would just > > > receive the usual records it gets when a new mmap is put in place, the > > > BPF iterator would just feed the preexisting mmaps, as instructed via > > > the perf_event_attr for the perf_event_open syscall. > > > > > > For people not wanting BPF, i.e. disabling it altogether in perf or > > > disabling just BPF skels, then we would fallback to the current method, > > > or to the one being discussed here when it becomes available. > > > > > > One thing to have in mind is for this iterator not to generate duplicate > > > records for non-pre-existing mmaps, i.e. we would need some generation > > > number that would be bumped when asking for such pre-existing maps > > > PERF_RECORD_MMAP2 dumps. > > > > Looking briefly at struct vm_area_struct, it doesn't seems like the > > kernel maintains any sort of generation (at least not at > > vm_area_struct level), so this would be nice to have, I'm sure, but > > Yeah, this would be something specific to the "retrieve me the list of > VMAs" bulky thing, i.e. the kernel perf code (or the BPF that would > generate the PERF_RECORD_MMAP2 records by using a BPF vma iterator) > would bump the generation number and store it to the VMA in > perf_event_mmap() so that the iterator doesn't consider it, as it is a > new mmap that is being just sent to whoever is listening, and the perf > tool that put in place the BPF program to iterate is listening. Ok, we went on *so many* tangents in emails on this patch set :) Seems like there are a bunch of perf-specific improvements possible which are completely irrelevant to the API I'm proposing. Let's please keep them separate (and you, perf folks, should propose them upstream), it's getting hard to see what this patch set is actually about with all the tangential emails. > > > isn't really related to adding this API. Once the kernel does have > > Well, perf wants to enumerate pre-existing mmaps _and_ after that > finishes to know about new mmaps, so we need to know a way to avoid > having the BPF program enumerating pre-existing maps sending > PERF_RECORD_MMAP2 for maps perf already knows about via a regular > PERF_RECORD_MMAP2 sent when a new mmap is put in place. > > So there is an overlap where perf (or any other tool wanting to > enumerate all pre-existing maps and new ones) can receive info for the > same map from the enumerator and from the existing mechanism generating > PERF_RECORD_MMAP2 records. > > - Arnaldo > > > this "VMA generation" counter, it can be trivially added to this > > binary interface (which can't be said about /proc/<pid>/maps, > > unfortunately). > > > > > > > > > It will be up to other similar projects to adopt this, but we'll > > > > definitely get this into blazesym as it is actually a problem for the > > > > > > At some point looking at plugging blazesym somehow with perf may be > > > something to consider, indeed. > > > > In the above I meant direct use of this new API in perf code itself, > > but yes, blazesym is a generic library for symbolization that handles > > ELF/DWARF/GSYM (and I believe more formats), so it indeed might make > > sense to use it. > > > > > > > > - Arnaldo > > > > > > > abovementioned Oculus use case. We already had to make a tradeoff (see > > > > [2], this wasn't done just because we could, but it was requested by > > > > Oculus customers) to cache the contents of /proc/<pid>/maps and run > > > > the risk of missing some shared libraries that can be loaded later. It > > > > would be great to not have to do this tradeoff, which this new API > > > > would enable. > > > > > > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf > > > > > > > > [...]
* Andrii Nakryiko <andrii@kernel.org> [240503 20:30]: > /proc/<pid>/maps file is extremely useful in practice for various tasks > involving figuring out process memory layout, what files are backing any > given memory range, etc. One important class of applications that > absolutely rely on this are profilers/stack symbolizers. They would > normally capture stack trace containing absolute memory addresses of > some functions, and would then use /proc/<pid>/maps file to file > corresponding backing ELF files, file offsets within them, and then > continue from there to get yet more information (ELF symbols, DWARF > information) to get human-readable symbolic information. > > As such, there are both performance and correctness requirement > involved. This address to VMA information translation has to be done as > efficiently as possible, but also not miss any VMA (especially in the > case of loading/unloading shared libraries). > > Unfortunately, for all the /proc/<pid>/maps file universality and > usefulness, it doesn't fit the above 100%. > > First, it's text based, which makes its programmatic use from > applications and libraries unnecessarily cumbersome and slow due to the > need to do text parsing to get necessary pieces of information. > > Second, it's main purpose is to emit all VMAs sequentially, but in > practice captured addresses would fall only into a small subset of all > process' VMAs, mainly containing executable text. Yet, library would > need to parse most or all of the contents to find needed VMAs, as there > is no way to skip VMAs that are of no use. Efficient library can do the > linear pass and it is still relatively efficient, but it's definitely an > overhead that can be avoided, if there was a way to do more targeted > querying of the relevant VMA information. > > Another problem when writing generic stack trace symbolization library > is an unfortunate performance-vs-correctness tradeoff that needs to be > made. Library has to make a decision to either cache parsed contents of > /proc/<pid>/maps for service future requests (if application requests to > symbolize another set of addresses, captured at some later time, which > is typical for periodic/continuous profiling cases) to avoid higher > costs of needed to re-parse this file or caching the contents in memory > to speed up future requests. In the former case, more memory is used for > the cache and there is a risk of getting stale data if application > loaded/unloaded shared libraries, or otherwise changed its set of VMAs > through additiona mmap() calls (and other means of altering memory > address space). In the latter case, it's the performance hit that comes > from re-opening the file and re-reading/re-parsing its contents all over > again. > > This patch aims to solve this problem by providing a new API built on > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary > interface, avoiding the cost and awkwardness of textual representation > for programmatic use. It's designed to be extensible and > forward/backward compatible by including user-specified field size and > using copy_struct_from_user() approach. But, most importantly, it allows > to do point queries for specific single address, specified by user. And > this is done efficiently using VMA iterator. > > User has a choice to pick either getting VMA that covers provided > address or -ENOENT if none is found (exact, least surprising, case). Or, > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can > get either VMA that covers the address (if there is one), or the closest > next VMA (i.e., VMA with the smallest vm_start > addr). The later allows > more efficient use, but, given it could be a surprising behavior, > requires an explicit opt-in. > > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes > sense given it's querying the same set of VMA data. All the permissions > checks performed on /proc/<pid>/maps opening fit here as well. > ioctl-based implementation is fetching remembered mm_struct reference, > but otherwise doesn't interfere with seq_file-based implementation of > /proc/<pid>/maps textual interface, and so could be used together or > independently without paying any price for that. > > There is one extra thing that /proc/<pid>/maps doesn't currently > provide, and that's an ability to fetch ELF build ID, if present. User > has control over whether this piece of information is requested or not > by either setting build_id_size field to zero or non-zero maximum buffer > size they provided through build_id_addr field (which encodes user > pointer as __u64 field). > > The need to get ELF build ID reliably is an important aspect when > dealing with profiling and stack trace symbolization, and > /proc/<pid>/maps textual representation doesn't help with this, > requiring applications to open underlying ELF binary through > /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra > permissions implications due giving a full access to the binary from > (potentially) another process, while all application is interested in is > build ID. Giving an ability to request just build ID doesn't introduce > any additional security concerns, on top of what /proc/<pid>/maps is > already concerned with, simplifying the overall logic. > > Kernel already implements build ID fetching, which is used from BPF > subsystem. We are reusing this code here, but plan a follow up changes > to make it work better under more relaxed assumption (compared to what > existing code assumes) of being called from user process context, in > which page faults are allowed. BPF-specific implementation currently > bails out if necessary part of ELF file is not paged in, all due to > extra BPF-specific restrictions (like the need to fetch build ID in > restrictive contexts such as NMI handler). > > Note also, that fetching VMA name (e.g., backing file path, or special > hard-coded or user-provided names) is optional just like build ID. If > user sets vma_name_size to zero, kernel code won't attempt to retrieve > it, saving resources. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/fs.h | 32 ++++++++ > 2 files changed, 197 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 8e503a1635b7..cb7b1ff1a144 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -22,6 +22,7 @@ > #include <linux/pkeys.h> > #include <linux/minmax.h> > #include <linux/overflow.h> > +#include <linux/buildid.h> > > #include <asm/elf.h> > #include <asm/tlb.h> > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > return do_maps_open(inode, file, &proc_pid_maps_op); > } > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > +{ > + struct procfs_procmap_query karg; > + struct vma_iterator iter; > + struct vm_area_struct *vma; > + struct mm_struct *mm; > + const char *name = NULL; > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > + __u64 usize; > + int err; > + > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > + return -EFAULT; > + if (usize > PAGE_SIZE) > + return -E2BIG; > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > + return -EINVAL; > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > + if (err) > + return err; > + > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > + return -EINVAL; > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > + return -EINVAL; > + if (!!karg.build_id_size != !!karg.build_id_addr) > + return -EINVAL; > + > + mm = priv->mm; > + if (!mm || !mmget_not_zero(mm)) > + return -ESRCH; > + if (mmap_read_lock_killable(mm)) { > + mmput(mm); > + return -EINTR; > + } Using the rcu lookup here will allow for more success rate with less lock contention. > + > + vma_iter_init(&iter, mm, karg.query_addr); > + vma = vma_next(&iter); > + if (!vma) { > + err = -ENOENT; > + goto out; > + } > + /* user wants covering VMA, not the closest next one */ > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > + vma->vm_start > karg.query_addr) { > + err = -ENOENT; > + goto out; > + } The interface you are using is a start address to search from to the end of the address space, so this won't work as you intended with the PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag. I do not think the vma iterator has the desired interface you want as the single address lookup doesn't use the vma iterator. I'd just run the vma_next() and check the limits. See find_exact_vma() for the limit checks. > + > + karg.vma_start = vma->vm_start; > + karg.vma_end = vma->vm_end; > + > + if (vma->vm_file) { > + const struct inode *inode = file_user_inode(vma->vm_file); > + > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; > + karg.dev_major = MAJOR(inode->i_sb->s_dev); > + karg.dev_minor = MINOR(inode->i_sb->s_dev); > + karg.inode = inode->i_ino; > + } else { > + karg.vma_offset = 0; > + karg.dev_major = 0; > + karg.dev_minor = 0; > + karg.inode = 0; > + } > + > + karg.vma_flags = 0; > + if (vma->vm_flags & VM_READ) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > + if (vma->vm_flags & VM_WRITE) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > + if (vma->vm_flags & VM_EXEC) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > + if (vma->vm_flags & VM_MAYSHARE) > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > + > + if (karg.build_id_size) { > + __u32 build_id_sz = BUILD_ID_SIZE_MAX; > + > + err = build_id_parse(vma, build_id_buf, &build_id_sz); > + if (!err) { > + if (karg.build_id_size < build_id_sz) { > + err = -ENAMETOOLONG; > + goto out; > + } > + karg.build_id_size = build_id_sz; > + } > + } > + > + if (karg.vma_name_size) { > + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); > + const struct path *path; > + const char *name_fmt; > + size_t name_sz = 0; > + > + get_vma_name(vma, &path, &name, &name_fmt); > + > + if (path || name_fmt || name) { > + name_buf = kmalloc(name_buf_sz, GFP_KERNEL); > + if (!name_buf) { > + err = -ENOMEM; > + goto out; > + } > + } > + if (path) { > + name = d_path(path, name_buf, name_buf_sz); > + if (IS_ERR(name)) { > + err = PTR_ERR(name); > + goto out; > + } > + name_sz = name_buf + name_buf_sz - name; > + } else if (name || name_fmt) { > + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name); > + name = name_buf; > + } > + if (name_sz > name_buf_sz) { > + err = -ENAMETOOLONG; > + goto out; > + } > + karg.vma_name_size = name_sz; > + } > + > + /* unlock and put mm_struct before copying data to user */ > + mmap_read_unlock(mm); > + mmput(mm); > + > + if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr, > + name, karg.vma_name_size)) { > + kfree(name_buf); > + return -EFAULT; > + } > + kfree(name_buf); > + > + if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr, > + build_id_buf, karg.build_id_size)) > + return -EFAULT; > + > + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) > + return -EFAULT; > + > + return 0; > + > +out: > + mmap_read_unlock(mm); > + mmput(mm); > + kfree(name_buf); > + return err; > +} > + > +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > +{ > + struct seq_file *seq = file->private_data; > + struct proc_maps_private *priv = seq->private; > + > + switch (cmd) { > + case PROCFS_PROCMAP_QUERY: > + return do_procmap_query(priv, (void __user *)arg); > + default: > + return -ENOIOCTLCMD; > + } > +} > + > const struct file_operations proc_pid_maps_operations = { > .open = pid_maps_open, > .read = seq_read, > .llseek = seq_lseek, > .release = proc_map_release, > + .unlocked_ioctl = procfs_procmap_ioctl, > + .compat_ioctl = procfs_procmap_ioctl, > }; > > /* > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index 45e4e64fd664..fe8924a8d916 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -393,4 +393,36 @@ struct pm_scan_arg { > __u64 return_mask; > }; > > +/* /proc/<pid>/maps ioctl */ > +#define PROCFS_IOCTL_MAGIC 0x9f > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > + > +enum procmap_query_flags { > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > +}; > + > +enum procmap_vma_flags { > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > +}; > + > +struct procfs_procmap_query { > + __u64 size; > + __u64 query_flags; /* in */ > + __u64 query_addr; /* in */ > + __u64 vma_start; /* out */ > + __u64 vma_end; /* out */ > + __u64 vma_flags; /* out */ > + __u64 vma_offset; /* out */ > + __u64 inode; /* out */ > + __u32 dev_major; /* out */ > + __u32 dev_minor; /* out */ > + __u32 vma_name_size; /* in/out */ > + __u32 build_id_size; /* in/out */ > + __u64 vma_name_addr; /* in */ > + __u64 build_id_addr; /* in */ > +}; > + > #endif /* _UAPI_LINUX_FS_H */ > -- > 2.43.0 > >
On Tue, May 7, 2024 at 11:10 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Andrii Nakryiko <andrii@kernel.org> [240503 20:30]: > > /proc/<pid>/maps file is extremely useful in practice for various tasks > > involving figuring out process memory layout, what files are backing any > > given memory range, etc. One important class of applications that > > absolutely rely on this are profilers/stack symbolizers. They would > > normally capture stack trace containing absolute memory addresses of > > some functions, and would then use /proc/<pid>/maps file to file > > corresponding backing ELF files, file offsets within them, and then > > continue from there to get yet more information (ELF symbols, DWARF > > information) to get human-readable symbolic information. > > > > As such, there are both performance and correctness requirement > > involved. This address to VMA information translation has to be done as > > efficiently as possible, but also not miss any VMA (especially in the > > case of loading/unloading shared libraries). > > > > Unfortunately, for all the /proc/<pid>/maps file universality and > > usefulness, it doesn't fit the above 100%. > > > > First, it's text based, which makes its programmatic use from > > applications and libraries unnecessarily cumbersome and slow due to the > > need to do text parsing to get necessary pieces of information. > > > > Second, it's main purpose is to emit all VMAs sequentially, but in > > practice captured addresses would fall only into a small subset of all > > process' VMAs, mainly containing executable text. Yet, library would > > need to parse most or all of the contents to find needed VMAs, as there > > is no way to skip VMAs that are of no use. Efficient library can do the > > linear pass and it is still relatively efficient, but it's definitely an > > overhead that can be avoided, if there was a way to do more targeted > > querying of the relevant VMA information. > > > > Another problem when writing generic stack trace symbolization library > > is an unfortunate performance-vs-correctness tradeoff that needs to be > > made. Library has to make a decision to either cache parsed contents of > > /proc/<pid>/maps for service future requests (if application requests to > > symbolize another set of addresses, captured at some later time, which > > is typical for periodic/continuous profiling cases) to avoid higher > > costs of needed to re-parse this file or caching the contents in memory > > to speed up future requests. In the former case, more memory is used for > > the cache and there is a risk of getting stale data if application > > loaded/unloaded shared libraries, or otherwise changed its set of VMAs > > through additiona mmap() calls (and other means of altering memory > > address space). In the latter case, it's the performance hit that comes > > from re-opening the file and re-reading/re-parsing its contents all over > > again. > > > > This patch aims to solve this problem by providing a new API built on > > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary > > interface, avoiding the cost and awkwardness of textual representation > > for programmatic use. It's designed to be extensible and > > forward/backward compatible by including user-specified field size and > > using copy_struct_from_user() approach. But, most importantly, it allows > > to do point queries for specific single address, specified by user. And > > this is done efficiently using VMA iterator. > > > > User has a choice to pick either getting VMA that covers provided > > address or -ENOENT if none is found (exact, least surprising, case). Or, > > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can > > get either VMA that covers the address (if there is one), or the closest > > next VMA (i.e., VMA with the smallest vm_start > addr). The later allows > > more efficient use, but, given it could be a surprising behavior, > > requires an explicit opt-in. > > > > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes > > sense given it's querying the same set of VMA data. All the permissions > > checks performed on /proc/<pid>/maps opening fit here as well. > > ioctl-based implementation is fetching remembered mm_struct reference, > > but otherwise doesn't interfere with seq_file-based implementation of > > /proc/<pid>/maps textual interface, and so could be used together or > > independently without paying any price for that. > > > > There is one extra thing that /proc/<pid>/maps doesn't currently > > provide, and that's an ability to fetch ELF build ID, if present. User > > has control over whether this piece of information is requested or not > > by either setting build_id_size field to zero or non-zero maximum buffer > > size they provided through build_id_addr field (which encodes user > > pointer as __u64 field). > > > > The need to get ELF build ID reliably is an important aspect when > > dealing with profiling and stack trace symbolization, and > > /proc/<pid>/maps textual representation doesn't help with this, > > requiring applications to open underlying ELF binary through > > /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra > > permissions implications due giving a full access to the binary from > > (potentially) another process, while all application is interested in is > > build ID. Giving an ability to request just build ID doesn't introduce > > any additional security concerns, on top of what /proc/<pid>/maps is > > already concerned with, simplifying the overall logic. > > > > Kernel already implements build ID fetching, which is used from BPF > > subsystem. We are reusing this code here, but plan a follow up changes > > to make it work better under more relaxed assumption (compared to what > > existing code assumes) of being called from user process context, in > > which page faults are allowed. BPF-specific implementation currently > > bails out if necessary part of ELF file is not paged in, all due to > > extra BPF-specific restrictions (like the need to fetch build ID in > > restrictive contexts such as NMI handler). > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > hard-coded or user-provided names) is optional just like build ID. If > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > it, saving resources. > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > include/uapi/linux/fs.h | 32 ++++++++ > > 2 files changed, 197 insertions(+) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 8e503a1635b7..cb7b1ff1a144 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -22,6 +22,7 @@ > > #include <linux/pkeys.h> > > #include <linux/minmax.h> > > #include <linux/overflow.h> > > +#include <linux/buildid.h> > > > > #include <asm/elf.h> > > #include <asm/tlb.h> > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > return do_maps_open(inode, file, &proc_pid_maps_op); > > } > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > +{ > > + struct procfs_procmap_query karg; > > + struct vma_iterator iter; > > + struct vm_area_struct *vma; > > + struct mm_struct *mm; > > + const char *name = NULL; > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > + __u64 usize; > > + int err; > > + > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > + return -EFAULT; > > + if (usize > PAGE_SIZE) > > + return -E2BIG; > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > + return -EINVAL; > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > + if (err) > > + return err; > > + > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > + return -EINVAL; > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > + return -EINVAL; > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > + return -EINVAL; > > + > > + mm = priv->mm; > > + if (!mm || !mmget_not_zero(mm)) > > + return -ESRCH; > > + if (mmap_read_lock_killable(mm)) { > > + mmput(mm); > > + return -EINTR; > > + } > > Using the rcu lookup here will allow for more success rate with less > lock contention. > If you have any code pointers, I'd appreciate it. If not, I'll try to find it myself, no worries. > > + > > + vma_iter_init(&iter, mm, karg.query_addr); > > + vma = vma_next(&iter); > > + if (!vma) { > > + err = -ENOENT; > > + goto out; > > + } > > + /* user wants covering VMA, not the closest next one */ > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > + vma->vm_start > karg.query_addr) { > > + err = -ENOENT; > > + goto out; > > + } > > The interface you are using is a start address to search from to the end > of the address space, so this won't work as you intended with the > PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag. I do not think the vma iterator Maybe the name isn't the best, by "EXACT" here I meant "VMA that exactly covers provided address", so maybe "COVERING_OR_NEXT_VMA" would be better wording. With that out of the way, I think this API works exactly how I expect it to work: # cat /proc/3406/maps | grep -C1 7f42099fe000 7f42099fa000-7f42099fc000 rw-p 00000000 00:00 0 7f42099fc000-7f42099fe000 r--p 00000000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 7f4209a0e000-7f4209a14000 r--p 00012000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 # cat addrs.txt 0x7f42099fe010 # ./procfs_query -f addrs.txt -p 3406 -v -Q PID: 3406 PATH: addrs.txt READ 1 addrs! SORTED ADDRS (1): ADDR #0: 0x7f42099fe010 VMA FOUND (addr 7f42099fe010): 7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 (build ID: NO, 0 bytes) RESOLVED ADDRS (1): RESOLVED #0: 0x7f42099fe010 -> OFF 0x2010 NAME /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 You can see above that for the requested 0x7f42099fe010 address we got a VMA that starts before this address: 7f42099fe000-7f4209a0e000, which is what we want. Before submitting I ran the tool with /proc/<pid>/maps and ioctl to "resolve" the exact same set of addresses and I compared results. They were identical. Note, there is a small bug in the tool I added in patch #5. I changed `-i` argument to `-Q` at the very last moment and haven't updated the code in one place. But other than that I didn't change anything. For the above output, I added "VMA FOUND" verbose logging to see all the details of VMA, not just resolved offset. I'll add that in v2. > has the desired interface you want as the single address lookup doesn't > use the vma iterator. I'd just run the vma_next() and check the limits. > See find_exact_vma() for the limit checks. > > > + > > + karg.vma_start = vma->vm_start; > > + karg.vma_end = vma->vm_end; > > + [...]
On Mon, May 6, 2024 at 12:16 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Mon, May 06, 2024 at 03:53:40PM -0300, Arnaldo Carvalho de Melo wrote: > > On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote: > > > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote: > > > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote: > > > > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > > > > > > hard-coded or user-provided names) is optional just like build ID. If > > > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > > > > > > it, saving resources. > > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > > > > Where is the userspace code that uses this new api you have created? > > > > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new > > > > > ioctl() API to solve a common problem (as described above) in patch > > > > > #5. The plan is to put it in mentioned blazesym library at the very > > > > > least. > > > > > > > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and > > > > > linux-perf-user), as they need to do stack symbolization as well. > > > > > I think the general use case in perf is different. This ioctl API is great > > > for live tracing of a single (or a small number of) process(es). And > > > yes, perf tools have those tracing use cases too. But I think the > > > major use case of perf tools is system-wide profiling. > > > > > For system-wide profiling, you need to process samples of many > > > different processes at a high frequency. Now perf record doesn't > > > process them and just save it for offline processing (well, it does > > > at the end to find out build-ID but it can be omitted). > > > > Since: > > > > Author: Jiri Olsa <jolsa@kernel.org> > > Date: Mon Dec 14 11:54:49 2020 +0100 > > 1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events") > > > > We don't need to to process the events to find the build ids. I haven't > > checked if we still do it to find out which DSOs had hits, but we > > shouldn't need to do it for build-ids (unless they were not in memory > > when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I > > haven't checked but IIRC is a possibility if that ELF part isn't in > > memory at the time we want to copy it). > > > If we're still traversing it like that I guess we can have a knob and > > make it the default to not do that and instead create the perf.data > > build ID header table with all the build-ids we got from > > PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event > > processing at the end of a 'perf record' session. > > But then we don't process the PERF_RECORD_MMAP2 in 'perf record', it > just goes on directly to the perf.data file :-\ Yep, we don't process build-IDs at the end if --buildid-mmap option is given. It won't have build-ID header table but it's not needed anymore and perf report can know build-ID from MMAP2 directly. Thanks, Namhyung
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 8e503a1635b7..cb7b1ff1a144 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -22,6 +22,7 @@ #include <linux/pkeys.h> #include <linux/minmax.h> #include <linux/overflow.h> +#include <linux/buildid.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) return do_maps_open(inode, file, &proc_pid_maps_op); } +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) +{ + struct procfs_procmap_query karg; + struct vma_iterator iter; + struct vm_area_struct *vma; + struct mm_struct *mm; + const char *name = NULL; + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; + __u64 usize; + int err; + + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) + return -EFAULT; + if (usize > PAGE_SIZE) + return -E2BIG; + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) + return -EINVAL; + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); + if (err) + return err; + + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) + return -EINVAL; + if (!!karg.vma_name_size != !!karg.vma_name_addr) + return -EINVAL; + if (!!karg.build_id_size != !!karg.build_id_addr) + return -EINVAL; + + mm = priv->mm; + if (!mm || !mmget_not_zero(mm)) + return -ESRCH; + if (mmap_read_lock_killable(mm)) { + mmput(mm); + return -EINTR; + } + + vma_iter_init(&iter, mm, karg.query_addr); + vma = vma_next(&iter); + if (!vma) { + err = -ENOENT; + goto out; + } + /* user wants covering VMA, not the closest next one */ + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && + vma->vm_start > karg.query_addr) { + err = -ENOENT; + goto out; + } + + karg.vma_start = vma->vm_start; + karg.vma_end = vma->vm_end; + + if (vma->vm_file) { + const struct inode *inode = file_user_inode(vma->vm_file); + + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; + karg.dev_major = MAJOR(inode->i_sb->s_dev); + karg.dev_minor = MINOR(inode->i_sb->s_dev); + karg.inode = inode->i_ino; + } else { + karg.vma_offset = 0; + karg.dev_major = 0; + karg.dev_minor = 0; + karg.inode = 0; + } + + karg.vma_flags = 0; + if (vma->vm_flags & VM_READ) + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; + if (vma->vm_flags & VM_WRITE) + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; + if (vma->vm_flags & VM_EXEC) + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; + if (vma->vm_flags & VM_MAYSHARE) + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; + + if (karg.build_id_size) { + __u32 build_id_sz = BUILD_ID_SIZE_MAX; + + err = build_id_parse(vma, build_id_buf, &build_id_sz); + if (!err) { + if (karg.build_id_size < build_id_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.build_id_size = build_id_sz; + } + } + + if (karg.vma_name_size) { + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); + const struct path *path; + const char *name_fmt; + size_t name_sz = 0; + + get_vma_name(vma, &path, &name, &name_fmt); + + if (path || name_fmt || name) { + name_buf = kmalloc(name_buf_sz, GFP_KERNEL); + if (!name_buf) { + err = -ENOMEM; + goto out; + } + } + if (path) { + name = d_path(path, name_buf, name_buf_sz); + if (IS_ERR(name)) { + err = PTR_ERR(name); + goto out; + } + name_sz = name_buf + name_buf_sz - name; + } else if (name || name_fmt) { + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name); + name = name_buf; + } + if (name_sz > name_buf_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.vma_name_size = name_sz; + } + + /* unlock and put mm_struct before copying data to user */ + mmap_read_unlock(mm); + mmput(mm); + + if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr, + name, karg.vma_name_size)) { + kfree(name_buf); + return -EFAULT; + } + kfree(name_buf); + + if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr, + build_id_buf, karg.build_id_size)) + return -EFAULT; + + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) + return -EFAULT; + + return 0; + +out: + mmap_read_unlock(mm); + mmput(mm); + kfree(name_buf); + return err; +} + +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct seq_file *seq = file->private_data; + struct proc_maps_private *priv = seq->private; + + switch (cmd) { + case PROCFS_PROCMAP_QUERY: + return do_procmap_query(priv, (void __user *)arg); + default: + return -ENOIOCTLCMD; + } +} + const struct file_operations proc_pid_maps_operations = { .open = pid_maps_open, .read = seq_read, .llseek = seq_lseek, .release = proc_map_release, + .unlocked_ioctl = procfs_procmap_ioctl, + .compat_ioctl = procfs_procmap_ioctl, }; /* diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 45e4e64fd664..fe8924a8d916 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -393,4 +393,36 @@ struct pm_scan_arg { __u64 return_mask; }; +/* /proc/<pid>/maps ioctl */ +#define PROCFS_IOCTL_MAGIC 0x9f +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) + +enum procmap_query_flags { + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, +}; + +enum procmap_vma_flags { + PROCFS_PROCMAP_VMA_READABLE = 0x01, + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, + PROCFS_PROCMAP_VMA_SHARED = 0x08, +}; + +struct procfs_procmap_query { + __u64 size; + __u64 query_flags; /* in */ + __u64 query_addr; /* in */ + __u64 vma_start; /* out */ + __u64 vma_end; /* out */ + __u64 vma_flags; /* out */ + __u64 vma_offset; /* out */ + __u64 inode; /* out */ + __u32 dev_major; /* out */ + __u32 dev_minor; /* out */ + __u32 vma_name_size; /* in/out */ + __u32 build_id_size; /* in/out */ + __u64 vma_name_addr; /* in */ + __u64 build_id_addr; /* in */ +}; + #endif /* _UAPI_LINUX_FS_H */
/proc/<pid>/maps file is extremely useful in practice for various tasks involving figuring out process memory layout, what files are backing any given memory range, etc. One important class of applications that absolutely rely on this are profilers/stack symbolizers. They would normally capture stack trace containing absolute memory addresses of some functions, and would then use /proc/<pid>/maps file to file corresponding backing ELF files, file offsets within them, and then continue from there to get yet more information (ELF symbols, DWARF information) to get human-readable symbolic information. As such, there are both performance and correctness requirement involved. This address to VMA information translation has to be done as efficiently as possible, but also not miss any VMA (especially in the case of loading/unloading shared libraries). Unfortunately, for all the /proc/<pid>/maps file universality and usefulness, it doesn't fit the above 100%. First, it's text based, which makes its programmatic use from applications and libraries unnecessarily cumbersome and slow due to the need to do text parsing to get necessary pieces of information. Second, it's main purpose is to emit all VMAs sequentially, but in practice captured addresses would fall only into a small subset of all process' VMAs, mainly containing executable text. Yet, library would need to parse most or all of the contents to find needed VMAs, as there is no way to skip VMAs that are of no use. Efficient library can do the linear pass and it is still relatively efficient, but it's definitely an overhead that can be avoided, if there was a way to do more targeted querying of the relevant VMA information. Another problem when writing generic stack trace symbolization library is an unfortunate performance-vs-correctness tradeoff that needs to be made. Library has to make a decision to either cache parsed contents of /proc/<pid>/maps for service future requests (if application requests to symbolize another set of addresses, captured at some later time, which is typical for periodic/continuous profiling cases) to avoid higher costs of needed to re-parse this file or caching the contents in memory to speed up future requests. In the former case, more memory is used for the cache and there is a risk of getting stale data if application loaded/unloaded shared libraries, or otherwise changed its set of VMAs through additiona mmap() calls (and other means of altering memory address space). In the latter case, it's the performance hit that comes from re-opening the file and re-reading/re-parsing its contents all over again. This patch aims to solve this problem by providing a new API built on top of /proc/<pid>/maps. It is ioctl()-based and built as a binary interface, avoiding the cost and awkwardness of textual representation for programmatic use. It's designed to be extensible and forward/backward compatible by including user-specified field size and using copy_struct_from_user() approach. But, most importantly, it allows to do point queries for specific single address, specified by user. And this is done efficiently using VMA iterator. User has a choice to pick either getting VMA that covers provided address or -ENOENT if none is found (exact, least surprising, case). Or, with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can get either VMA that covers the address (if there is one), or the closest next VMA (i.e., VMA with the smallest vm_start > addr). The later allows more efficient use, but, given it could be a surprising behavior, requires an explicit opt-in. Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes sense given it's querying the same set of VMA data. All the permissions checks performed on /proc/<pid>/maps opening fit here as well. ioctl-based implementation is fetching remembered mm_struct reference, but otherwise doesn't interfere with seq_file-based implementation of /proc/<pid>/maps textual interface, and so could be used together or independently without paying any price for that. There is one extra thing that /proc/<pid>/maps doesn't currently provide, and that's an ability to fetch ELF build ID, if present. User has control over whether this piece of information is requested or not by either setting build_id_size field to zero or non-zero maximum buffer size they provided through build_id_addr field (which encodes user pointer as __u64 field). The need to get ELF build ID reliably is an important aspect when dealing with profiling and stack trace symbolization, and /proc/<pid>/maps textual representation doesn't help with this, requiring applications to open underlying ELF binary through /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra permissions implications due giving a full access to the binary from (potentially) another process, while all application is interested in is build ID. Giving an ability to request just build ID doesn't introduce any additional security concerns, on top of what /proc/<pid>/maps is already concerned with, simplifying the overall logic. Kernel already implements build ID fetching, which is used from BPF subsystem. We are reusing this code here, but plan a follow up changes to make it work better under more relaxed assumption (compared to what existing code assumes) of being called from user process context, in which page faults are allowed. BPF-specific implementation currently bails out if necessary part of ELF file is not paged in, all due to extra BPF-specific restrictions (like the need to fetch build ID in restrictive contexts such as NMI handler). Note also, that fetching VMA name (e.g., backing file path, or special hard-coded or user-provided names) is optional just like build ID. If user sets vma_name_size to zero, kernel code won't attempt to retrieve it, saving resources. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 32 ++++++++ 2 files changed, 197 insertions(+)