diff mbox series

[v6,3/6] fs/procfs: add build ID fetching to PROCMAP_QUERY API

Message ID 20240627170900.1672542-4-andrii@kernel.org (mailing list archive)
State New
Headers show
Series ioctl()-based API to query VMAs from /proc/<pid>/maps | expand

Commit Message

Andrii Nakryiko June 27, 2024, 5:08 p.m. UTC
The need to get ELF build ID reliably is an important aspect when
dealing with profiling and stack trace symbolization, and
/proc/<pid>/maps textual representation doesn't help with this.

To get backing file's ELF build ID, application has to first resolve
VMA, then use it's start/end address range to follow a special
/proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this
is necessary because backing file might have been removed from the disk
or was already replaced with another binary in the same file path.

Such approach, beyond just adding complexity of having to do a bunch of
extra work, has extra security implications. Because application opens
underlying ELF file and needs read access to its entire contents (as far
as kernel is concerned), kernel puts additional capable() checks on
following /proc/<pid>/map_files/<start>-<end> symlink. And that makes
sense in general.

But in the case of build ID, profiler/symbolizer doesn't need the
contents of ELF file, per se. It's only build ID that is of interest,
and ELF build ID itself doesn't provide any sensitive information.

So this patch adds a way to request backing file's ELF build ID along
the rest of VMA information in the same API. User has control over
whether this piece of information is requested or not by either setting
build_id_size field to zero or non-zero maximum buffer size they
provided through build_id_addr field (which encodes user pointer as
__u64 field). This is a completely optional piece of information, and so
has no performance implications for user cases that don't care about
build ID, while improving performance and simplifying the setup for
those application that do need it.

Kernel already implements build ID fetching, which is used from BPF
subsystem. We are reusing this code here, but plan a follow up changes
to make it work better under more relaxed assumption (compared to what
existing code assumes) of being called from user process context, in
which page faults are allowed. BPF-specific implementation currently
bails out if necessary part of ELF file is not paged in, all due to
extra BPF-specific restrictions (like the need to fetch build ID in
restrictive contexts such as NMI handler).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c      | 25 ++++++++++++++++++++++++-
 include/uapi/linux/fs.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 1 deletion(-)

Comments

Andi Kleen June 27, 2024, 11 p.m. UTC | #1
Andrii Nakryiko <andrii@kernel.org> writes:

> The need to get ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and
> /proc/<pid>/maps textual representation doesn't help with this.
>
> To get backing file's ELF build ID, application has to first resolve
> VMA, then use it's start/end address range to follow a special
> /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this
> is necessary because backing file might have been removed from the disk
> or was already replaced with another binary in the same file path.
>
> Such approach, beyond just adding complexity of having to do a bunch of
> extra work, has extra security implications. Because application opens
> underlying ELF file and needs read access to its entire contents (as far
> as kernel is concerned), kernel puts additional capable() checks on
> following /proc/<pid>/map_files/<start>-<end> symlink. And that makes
> sense in general.

I was curious about this statement. It has still certainly potential
for side channels e.g. for files that are execute only, or with
some other special protection.

But actually just looking at the parsing code it seems to fail basic
TOCTTOU rules, and since you don't check if the VMA mapping is executable
(I think), so there's no EBUSY checking for writes, it likely is exploitable.


        /* only supports phdr that fits in one page */
                if (ehdr->e_phnum >
                   (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
                <---------- check in memory
                                return -EINVAL;

        phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));

<---- but page is shared in the page cache. So if anybody manages to map
it for write 


        for (i = 0; i < ehdr->e_phnum; ++i) {   <----- this loop can go
                        off into the next page.
                        if (phdr[i].p_type == PT_NOTE &&
                                            !parse_build_id(page_addr, build_id, size,
                                                            page_addr + phdr[i].p_offset,
                                                            phdr[i].p_filesz))
                                                                                    return 0;

Here's an untested patch


diff --git a/lib/buildid.c b/lib/buildid.c
index 7954dd92e36c..6c022fcd03ec 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -72,19 +72,20 @@ static int get_build_id_32(const void *page_addr, unsigned char *build_id,
 	Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
 	Elf32_Phdr *phdr;
 	int i;
+	unsigned phnum = READ_ONCE(ehdr->e_phnum);
 
 	/* only supports phdr that fits in one page */
-	if (ehdr->e_phnum >
+	if (phnum >
 	    (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
 		return -EINVAL;
 
 	phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
 
-	for (i = 0; i < ehdr->e_phnum; ++i) {
+	for (i = 0; i < phnum; ++i) {
 		if (phdr[i].p_type == PT_NOTE &&
 		    !parse_build_id(page_addr, build_id, size,
 				    page_addr + phdr[i].p_offset,
-				    phdr[i].p_filesz))
+				    READ_ONCE(phdr[i].p_filesz)))
 			return 0;
 	}
 	return -EINVAL;
@@ -97,15 +98,16 @@ static int get_build_id_64(const void *page_addr, unsigned char *build_id,
 	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
 	Elf64_Phdr *phdr;
 	int i;
+	unsigned phnum = READ_ONCE(ehdr->e_phnum);
 
 	/* only supports phdr that fits in one page */
-	if (ehdr->e_phnum >
+	if (phnum >
 	    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
 		return -EINVAL;
 
 	phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
 
-	for (i = 0; i < ehdr->e_phnum; ++i) {
+	for (i = 0; i < phnum; ++i) {
 		if (phdr[i].p_type == PT_NOTE &&
 		    !parse_build_id(page_addr, build_id, size,
 				    page_addr + phdr[i].p_offset,
Andrii Nakryiko June 28, 2024, 4:36 p.m. UTC | #2
On Thu, Jun 27, 2024 at 4:00 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this.
> >
> > To get backing file's ELF build ID, application has to first resolve
> > VMA, then use it's start/end address range to follow a special
> > /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this
> > is necessary because backing file might have been removed from the disk
> > or was already replaced with another binary in the same file path.
> >
> > Such approach, beyond just adding complexity of having to do a bunch of
> > extra work, has extra security implications. Because application opens
> > underlying ELF file and needs read access to its entire contents (as far
> > as kernel is concerned), kernel puts additional capable() checks on
> > following /proc/<pid>/map_files/<start>-<end> symlink. And that makes
> > sense in general.
>
> I was curious about this statement. It has still certainly potential
> for side channels e.g. for files that are execute only, or with
> some other special protection.
>
> But actually just looking at the parsing code it seems to fail basic
> TOCTTOU rules, and since you don't check if the VMA mapping is executable
> (I think), so there's no EBUSY checking for writes, it likely is exploitable.
>
>
>         /* only supports phdr that fits in one page */
>                 if (ehdr->e_phnum >
>                    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
>                 <---------- check in memory
>                                 return -EINVAL;
>
>         phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
>
> <---- but page is shared in the page cache. So if anybody manages to map
> it for write
>
>
>         for (i = 0; i < ehdr->e_phnum; ++i) {   <----- this loop can go
>                         off into the next page.
>                         if (phdr[i].p_type == PT_NOTE &&
>                                             !parse_build_id(page_addr, build_id, size,
>                                                             page_addr + phdr[i].p_offset,
>                                                             phdr[i].p_filesz))
>                                                                                     return 0;
>
> Here's an untested patch
>
>

Yep, makes sense. I'm currently reworking this whole lib/buildid.c
implementation to remove all the restrictions on data being in the
first page only, and making it work in a faultable context more
reliably. I can audit the code for TOCTOU issues and incorporate your
feedback. I'll probably post the patch set next week, will cc you as
well.

> diff --git a/lib/buildid.c b/lib/buildid.c
> index 7954dd92e36c..6c022fcd03ec 100644
> --- a/lib/buildid.c
> +++ b/lib/buildid.c
> @@ -72,19 +72,20 @@ static int get_build_id_32(const void *page_addr, unsigned char *build_id,
>         Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
>         Elf32_Phdr *phdr;
>         int i;
> +       unsigned phnum = READ_ONCE(ehdr->e_phnum);
>
>         /* only supports phdr that fits in one page */
> -       if (ehdr->e_phnum >
> +       if (phnum >
>             (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
>                 return -EINVAL;
>
>         phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
>
> -       for (i = 0; i < ehdr->e_phnum; ++i) {
> +       for (i = 0; i < phnum; ++i) {
>                 if (phdr[i].p_type == PT_NOTE &&
>                     !parse_build_id(page_addr, build_id, size,
>                                     page_addr + phdr[i].p_offset,
> -                                   phdr[i].p_filesz))
> +                                   READ_ONCE(phdr[i].p_filesz)))
>                         return 0;
>         }
>         return -EINVAL;
> @@ -97,15 +98,16 @@ static int get_build_id_64(const void *page_addr, unsigned char *build_id,
>         Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
>         Elf64_Phdr *phdr;
>         int i;
> +       unsigned phnum = READ_ONCE(ehdr->e_phnum);
>
>         /* only supports phdr that fits in one page */
> -       if (ehdr->e_phnum >
> +       if (phnum >
>             (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
>                 return -EINVAL;
>
>         phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
>
> -       for (i = 0; i < ehdr->e_phnum; ++i) {
> +       for (i = 0; i < phnum; ++i) {
>                 if (phdr[i].p_type == PT_NOTE &&
>                     !parse_build_id(page_addr, build_id, size,
>                                     page_addr + phdr[i].p_offset,
Andi Kleen June 28, 2024, 10:33 p.m. UTC | #3
> Yep, makes sense. I'm currently reworking this whole lib/buildid.c
> implementation to remove all the restrictions on data being in the
> first page only, and making it work in a faultable context more
> reliably. I can audit the code for TOCTOU issues and incorporate your
> feedback. I'll probably post the patch set next week, will cc you as
> well.

Please also add checks that the mapping is executable, to
close the obscure "can check the first 4 bytes of every mapped 
file is ELF\0" hole.

But it will still need the hardening because mappings from
ld.so are not EBUSY for writes.

-Andi
Andrii Nakryiko June 28, 2024, 11:03 p.m. UTC | #4
On Fri, Jun 28, 2024 at 3:33 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Yep, makes sense. I'm currently reworking this whole lib/buildid.c
> > implementation to remove all the restrictions on data being in the
> > first page only, and making it work in a faultable context more
> > reliably. I can audit the code for TOCTOU issues and incorporate your
> > feedback. I'll probably post the patch set next week, will cc you as
> > well.
>
> Please also add checks that the mapping is executable, to
> close the obscure "can check the first 4 bytes of every mapped
> file is ELF\0" hole.
>
> But it will still need the hardening because mappings from
> ld.so are not EBUSY for writes.

I'm a bit confused. Two things:

1) non-executable file-backed VMA still has build ID associated with
it. Note, build ID is extracted from the backing file's content, not
from VMA itself. The part of ELF file that contains build ID isn't
necessarily mmap()'ed at all

2) What sort of exploitation are we talking about here? it's not
enough for backing file to have correct 4 starting bytes (0x7f"ELF"),
we still have to find correct PT_NOTE segment, and .note.gnu.build-id
section within it, that has correct type (3) and key name "GNU".

I'm trying to understand what we are protecting against here.
Especially that opening /proc/<pid>/maps already requires
PTRACE_MODE_READ permissions anyways (or pid should be self).

>
> -Andi
Andi Kleen July 2, 2024, 2:49 p.m. UTC | #5
> 1) non-executable file-backed VMA still has build ID associated with
> it. Note, build ID is extracted from the backing file's content, not
> from VMA itself. The part of ELF file that contains build ID isn't
> necessarily mmap()'ed at all

That's true, but there should be at least one executable mapping
for any useful ELF file.

Basically such a check guarantee that you cannot tell anything
about a non x mapping not related to ELF.

> 
> 2) What sort of exploitation are we talking about here? it's not
> enough for backing file to have correct 4 starting bytes (0x7f"ELF"),
> we still have to find correct PT_NOTE segment, and .note.gnu.build-id
> section within it, that has correct type (3) and key name "GNU".

There's a timing side channel, you can tell where the checks
stop. I don't think it's a big problem, but it's still better to avoid
such leaks in the first place as much as possible.

> 
> I'm trying to understand what we are protecting against here.
> Especially that opening /proc/<pid>/maps already requires
> PTRACE_MODE_READ permissions anyways (or pid should be self).

While that's true for the standard security permission model there might
be non standard ones where the relationship is more complicated.

-Andi
Andrii Nakryiko July 2, 2024, 11:08 p.m. UTC | #6
On Tue, Jul 2, 2024 at 7:50 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> > 1) non-executable file-backed VMA still has build ID associated with
> > it. Note, build ID is extracted from the backing file's content, not
> > from VMA itself. The part of ELF file that contains build ID isn't
> > necessarily mmap()'ed at all
>
> That's true, but there should be at least one executable mapping
> for any useful ELF file.
>
> Basically such a check guarantee that you cannot tell anything
> about a non x mapping not related to ELF.
>

Ok, I can add this check. If you know off the top of your head how to
do that for struct address_space, I'd appreciate the pointer. Quick
glance didn't show anything useful in linux/fs.h, but I'll dig deeper
a bit later.

> >
> > 2) What sort of exploitation are we talking about here? it's not
> > enough for backing file to have correct 4 starting bytes (0x7f"ELF"),
> > we still have to find correct PT_NOTE segment, and .note.gnu.build-id
> > section within it, that has correct type (3) and key name "GNU".
>
> There's a timing side channel, you can tell where the checks
> stop. I don't think it's a big problem, but it's still better to avoid
> such leaks in the first place as much as possible.
>
> >
> > I'm trying to understand what we are protecting against here.
> > Especially that opening /proc/<pid>/maps already requires
> > PTRACE_MODE_READ permissions anyways (or pid should be self).
>
> While that's true for the standard security permission model there might
> be non standard ones where the relationship is more complicated.
>

Presumably non-standard ones will have more and custom security checks
(LSM, seccomp, etc) involved. Basically, I acknowledge your point, but
I'm not sure it changes anything about adding this API.

> -Andi
Andrii Nakryiko July 8, 2024, 11:43 p.m. UTC | #7
On Tue, Jul 2, 2024 at 7:50 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> > 1) non-executable file-backed VMA still has build ID associated with
> > it. Note, build ID is extracted from the backing file's content, not
> > from VMA itself. The part of ELF file that contains build ID isn't
> > necessarily mmap()'ed at all
>
> That's true, but there should be at least one executable mapping
> for any useful ELF file.
>
> Basically such a check guarantee that you cannot tell anything
> about a non x mapping not related to ELF.

Hey Andi,

So when we were discussing this I was imagining that
inode/address_space does have something like VMA's VM_MAYEXEC flag and
it would be easy and fast to check that. But it doesn't seem so.

So what exactly did you have in mind when you were proposing that
check? Did you mean to do a pass over all VMAs within the process to
check if there is at least one executable VMA belonging to
address_space? If yes, then that would certainly be way too expensive
to be usable.

If I missed something obvious, please point me in the right direction.

As it stands, I don't see any reasonable way to check what you asked
performantly. And given this is a bit of over-cautious check, I'm
inclined to just not add it. Worst case someone with PTRACE_MODE_READ
access would be able to tell if the first 4 bytes of a file are ELF
signature or not. Given PTRACE_MODE_READ, I'd imagine that's not
really a problem.

>
> >
> > 2) What sort of exploitation are we talking about here? it's not
> > enough for backing file to have correct 4 starting bytes (0x7f"ELF"),
> > we still have to find correct PT_NOTE segment, and .note.gnu.build-id
> > section within it, that has correct type (3) and key name "GNU".
>
> There's a timing side channel, you can tell where the checks
> stop. I don't think it's a big problem, but it's still better to avoid
> such leaks in the first place as much as possible.
>
> >
> > I'm trying to understand what we are protecting against here.
> > Especially that opening /proc/<pid>/maps already requires
> > PTRACE_MODE_READ permissions anyways (or pid should be self).
>
> While that's true for the standard security permission model there might
> be non standard ones where the relationship is more complicated.
>
> -Andi
Andi Kleen July 9, 2024, 1:27 a.m. UTC | #8
> So what exactly did you have in mind when you were proposing that
> check? Did you mean to do a pass over all VMAs within the process to
> check if there is at least one executable VMA belonging to
> address_space? If yes, then that would certainly be way too expensive
> to be usable.

I was thinking to only report the build ID when the VMA queried
is executable. If software wanted to look up a data symbol
and needs that buildid it would need to check a x vma too.

Normally tools iterate over all the mappings anyways so this
shouldn't be a big burden for them.

Did I miss something?

I guess an alternative would be a new VMA flag, but iirc we're low on
bits there already.

-Andi
Andrii Nakryiko July 9, 2024, 3:14 a.m. UTC | #9
On Mon, Jul 8, 2024 at 6:27 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > So what exactly did you have in mind when you were proposing that
> > check? Did you mean to do a pass over all VMAs within the process to
> > check if there is at least one executable VMA belonging to
> > address_space? If yes, then that would certainly be way too expensive
> > to be usable.
>
> I was thinking to only report the build ID when the VMA queried
> is executable. If software wanted to look up a data symbol
> and needs that buildid it would need to check a x vma too.

I think that's way too restrictive and for no good reason, tbh. If
there is some .rodata ELF section mapped as r/o VMA, I don't see any
reason why user shouldn't be able to request build ID for it.

>
> Normally tools iterate over all the mappings anyways so this
> shouldn't be a big burden for them.
>

This API aims to make this unnecessary. So that tools can request only
relevant VMAs based on whatever captured data or code addresses it got
from, say, profiling of perf events. And if there are some locks or
other global data structures that fall into mapped portions of ELF
data sections, the ability to get build ID for those is just as
important as getting build ID for executable sections.

> Did I miss something?
>
> I guess an alternative would be a new VMA flag, but iirc we're low on
> bits there already.

I think we should just keep things as is. I don't think there is any
real added security in restricting this just to executable VMAs.

>
> -Andi
Jann Horn July 29, 2024, 3:47 p.m. UTC | #10
On Thu, Jun 27, 2024 at 7:08 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> The need to get ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and
> /proc/<pid>/maps textual representation doesn't help with this.
[...]
> @@ -539,6 +543,21 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>                 }
>         }
>
> +       if (karg.build_id_size) {
> +               __u32 build_id_sz;
> +
> +               err = build_id_parse(vma, build_id_buf, &build_id_sz);
> +               if (err) {
> +                       karg.build_id_size = 0;
> +               } else {
> +                       if (karg.build_id_size < build_id_sz) {
> +                               err = -ENAMETOOLONG;
> +                               goto out;
> +                       }
> +                       karg.build_id_size = build_id_sz;
> +               }
> +       }

The diff doesn't have enough context lines to see it here, but the two
closing curly braces above are another copy of exactly the same code
block from the preceding patch. The current state in mainline looks
like this, with two repetitions of exactly the same block:

[...]
                karg.dev_minor = 0;
                karg.inode = 0;
        }

        if (karg.build_id_size) {
                __u32 build_id_sz;

                err = build_id_parse(vma, build_id_buf, &build_id_sz);
                if (err) {
                        karg.build_id_size = 0;
                } else {
                        if (karg.build_id_size < build_id_sz) {
                                err = -ENAMETOOLONG;
                                goto out;
                        }
                        karg.build_id_size = build_id_sz;
                }
        }

        if (karg.build_id_size) {
                __u32 build_id_sz;

                err = build_id_parse(vma, build_id_buf, &build_id_sz);
                if (err) {
                        karg.build_id_size = 0;
                } else {
                        if (karg.build_id_size < build_id_sz) {
                                err = -ENAMETOOLONG;
                                goto out;
                        }
                        karg.build_id_size = build_id_sz;
                }
        }

        if (karg.vma_name_size) {
[...]
Andrii Nakryiko July 29, 2024, 4:52 p.m. UTC | #11
On Mon, Jul 29, 2024 at 8:48 AM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Jun 27, 2024 at 7:08 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this.
> [...]
> > @@ -539,6 +543,21 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> >                 }
> >         }
> >
> > +       if (karg.build_id_size) {
> > +               __u32 build_id_sz;
> > +
> > +               err = build_id_parse(vma, build_id_buf, &build_id_sz);
> > +               if (err) {
> > +                       karg.build_id_size = 0;
> > +               } else {
> > +                       if (karg.build_id_size < build_id_sz) {
> > +                               err = -ENAMETOOLONG;
> > +                               goto out;
> > +                       }
> > +                       karg.build_id_size = build_id_sz;
> > +               }
> > +       }
>
> The diff doesn't have enough context lines to see it here, but the two
> closing curly braces above are another copy of exactly the same code
> block from the preceding patch. The current state in mainline looks
> like this, with two repetitions of exactly the same block:

Yeah, you are right, thanks for the heads up! Seems like a rebase
screw up which duplicated build_id logic. It doesn't have any negative
effects besides doing the same work twice (if build ID parsing is
requested), but I'll definitely will send a fix to drop the
duplication.

>
> [...]
>                 karg.dev_minor = 0;
>                 karg.inode = 0;
>         }
>
>         if (karg.build_id_size) {
>                 __u32 build_id_sz;
>
>                 err = build_id_parse(vma, build_id_buf, &build_id_sz);
>                 if (err) {
>                         karg.build_id_size = 0;
>                 } else {
>                         if (karg.build_id_size < build_id_sz) {
>                                 err = -ENAMETOOLONG;
>                                 goto out;
>                         }
>                         karg.build_id_size = build_id_sz;
>                 }
>         }
>
>         if (karg.build_id_size) {
>                 __u32 build_id_sz;
>
>                 err = build_id_parse(vma, build_id_buf, &build_id_sz);
>                 if (err) {
>                         karg.build_id_size = 0;
>                 } else {
>                         if (karg.build_id_size < build_id_sz) {
>                                 err = -ENAMETOOLONG;
>                                 goto out;
>                         }
>                         karg.build_id_size = build_id_sz;
>                 }
>         }
>
>         if (karg.vma_name_size) {
> [...]
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 674405b99d0d..32bef3eeab7f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,6 +22,7 @@ 
 #include <linux/pkeys.h>
 #include <linux/minmax.h>
 #include <linux/overflow.h>
+#include <linux/buildid.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -445,6 +446,7 @@  static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
 	addr = vma->vm_end;
 	if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
 		goto next_vma;
+
 no_vma:
 	return ERR_PTR(-ENOENT);
 }
@@ -455,7 +457,7 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	const char *name = NULL;
-	char *name_buf = NULL;
+	char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
 	__u64 usize;
 	int err;
 
@@ -477,6 +479,8 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	/* either both buffer address and size are set, or both should be zero */
 	if (!!karg.vma_name_size != !!karg.vma_name_addr)
 		return -EINVAL;
+	if (!!karg.build_id_size != !!karg.build_id_addr)
+		return -EINVAL;
 
 	mm = priv->mm;
 	if (!mm || !mmget_not_zero(mm))
@@ -539,6 +543,21 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 		}
 	}
 
+	if (karg.build_id_size) {
+		__u32 build_id_sz;
+
+		err = build_id_parse(vma, build_id_buf, &build_id_sz);
+		if (err) {
+			karg.build_id_size = 0;
+		} else {
+			if (karg.build_id_size < build_id_sz) {
+				err = -ENAMETOOLONG;
+				goto out;
+			}
+			karg.build_id_size = build_id_sz;
+		}
+	}
+
 	if (karg.vma_name_size) {
 		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
 		const struct path *path;
@@ -583,6 +602,10 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	}
 	kfree(name_buf);
 
+	if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
+					       build_id_buf, karg.build_id_size))
+		return -EFAULT;
+
 	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
 		return -EFAULT;
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 5d440f9b5d92..2a4a5f50c98e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -511,6 +511,26 @@  struct procmap_query {
 	 * If set to zero, vma_name_addr should be set to zero as well
 	 */
 	__u32 vma_name_size;		/* in/out */
+	/*
+	 * If set to non-zero value, signals the request to extract and return
+	 * VMA's backing file's build ID, if the backing file is an ELF file
+	 * and it contains embedded build ID.
+	 *
+	 * Kernel will set this field to zero, if VMA has no backing file,
+	 * backing file is not an ELF file, or ELF file has no build ID
+	 * embedded.
+	 *
+	 * Build ID is a binary value (not a string). Kernel will set
+	 * build_id_size field to exact number of bytes used for build ID.
+	 * If build ID is requested and present, but needs more bytes than
+	 * user-supplied maximum buffer size (see build_id_addr field below),
+	 * -E2BIG error will be returned.
+	 *
+	 * If this field is set to non-zero value, build_id_addr should point
+	 * to valid user space memory buffer of at least build_id_size bytes.
+	 * If set to zero, build_id_addr should be set to zero as well
+	 */
+	__u32 build_id_size;		/* in/out */
 	/*
 	 * User-supplied address of a buffer of at least vma_name_size bytes
 	 * for kernel to fill with matched VMA's name (see vma_name_size field
@@ -519,6 +539,14 @@  struct procmap_query {
 	 * Should be set to zero if VMA name should not be returned.
 	 */
 	__u64 vma_name_addr;		/* in */
+	/*
+	 * User-supplied address of a buffer of at least build_id_size bytes
+	 * for kernel to fill with matched VMA's ELF build ID, if available
+	 * (see build_id_size field description above for details).
+	 *
+	 * Should be set to zero if build ID should not be returned.
+	 */
+	__u64 build_id_addr;		/* in */
 };
 
 #endif /* _UAPI_LINUX_FS_H */