diff mbox series

[079/147] fs/proc/kcore.c: add mmap interface

Message ID 20210908025730.S88ylmikU%akpm@linux-foundation.org (mailing list archive)
State New
Headers show
Series [001/147] mm, slub: don't call flush_all() from slab_debug_trace_open() | expand

Commit Message

Andrew Morton Sept. 8, 2021, 2:57 a.m. UTC
From: Feng Zhou <zhoufeng.zf@bytedance.com>
Subject: fs/proc/kcore.c: add mmap interface

When we do the kernel monitor, use the DRGN
(https://github.com/osandov/drgn) access to kernel data structures, found
that the system calls a lot.  DRGN is implemented by reading /proc/kcore. 
After looking at the kcore code, it is found that kcore does not implement
mmap, resulting in frequent context switching triggered by read. 
Therefore, we want to add mmap interface to optimize performance.  Since
vmalloc and module areas will change with allocation and release,
consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT
and KCORE_RAM.

The test results:
1. the default version of kcore
real 11.00
user 8.53
sys 3.59

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.64  128.578319          12  11168701           pread64
...
------ ----------- ----------- --------- --------- ----------------
100.00  129.042853              11193748       966 total

2. added kcore for the mmap interface
real 6.44
user 7.32
sys 0.24

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
32.94    0.130120          24      5317       315 futex
11.66    0.046077          21      2231         1 lstat
 9.23    0.036449         177       206           mmap
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.395077                 25435       971 total

The test results show that the number of system calls and time
consumption are significantly reduced.

Link: https://lkml.kernel.org/r/20210704062208.7898-1-zhoufeng.zf@bytedance.com
Co-developed-by: Ying Chen <chenying.kernel@bytedance.com>
Signed-off-by: Ying Chen <chenying.kernel@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

Comments

Linus Torvalds Sept. 8, 2021, 6:13 p.m. UTC | #1
On Tue, Sep 7, 2021 at 7:57 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> After looking at the kcore code, it is found that kcore does not implement
> mmap, resulting in frequent context switching triggered by read.
> Therefore, we want to add mmap interface to optimize performance.  Since
> vmalloc and module areas will change with allocation and release,
> consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT
> and KCORE_RAM.

Honestly, I still hate this patch.

The last time people wanted to speed up /dev/kcore accesses, it was
all for black-hat reasons and speeding up kernel attacks.

And this code just makes me nervous even aside from that, because I do
not understand what the heck it's doing.


> +       if (kern_addr_valid(start)) {
> +               if (m->type == KCORE_RAM)
> +                       pfn = __pa(start) >> PAGE_SHIFT;
> +               else if (m->type == KCORE_TEXT)
> +                       pfn = __pa_symbol(start) >> PAGE_SHIFT;

Why is "__pa(start)" right in one situation, and "__pa_symbol(start)"
in another.

So this just makes me go "this is all confusing, dangerous, and the
use-case is dubious".

Mapping kernel memory is dangerous. The use-cases for it are dubious.
The patch isn't obvious.

All of that screams "I'll skip this".

           Linus
Feng Zhou Sept. 9, 2021, 9:56 a.m. UTC | #2
在 2021/9/9 上午2:13, Linus Torvalds 写道:
> On Tue, Sep 7, 2021 at 7:57 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> After looking at the kcore code, it is found that kcore does not implement
>> mmap, resulting in frequent context switching triggered by read.
>> Therefore, we want to add mmap interface to optimize performance.  Since
>> vmalloc and module areas will change with allocation and release,
>> consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT
>> and KCORE_RAM.
> Honestly, I still hate this patch.
>
> The last time people wanted to speed up /dev/kcore accesses, it was
> all for black-hat reasons and speeding up kernel attacks.
>
> And this code just makes me nervous even aside from that, because I do
> not understand what the heck it's doing.
>
>
>> +       if (kern_addr_valid(start)) {
>> +               if (m->type == KCORE_RAM)
>> +                       pfn = __pa(start) >> PAGE_SHIFT;
>> +               else if (m->type == KCORE_TEXT)
>> +                       pfn = __pa_symbol(start) >> PAGE_SHIFT;
> Why is "__pa(start)" right in one situation, and "__pa_symbol(start)"
> in another.

Hi, Linus

The use here is refer to "read_kcore" in fs/proc/kcore.c.

list_for_each_entry(m, &kclist_head, list) {
...
if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
phdr->p_paddr = __pa(m->addr);
else if (m->type == KCORE_TEXT)
phdr->p_paddr = __pa_symbol(m->addr);
...
}

Ensure consistency of usage.


>
> So this just makes me go "this is all confusing, dangerous, and the
> use-case is dubious".
>
> Mapping kernel memory is dangerous. The use-cases for it are dubious.
> The patch isn't obvious.

Kcore mmap kernel memory just has read permission.

+	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &kcore_mmap_ops;

Compared to the read interface, kcore mmap has no increased risk, just
reduce context switching.

>
> All of that screams "I'll skip this".
>
>             Linus
Linus Torvalds Sept. 9, 2021, 5:32 p.m. UTC | #3
On Thu, Sep 9, 2021 at 2:57 AM Feng Zhou <zhoufeng.zf@bytedance.com> wrote:
>
> Compared to the read interface, kcore mmap has no increased risk, just
> reduce context switching.

Yes, but the main worry is "do we really need to make this faster and easier"?

Because one of the possible main users is literally the black hat "I
got root, now I want to do a rootkit".

And mmap is very very different from read().

Why? Because using mmap() you can now track changes in realtime (ie
you poll waiting for some memory location to change, possibly even
with hardware assist - like watchpoints or ring3 "monitor/mwait").

So mmap() of the kernel memory literally acts as a prime tool for
looking at and exploiting races.

Which is why I'm _very_ leery of these kinds of interfaces.

Do they have possible good uses? Yes. But the bad uses seem to
actually dominate. The good users don't seem _that_ critical, while
the bad users would seem to absolutely love this interface.

See my argument?

This is basically a very dangerous interface. The fact that it is
read-only doesn't change that at all.

               Linus
Linus Torvalds Sept. 9, 2021, 5:34 p.m. UTC | #4
On Thu, Sep 9, 2021 at 10:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is basically a very dangerous interface. The fact that it is
> read-only doesn't change that at all.

Just to clarify - we've done dangerous interfaces in the past. We
probably have tons of them still that I haven't even thought about,
and I think I can hear somebody sniggering from miles away about
various bugs that cause much more obvious security holes that I didn't
think of.

But the fact that we might have other holes that can be misused,
doesn't mean we have to add new ones that seem to be almost designed
for misuse.

          Linus
Feng Zhou Sept. 10, 2021, 3:18 a.m. UTC | #5
在 2021/9/10 上午1:34, Linus Torvalds 写道:
> On Thu, Sep 9, 2021 at 10:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> This is basically a very dangerous interface. The fact that it is
>> read-only doesn't change that at all.
> Just to clarify - we've done dangerous interfaces in the past. We
> probably have tons of them still that I haven't even thought about,
> and I think I can hear somebody sniggering from miles away about
> various bugs that cause much more obvious security holes that I didn't
> think of.
>
> But the fact that we might have other holes that can be misused,
> doesn't mean we have to add new ones that seem to be almost designed
> for misuse.
>
>            Linus

Ok, got it.
David Hildenbrand Sept. 10, 2021, 10:08 a.m. UTC | #6
On 08.09.21 04:57, Andrew Morton wrote:
> From: Feng Zhou <zhoufeng.zf@bytedance.com>
> Subject: fs/proc/kcore.c: add mmap interface
> 
> When we do the kernel monitor, use the DRGN
> (https://github.com/osandov/drgn) access to kernel data structures, found
> that the system calls a lot.  DRGN is implemented by reading /proc/kcore.
> After looking at the kcore code, it is found that kcore does not implement
> mmap, resulting in frequent context switching triggered by read.
> Therefore, we want to add mmap interface to optimize performance.  Since
> vmalloc and module areas will change with allocation and release,
> consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT
> and KCORE_RAM.
> 
> The test results:
> 1. the default version of kcore
> real 11.00
> user 8.53
> sys 3.59
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 99.64  128.578319          12  11168701           pread64
> ...
> ------ ----------- ----------- --------- --------- ----------------
> 100.00  129.042853              11193748       966 total
> 
> 2. added kcore for the mmap interface
> real 6.44
> user 7.32
> sys 0.24
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 32.94    0.130120          24      5317       315 futex
> 11.66    0.046077          21      2231         1 lstat
>   9.23    0.036449         177       206           mmap
> ...
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.395077                 25435       971 total
> 
> The test results show that the number of system calls and time
> consumption are significantly reduced.
> 
> Link: https://lkml.kernel.org/r/20210704062208.7898-1-zhoufeng.zf@bytedance.com
> Co-developed-by: Ying Chen <chenying.kernel@bytedance.com>
> Signed-off-by: Ying Chen <chenying.kernel@bytedance.com>
> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: Chengming Zhou <zhouchengming@bytedance.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>   fs/proc/kcore.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 73 insertions(+)
> 
> --- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface
> +++ a/fs/proc/kcore.c
> @@ -614,11 +614,84 @@ static int release_kcore(struct inode *i
>   	return 0;
>   }
>   
> +static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +
> +static const struct vm_operations_struct kcore_mmap_ops = {
> +	.fault = mmap_kcore_fault,
> +};
> +
> +static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
> +{
> +	size_t size = vma->vm_end - vma->vm_start;
> +	u64 start, end, pfn;
> +	int nphdr;
> +	size_t data_offset;
> +	size_t phdrs_len, notes_len;
> +	struct kcore_list *m = NULL;
> +	int ret = 0;
> +
> +	down_read(&kclist_lock);
> +
> +	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
> +
> +	data_offset &= PAGE_MASK;
> +	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
> +	if (start < data_offset) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	start = kc_offset_to_vaddr(start - data_offset);
> +	end   = start + size;
> +
> +	list_for_each_entry(m, &kclist_head, list) {
> +		if (start >= m->addr && end <= m->addr + m->size)
> +			break;
> +	}
> +
> +	if (&m->list == &kclist_head) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
> +		ret = -EPERM;
> +		goto out;
> +	}
> +
> +	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> +	vma->vm_flags |= VM_MIXEDMAP;
> +	vma->vm_ops = &kcore_mmap_ops;
> +

This breaks all my efforts to sanitize /proc/kore access for virtio-mem.

Is there still a way to nack this?

Sorry I didn't spot this any sooner.
Mike Rapoport Sept. 10, 2021, noon UTC | #7
On Fri, Sep 10, 2021 at 12:08:17PM +0200, David Hildenbrand wrote:
> On 08.09.21 04:57, Andrew Morton wrote:
> > +
> > +	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
> > +		ret = -EPERM;
> > +		goto out;
> > +	}
> > +
> > +	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> > +	vma->vm_flags |= VM_MIXEDMAP;
> > +	vma->vm_ops = &kcore_mmap_ops;
> > +
> 
> This breaks all my efforts to sanitize /proc/kore access for virtio-mem.
> 
> Is there still a way to nack this?

Already done:

https://lore.kernel.org/mm-commits/CAHk-=wgQ+8kmczLLKCY7yDsGHQBRcZESKd1dNaKbrjUgbWeb3A@mail.gmail.com

and down the same thread.
 
> Sorry I didn't spot this any sooner.
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand Sept. 10, 2021, 12:02 p.m. UTC | #8
On 10.09.21 14:00, Mike Rapoport wrote:
> On Fri, Sep 10, 2021 at 12:08:17PM +0200, David Hildenbrand wrote:
>> On 08.09.21 04:57, Andrew Morton wrote:
>>> +
>>> +	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
>>> +		ret = -EPERM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
>>> +	vma->vm_flags |= VM_MIXEDMAP;
>>> +	vma->vm_ops = &kcore_mmap_ops;
>>> +
>>
>> This breaks all my efforts to sanitize /proc/kore access for virtio-mem.
>>
>> Is there still a way to nack this?
> 
> Already done:
> 
> https://lore.kernel.org/mm-commits/CAHk-=wgQ+8kmczLLKCY7yDsGHQBRcZESKd1dNaKbrjUgbWeb3A@mail.gmail.com
> 
> and down the same thread.
>   

Yeah, spotted Linus' reply just after I sent my reply.

... afterwards I thought about the implications fpr secretmem and 
ordinary memory hotunplug and was happy that we dodged this bullet. :)
diff mbox series

Patch

--- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface
+++ a/fs/proc/kcore.c
@@ -614,11 +614,84 @@  static int release_kcore(struct inode *i
 	return 0;
 }
 
+static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct kcore_mmap_ops = {
+	.fault = mmap_kcore_fault,
+};
+
+static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
+{
+	size_t size = vma->vm_end - vma->vm_start;
+	u64 start, end, pfn;
+	int nphdr;
+	size_t data_offset;
+	size_t phdrs_len, notes_len;
+	struct kcore_list *m = NULL;
+	int ret = 0;
+
+	down_read(&kclist_lock);
+
+	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
+
+	data_offset &= PAGE_MASK;
+	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+	if (start < data_offset) {
+		ret = -EINVAL;
+		goto out;
+	}
+	start = kc_offset_to_vaddr(start - data_offset);
+	end   = start + size;
+
+	list_for_each_entry(m, &kclist_head, list) {
+		if (start >= m->addr && end <= m->addr + m->size)
+			break;
+	}
+
+	if (&m->list == &kclist_head) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &kcore_mmap_ops;
+
+	if (kern_addr_valid(start)) {
+		if (m->type == KCORE_RAM)
+			pfn = __pa(start) >> PAGE_SHIFT;
+		else if (m->type == KCORE_TEXT)
+			pfn = __pa_symbol(start) >> PAGE_SHIFT;
+		else {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, pfn, size,
+				vma->vm_page_prot);
+	} else {
+		ret = -EFAULT;
+	}
+
+out:
+	up_read(&kclist_lock);
+	return ret;
+}
+
 static const struct proc_ops kcore_proc_ops = {
 	.proc_read	= read_kcore,
 	.proc_open	= open_kcore,
 	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
+	.proc_mmap	= mmap_kcore,
 };
 
 /* just remember that we have to update kcore */