Message ID | 20210908025730.S88ylmikU%akpm@linux-foundation.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [001/147] mm, slub: don't call flush_all() from slab_debug_trace_open() | expand |
On Tue, Sep 7, 2021 at 7:57 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > After looking at the kcore code, it is found that kcore does not implement > mmap, resulting in frequent context switching triggered by read. > Therefore, we want to add mmap interface to optimize performance. Since > vmalloc and module areas will change with allocation and release, > consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT > and KCORE_RAM. Honestly, I still hate this patch. The last time people wanted to speed up /dev/kcore accesses, it was all for black-hat reasons and speeding up kernel attacks. And this code just makes me nervous even aside from that, because I do not understand what the heck it's doing. > + if (kern_addr_valid(start)) { > + if (m->type == KCORE_RAM) > + pfn = __pa(start) >> PAGE_SHIFT; > + else if (m->type == KCORE_TEXT) > + pfn = __pa_symbol(start) >> PAGE_SHIFT; Why is "__pa(start)" right in one situation, and "__pa_symbol(start)" in another. So this just makes me go "this is all confusing, dangerous, and the use-case is dubious". Mapping kernel memory is dangerous. The use-cases for it are dubious. The patch isn't obvious. All of that screams "I'll skip this". Linus
在 2021/9/9 上午2:13, Linus Torvalds 写道: > On Tue, Sep 7, 2021 at 7:57 PM Andrew Morton <akpm@linux-foundation.org> wrote: >> After looking at the kcore code, it is found that kcore does not implement >> mmap, resulting in frequent context switching triggered by read. >> Therefore, we want to add mmap interface to optimize performance. Since >> vmalloc and module areas will change with allocation and release, >> consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT >> and KCORE_RAM. > Honestly, I still hate this patch. > > The last time people wanted to speed up /dev/kcore accesses, it was > all for black-hat reasons and speeding up kernel attacks. > > And this code just makes me nervous even aside from that, because I do > not understand what the heck it's doing. > > >> + if (kern_addr_valid(start)) { >> + if (m->type == KCORE_RAM) >> + pfn = __pa(start) >> PAGE_SHIFT; >> + else if (m->type == KCORE_TEXT) >> + pfn = __pa_symbol(start) >> PAGE_SHIFT; > Why is "__pa(start)" right in one situation, and "__pa_symbol(start)" > in another. Hi, Linus The use here is refer to "read_kcore" in fs/proc/kcore.c. list_for_each_entry(m, &kclist_head, list) { ... if (m->type == KCORE_RAM || m->type == KCORE_REMAP) phdr->p_paddr = __pa(m->addr); else if (m->type == KCORE_TEXT) phdr->p_paddr = __pa_symbol(m->addr); ... } Ensure consistency of usage. > > So this just makes me go "this is all confusing, dangerous, and the > use-case is dubious". > > Mapping kernel memory is dangerous. The use-cases for it are dubious. > The patch isn't obvious. Kcore mmap kernel memory just has read permission. + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) { + ret = -EPERM; + goto out; + } + + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); + vma->vm_flags |= VM_MIXEDMAP; + vma->vm_ops = &kcore_mmap_ops; Compared to the read interface, kcore mmap has no increased risk, just reduce context switching. > > All of that screams "I'll skip this". > > Linus
On Thu, Sep 9, 2021 at 2:57 AM Feng Zhou <zhoufeng.zf@bytedance.com> wrote: > > Compared to the read interface, kcore mmap has no increased risk, just > reduce context switching. Yes, but the main worry is "do we really need to make this faster and easier"? Because one of the possible main users is literally the black hat "I got root, now I want to do a rootkit". And mmap is very very different from read(). Why? Because using mmap() you can now track changes in realtime (ie you poll waiting for some memory location to change, possibly even with hardware assist - like watchpoints or ring3 "monitor/mwait"). So mmap() of the kernel memory literally acts as a prime tool for looking at and exploiting races. Which is why I'm _very_ leery of these kinds of interfaces. Do they have possible good uses? Yes. But the bad uses seem to actually dominate. The good users don't seem _that_ critical, while the bad users would seem to absolutely love this interface. See my argument? This is basically a very dangerous interface. The fact that it is read-only doesn't change that at all. Linus
On Thu, Sep 9, 2021 at 10:32 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > This is basically a very dangerous interface. The fact that it is > read-only doesn't change that at all. Just to clarify - we've done dangerous interfaces in the past. We probably have tons of them still that I haven't even thought about, and I think I can hear somebody sniggering from miles away about various bugs that cause much more obvious security holes that I didn't think of. But the fact that we might have other holes that can be misused, doesn't mean we have to add new ones that seem to be almost designed for misuse. Linus
在 2021/9/10 上午1:34, Linus Torvalds 写道: > On Thu, Sep 9, 2021 at 10:32 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> This is basically a very dangerous interface. The fact that it is >> read-only doesn't change that at all. > Just to clarify - we've done dangerous interfaces in the past. We > probably have tons of them still that I haven't even thought about, > and I think I can hear somebody sniggering from miles away about > various bugs that cause much more obvious security holes that I didn't > think of. > > But the fact that we might have other holes that can be misused, > doesn't mean we have to add new ones that seem to be almost designed > for misuse. > > Linus Ok, got it.
On 08.09.21 04:57, Andrew Morton wrote: > From: Feng Zhou <zhoufeng.zf@bytedance.com> > Subject: fs/proc/kcore.c: add mmap interface > > When we do the kernel monitor, use the DRGN > (https://github.com/osandov/drgn) access to kernel data structures, found > that the system calls a lot. DRGN is implemented by reading /proc/kcore. > After looking at the kcore code, it is found that kcore does not implement > mmap, resulting in frequent context switching triggered by read. > Therefore, we want to add mmap interface to optimize performance. Since > vmalloc and module areas will change with allocation and release, > consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT > and KCORE_RAM. > > The test results: > 1. the default version of kcore > real 11.00 > user 8.53 > sys 3.59 > > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 99.64 128.578319 12 11168701 pread64 > ... > ------ ----------- ----------- --------- --------- ---------------- > 100.00 129.042853 11193748 966 total > > 2. added kcore for the mmap interface > real 6.44 > user 7.32 > sys 0.24 > > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 32.94 0.130120 24 5317 315 futex > 11.66 0.046077 21 2231 1 lstat > 9.23 0.036449 177 206 mmap > ... > ------ ----------- ----------- --------- --------- ---------------- > 100.00 0.395077 25435 971 total > > The test results show that the number of system calls and time > consumption are significantly reduced. > > Link: https://lkml.kernel.org/r/20210704062208.7898-1-zhoufeng.zf@bytedance.com > Co-developed-by: Ying Chen <chenying.kernel@bytedance.com> > Signed-off-by: Ying Chen <chenying.kernel@bytedance.com> > Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> > Cc: Alexey Dobriyan <adobriyan@gmail.com> > Cc: Mike Rapoport <rppt@kernel.org> > Cc: Muchun Song <songmuchun@bytedance.com> > Cc: Chengming Zhou <zhouchengming@bytedance.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > fs/proc/kcore.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 73 insertions(+) > > --- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface > +++ a/fs/proc/kcore.c > @@ -614,11 +614,84 @@ static int release_kcore(struct inode *i > return 0; > } > > +static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf) > +{ > + return VM_FAULT_SIGBUS; > +} > + > +static const struct vm_operations_struct kcore_mmap_ops = { > + .fault = mmap_kcore_fault, > +}; > + > +static int mmap_kcore(struct file *file, struct vm_area_struct *vma) > +{ > + size_t size = vma->vm_end - vma->vm_start; > + u64 start, end, pfn; > + int nphdr; > + size_t data_offset; > + size_t phdrs_len, notes_len; > + struct kcore_list *m = NULL; > + int ret = 0; > + > + down_read(&kclist_lock); > + > + get_kcore_size(&nphdr, &phdrs_len, ¬es_len, &data_offset); > + > + data_offset &= PAGE_MASK; > + start = (u64)vma->vm_pgoff << PAGE_SHIFT; > + if (start < data_offset) { > + ret = -EINVAL; > + goto out; > + } > + start = kc_offset_to_vaddr(start - data_offset); > + end = start + size; > + > + list_for_each_entry(m, &kclist_head, list) { > + if (start >= m->addr && end <= m->addr + m->size) > + break; > + } > + > + if (&m->list == &kclist_head) { > + ret = -EINVAL; > + goto out; > + } > + > + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) { > + ret = -EPERM; > + goto out; > + } > + > + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); > + vma->vm_flags |= VM_MIXEDMAP; > + vma->vm_ops = &kcore_mmap_ops; > + This breaks all my efforts to sanitize /proc/kore access for virtio-mem. Is there still a way to nack this? Sorry I didn't spot this any sooner.
On Fri, Sep 10, 2021 at 12:08:17PM +0200, David Hildenbrand wrote: > On 08.09.21 04:57, Andrew Morton wrote: > > + > > + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) { > > + ret = -EPERM; > > + goto out; > > + } > > + > > + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); > > + vma->vm_flags |= VM_MIXEDMAP; > > + vma->vm_ops = &kcore_mmap_ops; > > + > > This breaks all my efforts to sanitize /proc/kore access for virtio-mem. > > Is there still a way to nack this? Already done: https://lore.kernel.org/mm-commits/CAHk-=wgQ+8kmczLLKCY7yDsGHQBRcZESKd1dNaKbrjUgbWeb3A@mail.gmail.com and down the same thread. > Sorry I didn't spot this any sooner. > > -- > Thanks, > > David / dhildenb >
On 10.09.21 14:00, Mike Rapoport wrote: > On Fri, Sep 10, 2021 at 12:08:17PM +0200, David Hildenbrand wrote: >> On 08.09.21 04:57, Andrew Morton wrote: >>> + >>> + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) { >>> + ret = -EPERM; >>> + goto out; >>> + } >>> + >>> + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); >>> + vma->vm_flags |= VM_MIXEDMAP; >>> + vma->vm_ops = &kcore_mmap_ops; >>> + >> >> This breaks all my efforts to sanitize /proc/kore access for virtio-mem. >> >> Is there still a way to nack this? > > Already done: > > https://lore.kernel.org/mm-commits/CAHk-=wgQ+8kmczLLKCY7yDsGHQBRcZESKd1dNaKbrjUgbWeb3A@mail.gmail.com > > and down the same thread. > Yeah, spotted Linus' reply just after I sent my reply. ... afterwards I thought about the implications fpr secretmem and ordinary memory hotunplug and was happy that we dodged this bullet. :)
--- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface +++ a/fs/proc/kcore.c @@ -614,11 +614,84 @@ static int release_kcore(struct inode *i return 0; } +static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf) +{ + return VM_FAULT_SIGBUS; +} + +static const struct vm_operations_struct kcore_mmap_ops = { + .fault = mmap_kcore_fault, +}; + +static int mmap_kcore(struct file *file, struct vm_area_struct *vma) +{ + size_t size = vma->vm_end - vma->vm_start; + u64 start, end, pfn; + int nphdr; + size_t data_offset; + size_t phdrs_len, notes_len; + struct kcore_list *m = NULL; + int ret = 0; + + down_read(&kclist_lock); + + get_kcore_size(&nphdr, &phdrs_len, ¬es_len, &data_offset); + + data_offset &= PAGE_MASK; + start = (u64)vma->vm_pgoff << PAGE_SHIFT; + if (start < data_offset) { + ret = -EINVAL; + goto out; + } + start = kc_offset_to_vaddr(start - data_offset); + end = start + size; + + list_for_each_entry(m, &kclist_head, list) { + if (start >= m->addr && end <= m->addr + m->size) + break; + } + + if (&m->list == &kclist_head) { + ret = -EINVAL; + goto out; + } + + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) { + ret = -EPERM; + goto out; + } + + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); + vma->vm_flags |= VM_MIXEDMAP; + vma->vm_ops = &kcore_mmap_ops; + + if (kern_addr_valid(start)) { + if (m->type == KCORE_RAM) + pfn = __pa(start) >> PAGE_SHIFT; + else if (m->type == KCORE_TEXT) + pfn = __pa_symbol(start) >> PAGE_SHIFT; + else { + ret = -EFAULT; + goto out; + } + + ret = remap_pfn_range(vma, vma->vm_start, pfn, size, + vma->vm_page_prot); + } else { + ret = -EFAULT; + } + +out: + up_read(&kclist_lock); + return ret; +} + static const struct proc_ops kcore_proc_ops = { .proc_read = read_kcore, .proc_open = open_kcore, .proc_release = release_kcore, .proc_lseek = default_llseek, + .proc_mmap = mmap_kcore, }; /* just remember that we have to update kcore */