Message ID | 20220818224218.2399791-1-song@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | vmalloc_exec for modules and BPF programs | expand |
> On Aug 18, 2022, at 3:42 PM, Song Liu <song@kernel.org> wrote: > > This set is a prototype that allows dynamic kernel text (modules, bpf > programs, various trampolines, etc.) to share huge pages. The idea is > similar to Peter's suggestion in [1]. Please refer to each patch for > more detais. > > The ultimate goal is to only host kernel text in 2MB pages (for x86_64). > > Please share your comments on this. > > Thanks! > > [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/ Hi Peter, Could you please share your feedback on this? Thanks, Song PS: I guess vger dropped my patch again. :( The set is also available at https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git branch vmalloc_exec. > > Song Liu (5): > vmalloc: introduce vmalloc_exec and vfree_exec > bpf: use vmalloc_exec > modules, x86: use vmalloc_exec for module core > vmalloc_exec: share a huge page with kernel text > vmalloc: vfree_exec: free unused vm_struct > > arch/x86/Kconfig | 1 + > arch/x86/kernel/alternative.c | 30 ++++- > arch/x86/kernel/module.c | 1 + > arch/x86/mm/init_64.c | 3 +- > include/linux/vmalloc.h | 16 +-- > kernel/bpf/core.c | 155 ++------------------------ > kernel/module/main.c | 23 ++-- > kernel/module/strict_rwx.c | 3 - > kernel/trace/ftrace.c | 3 +- > mm/nommu.c | 7 ++ > mm/vmalloc.c | 200 +++++++++++++++++++++++++++++----- > 11 files changed, 239 insertions(+), 203 deletions(-) > > -- > 2.30.2
On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
> Could you please share your feedback on this?
I've looked at it all of 5 minutes, so perhaps I've missed something.
However, I'm a little surprised you went with a second tree instead of
doing the top-down thing for data. The way you did it makes it hard to
have guard pages between text and data.
> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote: >> Could you please share your feedback on this? > > I've looked at it all of 5 minutes, so perhaps I've missed something. > > However, I'm a little surprised you went with a second tree instead of > doing the top-down thing for data. The way you did it makes it hard to > have guard pages between text and data. I didn't realize the importance of the guard pages. But it is not too hard to do it with this approach. For each 2MB text page, we can reserve 4kB on the beginning and end of it. Would this work? There are a couple benefits from a second tree: 1. It allows text allocations to go below PAGE_SIZE granularity, while data allocations would still use PAGE_SIZE granularity, which is the same as current code. 2. Text allocate requires mapping one vm_struct to many vmap_area. Putting text allocations in a separate tree make it easier to handle this. (Well, I haven't finished this logic yet). 3. A separate tree makes it easier to use text tail page, [_etext, roundup(_etext, PMD_SIZE)], for modules and BPF programs. Does this make sense? Do you see other downsides with a second tree? Thanks, Song
On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote: > > > > On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote: > >> Could you please share your feedback on this? > > > > I've looked at it all of 5 minutes, so perhaps I've missed something. > > > > However, I'm a little surprised you went with a second tree instead of > > doing the top-down thing for data. The way you did it makes it hard to > > have guard pages between text and data. > > I didn't realize the importance of the guard pages. But it is not too I'm not sure how important it is, just seems like a good idea to trap anybody trying to cross that divide. Also, to me it seems like a good idea to have a single large contiguous text region instead of splintered 2M pages. > hard to do it with this approach. For each 2MB text page, we can reserve > 4kB on the beginning and end of it. Would this work? Typically a guard page has different protections (as in none what so ever) so that every access goes *splat*.
Le 23/08/2022 à 07:42, Peter Zijlstra a écrit : > On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote: >> >> >>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote: >>>> Could you please share your feedback on this? >>> >>> I've looked at it all of 5 minutes, so perhaps I've missed something. >>> >>> However, I'm a little surprised you went with a second tree instead of >>> doing the top-down thing for data. The way you did it makes it hard to >>> have guard pages between text and data. >> >> I didn't realize the importance of the guard pages. But it is not too > > I'm not sure how important it is, just seems like a good idea to trap > anybody trying to cross that divide. Also, to me it seems like a good > idea to have a single large contiguous text region instead of splintered > 2M pages. > >> hard to do it with this approach. For each 2MB text page, we can reserve >> 4kB on the beginning and end of it. Would this work? > > Typically a guard page has different protections (as in none what so > ever) so that every access goes *splat*. > Text is RO-X, on some architectures even only X. So the only real thing to protect against is bad execution, isn't it ?. So I guess having some areas with invalid or trap instructions would be enough ?
> On Aug 22, 2022, at 10:42 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote: >> >> >>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote: >>>> Could you please share your feedback on this? >>> >>> I've looked at it all of 5 minutes, so perhaps I've missed something. >>> >>> However, I'm a little surprised you went with a second tree instead of >>> doing the top-down thing for data. The way you did it makes it hard to >>> have guard pages between text and data. >> >> I didn't realize the importance of the guard pages. But it is not too > > I'm not sure how important it is, just seems like a good idea to trap > anybody trying to cross that divide. Also, to me it seems like a good > idea to have a single large contiguous text region instead of splintered > 2M pages. A single large contiguous text region is great. However, it is not easy to keep it contiguous. For example, when we load a big module, and then unload it. It is not easy to recycle the space. Say we load module-x-v1, which is 4MB, and uses 2 huge pages. Then we load a small BPF program after it. The address space looks like: MODULE_VADDR to MODULE_VADDR + 4MB: module-x-v1 MODULE_VADDR + 4MB to MODULE_VADDR + 4MB + 4kB: bpf_prog_xxxx When we unload module-x-v1, there will be 4MB hole in the address space. If we then load module-x-v2, which is 4.1MB in size, we cannot reuse that hole, because the module is a little too big for the hole. AFAICT, to use the space efficiently, we will have to deal with splintered 2MB pages. Does this make sense? Thanks, Song > >> hard to do it with this approach. For each 2MB text page, we can reserve >> 4kB on the beginning and end of it. Would this work? > > Typically a guard page has different protections (as in none what so > ever) so that every access goes *splat*.
> On Aug 22, 2022, at 11:39 PM, Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > > > Le 23/08/2022 à 07:42, Peter Zijlstra a écrit : >> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote: >>> >>> >>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>>> >>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote: >>>>> Could you please share your feedback on this? >>>> >>>> I've looked at it all of 5 minutes, so perhaps I've missed something. >>>> >>>> However, I'm a little surprised you went with a second tree instead of >>>> doing the top-down thing for data. The way you did it makes it hard to >>>> have guard pages between text and data. >>> >>> I didn't realize the importance of the guard pages. But it is not too >> >> I'm not sure how important it is, just seems like a good idea to trap >> anybody trying to cross that divide. Also, to me it seems like a good >> idea to have a single large contiguous text region instead of splintered >> 2M pages. >> >>> hard to do it with this approach. For each 2MB text page, we can reserve >>> 4kB on the beginning and end of it. Would this work? >> >> Typically a guard page has different protections (as in none what so >> ever) so that every access goes *splat*. > > > Text is RO-X, on some architectures even only X. So the only real thing > to protect against is bad execution, isn't it ?. So I guess having some > areas with invalid or trap instructions would be enough ? Agreed that filling with trap instructions should be enough. Thanks, Song
Hi Peter, > On Aug 22, 2022, at 9:56 AM, Song Liu <songliubraving@fb.com> wrote: > >> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> >> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote: >>> Could you please share your feedback on this? >> >> I've looked at it all of 5 minutes, so perhaps I've missed something. >> >> However, I'm a little surprised you went with a second tree instead of >> doing the top-down thing for data. The way you did it makes it hard to >> have guard pages between text and data. > > I didn't realize the importance of the guard pages. But it is not too > hard to do it with this approach. For each 2MB text page, we can reserve > 4kB on the beginning and end of it. Would this work? > > There are a couple benefits from a second tree: > > 1. It allows text allocations to go below PAGE_SIZE granularity, while > data allocations would still use PAGE_SIZE granularity, which is the > same as current code. > 2. Text allocate requires mapping one vm_struct to many vmap_area. Putting > text allocations in a separate tree make it easier to handle this. > (Well, I haven't finished this logic yet). > 3. A separate tree makes it easier to use text tail page, > [_etext, roundup(_etext, PMD_SIZE)], for modules and BPF programs. > > Does this make sense? Do you see other downsides with a second tree? Did these make sense? Do you have future comments that I would address in future versions? Thanks, Song