Message ID | 20190401141549.3F4721FE@viggo.jf.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/mpx: fix recursive munmap() corruption |
On Mon, 1 Apr 2019, Dave Hansen wrote: > diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c > --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700 > +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700 > @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un > return -EINVAL; > > len = PAGE_ALIGN(len); > + end = start + len; > if (len == 0) > return -EINVAL; > > + /* > + * arch_unmap() might do unmaps itself. It must be called > + * and finish any rbtree manipulation before this code > + * runs and also starts to manipulate the rbtree. > + */ > + arch_unmap(mm, start, end); ... > -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma, > - unsigned long start, unsigned long end) > +static inline void arch_unmap(struct mm_struct *mm, unsigned long start, > + unsigned long end) While you fixed up the asm-generic thing, this breaks arch/um and arch/unicore32. For those the fixup is trivial: just remove the vma argument. But it also breaks powerpc, and there I'm not sure whether moving arch_unmap() to the beginning of __do_munmap() is safe. Michael??? Aside from that, the powerpc variant looks suspicious: static inline void arch_unmap(struct mm_struct *mm, unsigned long start, unsigned long end) { if (start <= mm->context.vdso_base && mm->context.vdso_base < end) mm->context.vdso_base = 0; } Shouldn't that be: if (start >= mm->context.vdso_base && mm->context.vdso_base < end) Hmm? Thanks, tglx
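For readers following the thread, here is a minimal user-space model of the check being questioned (this is not kernel code; the addresses and the one-page VDSO below are assumptions made purely for illustration). It shows which munmap() ranges the existing powerpc test reacts to:

/* Toy model of the powerpc arch_unmap() VDSO check discussed above. */
#include <stdio.h>

#define VDSO_BASE 0x7000UL              /* assumed 1-page VDSO at 0x7000 */

static int existing_check(unsigned long start, unsigned long end)
{
        /* current powerpc code: start <= vdso_base && vdso_base < end */
        return start <= VDSO_BASE && VDSO_BASE < end;
}

int main(void)
{
        unsigned long ranges[][2] = {
                { 0x6000, 0x9000 },     /* surrounds the VDSO page       */
                { 0x7000, 0x8000 },     /* exactly the VDSO page         */
                { 0x5000, 0x7000 },     /* ends right below the VDSO     */
                { 0x8000, 0x9000 },     /* starts right above the VDSO   */
        };

        for (unsigned int i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
                printf("munmap [%#lx, %#lx) -> %s\n", ranges[i][0], ranges[i][1],
                       existing_check(ranges[i][0], ranges[i][1]) ?
                       "clears vdso_base" : "leaves vdso_base");
        return 0;
}

With a one-page VDSO the test fires exactly when that page lies inside [start, end), covering both the surrounding and exact-match cases; this is the intent Laurent explains further down the thread.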
Thomas Gleixner <tglx@linutronix.de> writes: > On Mon, 1 Apr 2019, Dave Hansen wrote: >> diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c >> --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700 >> +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700 >> @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un >> return -EINVAL; >> >> len = PAGE_ALIGN(len); >> + end = start + len; >> if (len == 0) >> return -EINVAL; >> >> + /* >> + * arch_unmap() might do unmaps itself. It must be called >> + * and finish any rbtree manipulation before this code >> + * runs and also starts to manipulate the rbtree. >> + */ >> + arch_unmap(mm, start, end); > > ... > >> -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma, >> - unsigned long start, unsigned long end) >> +static inline void arch_unmap(struct mm_struct *mm, unsigned long start, >> + unsigned long end) > > While you fixed up the asm-generic thing, this breaks arch/um and > arch/unicorn32. For those the fixup is trivial by removing the vma > argument. > > But itt also breaks powerpc and there I'm not sure whether moving > arch_unmap() to the beginning of __do_munmap() is safe. Micheal??? I don't know for sure but I think it should be fine. That code is just there to handle CRIU unmapping/remapping the VDSO. So that either needs to happen while the process is stopped or it needs to handle races anyway, so I don't see how the placement within the unmap path should matter. > Aside of that the powerpc variant looks suspicious: > > static inline void arch_unmap(struct mm_struct *mm, > unsigned long start, unsigned long end) > { > if (start <= mm->context.vdso_base && mm->context.vdso_base < end) > mm->context.vdso_base = 0; > } > > Shouldn't that be: > > if (start >= mm->context.vdso_base && mm->context.vdso_base < end) > > Hmm? Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it. Thanks for spotting it! cheers
Le 20/04/2019 à 12:31, Michael Ellerman a écrit : > Thomas Gleixner <tglx@linutronix.de> writes: >> On Mon, 1 Apr 2019, Dave Hansen wrote: >>> diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c >>> --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700 >>> +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700 >>> @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un >>> return -EINVAL; >>> >>> len = PAGE_ALIGN(len); >>> + end = start + len; >>> if (len == 0) >>> return -EINVAL; >>> >>> + /* >>> + * arch_unmap() might do unmaps itself. It must be called >>> + * and finish any rbtree manipulation before this code >>> + * runs and also starts to manipulate the rbtree. >>> + */ >>> + arch_unmap(mm, start, end); >> >> ... >> >>> -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma, >>> - unsigned long start, unsigned long end) >>> +static inline void arch_unmap(struct mm_struct *mm, unsigned long start, >>> + unsigned long end) >> >> While you fixed up the asm-generic thing, this breaks arch/um and >> arch/unicorn32. For those the fixup is trivial by removing the vma >> argument. >> >> But itt also breaks powerpc and there I'm not sure whether moving >> arch_unmap() to the beginning of __do_munmap() is safe. Micheal??? > > I don't know for sure but I think it should be fine. That code is just > there to handle CRIU unmapping/remapping the VDSO. So that either needs > to happen while the process is stopped or it needs to handle races > anyway, so I don't see how the placement within the unmap path should > matter. My only concern is the error path. Calling arch_unmap() before handling any error case means that it will have to be undone, and there is no way to do so. I don't know what the rationale is for moving arch_unmap() to the beginning of __do_munmap(), but the error paths must be managed. >> Aside of that the powerpc variant looks suspicious: >> >> static inline void arch_unmap(struct mm_struct *mm, >> unsigned long start, unsigned long end) >> { >> if (start <= mm->context.vdso_base && mm->context.vdso_base < end) >> mm->context.vdso_base = 0; >> } >> >> Shouldn't that be: >> >> if (start >= mm->context.vdso_base && mm->context.vdso_base < end) >> >> Hmm? > > Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it. > Thanks for spotting it! I have to admit that I had to read that code carefully before answering. There are 2 assumptions here: 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap()). 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc). The idea is to handle a munmap() call surrounding the VDSO area: | VDSO | ^start ^end This is covered by this test, as the munmap() matching the exact boundaries of the VDSO is handled too. Am I missing something? Cheers, Laurent.
On Tue, 23 Apr 2019, Laurent Dufour wrote: > Le 20/04/2019 à 12:31, Michael Ellerman a écrit : > > Thomas Gleixner <tglx@linutronix.de> writes: > > > Aside of that the powerpc variant looks suspicious: > > > > > > static inline void arch_unmap(struct mm_struct *mm, > > > unsigned long start, unsigned long end) > > > { > > > if (start <= mm->context.vdso_base && mm->context.vdso_base < end) > > > mm->context.vdso_base = 0; > > > } > > > > > > Shouldn't that be: > > > > > > if (start >= mm->context.vdso_base && mm->context.vdso_base < end) > > > > > > Hmm? > > > > Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it. > > Thanks for spotting it! > > I've to admit that I had to read that code carefully before answering. > > There are 2 assumptions here: > 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). > 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on > powerpc). > > The idea is to handle a munmap() call surrounding the VDSO area: > | VDSO | > ^start ^end > > This is covered by this test, as the munmap() matching the exact boundaries of > the VDSO is handled too. > > Am I missing something ? Well if this is the intention, then you missed to add a comment explaining it :) Thanks, tglx
Le 23/04/2019 à 15:34, Thomas Gleixner a écrit : > On Tue, 23 Apr 2019, Laurent Dufour wrote: >> Le 20/04/2019 à 12:31, Michael Ellerman a écrit : >>> Thomas Gleixner <tglx@linutronix.de> writes: >>>> Aside of that the powerpc variant looks suspicious: >>>> >>>> static inline void arch_unmap(struct mm_struct *mm, >>>> unsigned long start, unsigned long end) >>>> { >>>> if (start <= mm->context.vdso_base && mm->context.vdso_base < end) >>>> mm->context.vdso_base = 0; >>>> } >>>> >>>> Shouldn't that be: >>>> >>>> if (start >= mm->context.vdso_base && mm->context.vdso_base < end) >>>> >>>> Hmm? >>> >>> Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it. >>> Thanks for spotting it! >> >> I've to admit that I had to read that code carefully before answering. >> >> There are 2 assumptions here: >> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). >> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on >> powerpc). >> >> The idea is to handle a munmap() call surrounding the VDSO area: >> | VDSO | >> ^start ^end >> >> This is covered by this test, as the munmap() matching the exact boundaries of >> the VDSO is handled too. >> >> Am I missing something ? > > Well if this is the intention, then you missed to add a comment explaining it :) > > Thanks, > > tglx You're right, and I was thinking the same when I read that code this morning ;) I'll propose a patch to add an explicit comment. Thanks, Laurent.
On 4/23/19 4:16 AM, Laurent Dufour wrote: > My only concern is the error path. > Calling arch_unmap() before handling any error case means that it will > have to be undo and there is no way to do so. Is there a practical scenario where munmap() of the VDSO can split a VMA? If the VDSO is guaranteed to be a single page, it would have to be a scenario where munmap() was called on a range that included the VDSO *and* another VMA that we failed to split. But, the scenario would have to be that someone tried to munmap() the VDSO and something adjacent, the munmap() failed, and they kept on using the VDSO and expected the special signal and perf behavior to be maintained. BTW, what keeps the VDSO from merging with an adjacent VMA? Is it just the vm_ops->close that comes from special_mapping_vmops? > I don't know what is the rational to move arch_unmap() to the beginning > of __do_munmap() but the error paths must be managed. It's in the changelog: https://patchwork.kernel.org/patch/10909727/ But, the tl;dr version is: x86 is recursively calling __do_munmap() (via arch_unmap()) in a spot where the internal rbtree data is inconsistent, which causes all kinds of fun. If we move arch_unmap() to before __do_munmap() does any data structure manipulation, the recursive call doesn't get confused any more. > There are 2 assumptions here: > 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). > 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc) Are you sure about #2? The 'vdso64_pages' variable seems rather unnecessary if the VDSO is only 1 page. ;)
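As a rough sketch of the ordering problem described above (simplified, not the literal kernel call chain; the MPX internals are abbreviated):

/*
 * Old ordering (simplified):
 *
 *   __do_munmap(mm, start, len)
 *       find_vma(), __split_vma(), detach_vmas_to_be_unmapped()  <- rbtree modified
 *       arch_unmap(mm, vma, start, end)
 *           mpx_notify_unmap()
 *               ... recursive do_munmap() on MPX bounds tables ...
 *                   <- walks and modifies a tree that is already half-updated,
 *                      and may free VMAs the outer call still points at
 *       unmap_region(), remove_vma_list()
 *
 * New ordering (this patch):
 *
 *   __do_munmap(mm, start, len)
 *       arch_unmap(mm, start, end)   <- any recursive unmap sees a consistent tree
 *       find_vma(), __split_vma(), detach_vmas_to_be_unmapped()
 *       unmap_region(), remove_vma_list()
 */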
Le 23/04/2019 à 18:04, Dave Hansen a écrit : > On 4/23/19 4:16 AM, Laurent Dufour wrote: >> My only concern is the error path. >> Calling arch_unmap() before handling any error case means that it will >> have to be undo and there is no way to do so. > > Is there a practical scenario where munmap() of the VDSO can split a > VMA? If the VDSO is guaranteed to be a single page, it would have to be > a scenario where munmap() was called on a range that included the VDSO > *and* other VMA that we failed to split. > > But, the scenario would have to be that someone tried to munmap() the > VDSO and something adjacent, the munmap() failed, and they kept on using > the VDSO and expected the special signal and perf behavior to be maintained. I have to admit that this should not be a common scenario, and unmapping the VDSO is not so common anyway. > BTW, what keeps the VDSO from merging with an adjacent VMA? Is it just > the vm_ops->close that comes from special_mapping_vmops? I'd think so. >> I don't know what is the rational to move arch_unmap() to the beginning >> of __do_munmap() but the error paths must be managed. > > It's in the changelog: > > https://patchwork.kernel.org/patch/10909727/ > > But, the tl;dr version is: x86 is recursively calling __do_unmap() (via > arch_unmap()) in a spot where the internal rbtree data is inconsistent, > which causes all kinds of fun. If we move arch_unmap() to before > __do_munmap() does any data structure manipulation, the recursive call > doesn't get confused any more. If only powerpc is impacted, I guess this would be fine, but what about the other architectures? >> There are 2 assumptions here: >> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). >> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc) > > Are you sure about #2? The 'vdso64_pages' variable seems rather > unnecessary if the VDSO is only 1 page. ;) Hum, not so sure now ;) I got confused, only the header is one page. The test works on a best-effort basis and doesn't cover the case where only a few pages inside the VDSO are unmapped (start > mm->context.vdso_base). This is not what CRIU is doing, and so this was enough for CRIU support. Michael, do you think there is a need to handle all the possibilities here, since the only user is CRIU and unmapping the VDSO is not such a good idea for other processes?
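Continuing the toy model from earlier in the thread, but with a multi-page VDSO (the size is again an assumption), the gap Laurent describes looks like this: unmapping only the tail of the VDSO leaves vdso_base stale.

/* Toy model: the existing check misses a partial unmap of the VDSO. */
#include <stdio.h>

#define VDSO_BASE 0x7000UL              /* assumed 2-page VDSO: [0x7000, 0x9000) */
#define VDSO_END  0x9000UL

static int existing_check(unsigned long start, unsigned long end)
{
        return start <= VDSO_BASE && VDSO_BASE < end;
}

int main(void)
{
        unsigned long start = 0x8000UL, end = 0x9000UL; /* second VDSO page only */

        printf("munmap [%#lx, %#lx): vdso_base is %s\n", start, end,
               existing_check(start, end) ? "cleared" : "left stale");
        return 0;
}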
Laurent Dufour <ldufour@linux.vnet.ibm.com> writes: > Le 23/04/2019 à 18:04, Dave Hansen a écrit : >> On 4/23/19 4:16 AM, Laurent Dufour wrote: ... >>> There are 2 assumptions here: >>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). >>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc) >> >> Are you sure about #2? The 'vdso64_pages' variable seems rather >> unnecessary if the VDSO is only 1 page. ;) > > Hum, not so sure now ;) > I got confused, only the header is one page. > The test is working as a best effort, and don't cover the case where > only few pages inside the VDSO are unmmapped (start > > mm->context.vdso_base). This is not what CRIU is doing and so this was > enough for CRIU support. > > Michael, do you think there is a need to manage all the possibility > here, since the only user is CRIU and unmapping the VDSO is not a so > good idea for other processes ? Couldn't we implement the semantic that if any part of the VDSO is unmapped then vdso_base is set to zero? That should be fairly easy, eg: if (start < vdso_end && end >= mm->context.vdso_base) mm->context.vdso_base = 0; We might need to add vdso_end to the mm->context, but that should be OK. That seems like it would work for CRIU and make sense in general? cheers
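In the same toy model, Michael's proposed test (with a vdso_end assumed to be tracked alongside vdso_base) now catches the partial unmap as well:

/* Toy model of the semantic proposed above: clear vdso_base whenever
 * any part of the VDSO falls inside the unmapped range. */
#include <stdio.h>

static const unsigned long vdso_base = 0x7000UL, vdso_end = 0x9000UL;

static int proposed_check(unsigned long start, unsigned long end)
{
        return start < vdso_end && end >= vdso_base;
}

int main(void)
{
        unsigned long ranges[][2] = {
                { 0x6000, 0xa000 },     /* surrounds the whole VDSO                */
                { 0x8000, 0x9000 },     /* tail page only (missed by the old test) */
                { 0x5000, 0x6000 },     /* unrelated range below the VDSO          */
        };

        for (unsigned int i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
                printf("munmap [%#lx, %#lx) -> %s\n", ranges[i][0], ranges[i][1],
                       proposed_check(ranges[i][0], ranges[i][1]) ?
                       "clears vdso_base" : "leaves vdso_base");
        return 0;
}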
Le 01/05/2019 à 12:32, Michael Ellerman a écrit : > Laurent Dufour <ldufour@linux.vnet.ibm.com> writes: >> Le 23/04/2019 à 18:04, Dave Hansen a écrit : >>> On 4/23/19 4:16 AM, Laurent Dufour wrote: > ... >>>> There are 2 assumptions here: >>>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). >>>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc) >>> >>> Are you sure about #2? The 'vdso64_pages' variable seems rather >>> unnecessary if the VDSO is only 1 page. ;) >> >> Hum, not so sure now ;) >> I got confused, only the header is one page. >> The test is working as a best effort, and don't cover the case where >> only few pages inside the VDSO are unmmapped (start > >> mm->context.vdso_base). This is not what CRIU is doing and so this was >> enough for CRIU support. >> >> Michael, do you think there is a need to manage all the possibility >> here, since the only user is CRIU and unmapping the VDSO is not a so >> good idea for other processes ? > > Couldn't we implement the semantic that if any part of the VDSO is > unmapped then vdso_base is set to zero? That should be fairly easy, eg: > > if (start < vdso_end && end >= mm->context.vdso_base) > mm->context.vdso_base = 0; > > > We might need to add vdso_end to the mm->context, but that should be OK. > > That seems like it would work for CRIU and make sense in general? Sorry for the late answer, yes this would make more sense. Here is a patch doing that. Cheers, Laurent From 5b64a86c2a8042c7785c3d3f5e58e954a2c8c843 Mon Sep 17 00:00:00 2001 From: Laurent Dufour <ldufour@linux.ibm.com> Date: Tue, 7 May 2019 16:29:46 +0200 Subject: [PATCH] powerpc/vdso: handle generic unmap of the VDSO Make the unmap of the VDSO more generic by checking for the start and end of the VDSO. This implies to add the vdso_end address in the mm_context_t structure. 
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> --- arch/powerpc/include/asm/book3s/32/mmu-hash.h | 3 ++- arch/powerpc/include/asm/book3s/64/mmu.h | 2 +- arch/powerpc/include/asm/mm-arch-hooks.h | 5 ++++- arch/powerpc/include/asm/mmu_context.h | 21 +++++++++++++++++-- arch/powerpc/include/asm/nohash/32/mmu-40x.h | 2 +- arch/powerpc/include/asm/nohash/32/mmu-44x.h | 2 +- arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 2 +- arch/powerpc/include/asm/nohash/mmu-book3e.h | 2 +- arch/powerpc/kernel/vdso.c | 2 ++ 9 files changed, 32 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/mmu-hash.h b/arch/powerpc/include/asm/book3s/32/mmu-hash.h index 2e277ca0170f..452152b809fc 100644 --- a/arch/powerpc/include/asm/book3s/32/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/32/mmu-hash.h @@ -29,6 +29,7 @@ #define BPP_RX 0x01 /* Read only */ #define BPP_RW 0x02 /* Read/write */ + #ifndef __ASSEMBLY__ /* Contort a phys_addr_t into the right format/bits for a BAT */ #ifdef CONFIG_PHYS_64BIT @@ -90,7 +91,7 @@ struct hash_pte { typedef struct { unsigned long id; - unsigned long vdso_base; + unsigned long vdso_base, vdso_end; } mm_context_t; void update_bats(void); diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 74d24201fc4f..7a5a91a0696f 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -120,7 +120,7 @@ typedef struct { struct npu_context *npu_context; struct hash_mm_context *hash_context; - unsigned long vdso_base; + unsigned long vdso_base, vdso_end; /* * pagetable fragment support */ diff --git a/arch/powerpc/include/asm/mm-arch-hooks.h b/arch/powerpc/include/asm/mm-arch-hooks.h index f2a2da895897..1e2d527d3d1f 100644 --- a/arch/powerpc/include/asm/mm-arch-hooks.h +++ b/arch/powerpc/include/asm/mm-arch-hooks.h @@ -16,12 +16,15 @@ static inline void arch_remap(struct mm_struct *mm, unsigned long old_start, unsigned long old_end, unsigned long new_start, unsigned long new_end) { + unsigned long length = mm->context.vdso_end - mm->context.vdso_base; /* * mremap() doesn't allow moving multiple vmas so we can limit the * check to old_start == vdso_base. */ - if (old_start == mm->context.vdso_base) + if (old_start == mm->context.vdso_base) { + mm->context.vdso_end = new_start + length; mm->context.vdso_base = new_start; + } } #define arch_remap arch_remap diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 611204e588b9..c24f5ed0aeff 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -235,8 +235,25 @@ static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long start, unsigned long end) { - if (start <= mm->context.vdso_base && mm->context.vdso_base < end) - mm->context.vdso_base = 0; + unsigned long vdso_base, vdso_end; + + vdso_base = mm->context.vdso_base; + vdso_end = mm->context.vdso_end; + + /* + * Partial unmapping of pages inside the VDSO, is consider equivalent + * to unmapping the VDSO. 
+ * + * case 1 > | VDSO | < + * case 2 > | < | + * case 3 | > < | + * case 4 | > | < + */ + + if ((start <= vdso_base && vdso_end <= end) || /* 1 */ + (vdso_base <= start && start < vdso_end) || /* 3,4 */ + (vdso_base < end && end <= vdso_end)) /* 2,3 */ + mm->context.vdso_base = mm->context.vdso_end = 0; } static inline void arch_bprm_mm_init(struct mm_struct *mm, diff --git a/arch/powerpc/include/asm/nohash/32/mmu-40x.h b/arch/powerpc/include/asm/nohash/32/mmu-40x.h index 74f4edb5916e..98739ba9d36e 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-40x.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-40x.h @@ -57,7 +57,7 @@ typedef struct { unsigned int id; unsigned int active; - unsigned long vdso_base; + unsigned long vdso_base, vdso_end; } mm_context_t; #endif /* !__ASSEMBLY__ */ diff --git a/arch/powerpc/include/asm/nohash/32/mmu-44x.h b/arch/powerpc/include/asm/nohash/32/mmu-44x.h index 28aa3b339c5e..de1d5b1c8cec 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-44x.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-44x.h @@ -108,7 +108,7 @@ extern unsigned int tlb_44x_index; typedef struct { unsigned int id; unsigned int active; - unsigned long vdso_base; + unsigned long vdso_base, vdso_end; } mm_context_t; /* patch sites */ diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 76af5b0cb16e..414ce6638b20 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -209,7 +209,7 @@ struct slice_mask { typedef struct { unsigned int id; unsigned int active; - unsigned long vdso_base; + unsigned long vdso_base, vdso_end; #ifdef CONFIG_PPC_MM_SLICES u16 user_psize; /* page size index */ unsigned char low_slices_psize[SLICE_ARRAY_SIZE]; diff --git a/arch/powerpc/include/asm/nohash/mmu-book3e.h b/arch/powerpc/include/asm/nohash/mmu-book3e.h index 4c9777d256fb..8f406ad9fe25 100644 --- a/arch/powerpc/include/asm/nohash/mmu-book3e.h +++ b/arch/powerpc/include/asm/nohash/mmu-book3e.h @@ -229,7 +229,7 @@ extern unsigned int tlbcam_index; typedef struct { unsigned int id; unsigned int active; - unsigned long vdso_base; + unsigned long vdso_base, vdso_end; } mm_context_t; /* Page size definitions, common between 32 and 64-bit diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index a31b6234fcd7..263f820cc666 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -182,6 +182,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) #endif current->mm->context.vdso_base = 0; + current->mm->context.vdso_end = 0; /* vDSO has a problem and was disabled, just don't "enable" it for the * process @@ -217,6 +218,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) * will fail to recognise it as a vDSO (since arch_vma_name fails). */ current->mm->context.vdso_base = vdso_base; + current->mm->context.vdso_end = vdso_base + (vdso_pages << PAGE_SHIFT); /* * our vma flags don't have VM_WRITE so by default, the process isn't
Hi Laurent Le 07/05/2019 à 18:35, Laurent Dufour a écrit : > Le 01/05/2019 à 12:32, Michael Ellerman a écrit : >> Laurent Dufour <ldufour@linux.vnet.ibm.com> writes: >>> Le 23/04/2019 à 18:04, Dave Hansen a écrit : >>>> On 4/23/19 4:16 AM, Laurent Dufour wrote: >> ... >>>>> There are 2 assumptions here: >>>>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap(). >>>>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc) >>>> >>>> Are you sure about #2? The 'vdso64_pages' variable seems rather >>>> unnecessary if the VDSO is only 1 page. ;) >>> >>> Hum, not so sure now ;) >>> I got confused, only the header is one page. >>> The test is working as a best effort, and don't cover the case where >>> only few pages inside the VDSO are unmmapped (start > >>> mm->context.vdso_base). This is not what CRIU is doing and so this was >>> enough for CRIU support. >>> >>> Michael, do you think there is a need to manage all the possibility >>> here, since the only user is CRIU and unmapping the VDSO is not a so >>> good idea for other processes ? >> >> Couldn't we implement the semantic that if any part of the VDSO is >> unmapped then vdso_base is set to zero? That should be fairly easy, eg: >> >> if (start < vdso_end && end >= mm->context.vdso_base) >> mm->context.vdso_base = 0; >> >> >> We might need to add vdso_end to the mm->context, but that should be OK. >> >> That seems like it would work for CRIU and make sense in general? > > Sorry for the late answer, yes this would make more sense. > > Here is a patch doing that. > In your patch, the test seems overkill: + if ((start <= vdso_base && vdso_end <= end) || /* 1 */ + (vdso_base <= start && start < vdso_end) || /* 3,4 */ + (vdso_base < end && end <= vdso_end)) /* 2,3 */ + mm->context.vdso_base = mm->context.vdso_end = 0; What about if (start < vdso_end && vdso_start < end) mm->context.vdso_base = mm->context.vdso_end = 0; This should cover all cases, or am I missing something ? And do we really need to store vdso_end in the context ? I think it should be possible to re-calculate it: the size of the VDSO should be (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO, and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO. Christophe
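A quick user-space brute-force check confirms Christophe's point: reading his 'vdso_start' as 'vdso_base', the simple overlap test and the three-clause test from Laurent's patch agree on every page-aligned range with start < end (which __do_munmap() guarantees). The addresses are again toy values.

#include <assert.h>
#include <stdio.h>

#define PAGE 0x1000UL

static const unsigned long vdso_base = 0x7000UL, vdso_end = 0x9000UL;

static int three_clause(unsigned long start, unsigned long end)
{
        return (start <= vdso_base && vdso_end <= end) ||       /* 1 */
               (vdso_base <= start && start < vdso_end) ||      /* 3,4 */
               (vdso_base < end && end <= vdso_end);            /* 2,3 */
}

static int simple_overlap(unsigned long start, unsigned long end)
{
        return start < vdso_end && vdso_base < end;
}

int main(void)
{
        for (unsigned long start = 0; start < 0x10000UL; start += PAGE)
                for (unsigned long end = start + PAGE; end <= 0x10000UL; end += PAGE)
                        assert(three_clause(start, end) == simple_overlap(start, end));

        printf("three-clause and simple overlap tests agree on all ranges\n");
        return 0;
}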
Le 23/10/2020 à 14:28, Christophe Leroy a écrit : > Hi Laurent > > Le 07/05/2019 à 18:35, Laurent Dufour a écrit : >> Le 01/05/2019 à 12:32, Michael Ellerman a écrit : >>> Laurent Dufour <ldufour@linux.vnet.ibm.com> writes: >>>> Le 23/04/2019 à 18:04, Dave Hansen a écrit : >>>>> On 4/23/19 4:16 AM, Laurent Dufour wrote: >>> ... >>>>>> There are 2 assumptions here: >>>>>> 1. 'start' and 'end' are page aligned (this is guaranteed by >>>>>> __do_munmap(). >>>>>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store >>>>>> on powerpc) >>>>> >>>>> Are you sure about #2? The 'vdso64_pages' variable seems rather >>>>> unnecessary if the VDSO is only 1 page. ;) >>>> >>>> Hum, not so sure now ;) >>>> I got confused, only the header is one page. >>>> The test is working as a best effort, and don't cover the case where >>>> only few pages inside the VDSO are unmmapped (start > >>>> mm->context.vdso_base). This is not what CRIU is doing and so this was >>>> enough for CRIU support. >>>> >>>> Michael, do you think there is a need to manage all the possibility >>>> here, since the only user is CRIU and unmapping the VDSO is not a so >>>> good idea for other processes ? >>> >>> Couldn't we implement the semantic that if any part of the VDSO is >>> unmapped then vdso_base is set to zero? That should be fairly easy, eg: >>> >>> if (start < vdso_end && end >= mm->context.vdso_base) >>> mm->context.vdso_base = 0; >>> >>> >>> We might need to add vdso_end to the mm->context, but that should be OK. >>> >>> That seems like it would work for CRIU and make sense in general? >> >> Sorry for the late answer, yes this would make more sense. >> >> Here is a patch doing that. >> > > In your patch, the test seems overkill: > > + if ((start <= vdso_base && vdso_end <= end) || /* 1 */ > + (vdso_base <= start && start < vdso_end) || /* 3,4 */ > + (vdso_base < end && end <= vdso_end)) /* 2,3 */ > + mm->context.vdso_base = mm->context.vdso_end = 0; > > What about > > if (start < vdso_end && vdso_start < end) > mm->context.vdso_base = mm->context.vdso_end = 0; > > This should cover all cases, or am I missing something ? > > > And do we really need to store vdso_end in the context ? > I think it should be possible to re-calculate it: the size of the VDSO should be > (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO, and (&vdso64_end - > &vdso64_start) + PAGE_SIZE for the 64 bits VDSO. Thanks Christophe for the advise. That is covering all the cases, and indeed is similar to the Michael's proposal I missed last year. I'll send a patch fixing this issue following your proposal. Cheers, Laurent.
Hi Laurent, Christophe, Michael, all, On 11/3/20 5:11 PM, Laurent Dufour wrote: > Le 23/10/2020 à 14:28, Christophe Leroy a écrit : [..] >>>> That seems like it would work for CRIU and make sense in general? >>> >>> Sorry for the late answer, yes this would make more sense. >>> >>> Here is a patch doing that. >>> >> >> In your patch, the test seems overkill: >> >> + if ((start <= vdso_base && vdso_end <= end) || /* 1 */ >> + (vdso_base <= start && start < vdso_end) || /* 3,4 */ >> + (vdso_base < end && end <= vdso_end)) /* 2,3 */ >> + mm->context.vdso_base = mm->context.vdso_end = 0; >> >> What about >> >> if (start < vdso_end && vdso_start < end) >> mm->context.vdso_base = mm->context.vdso_end = 0; >> >> This should cover all cases, or am I missing something ? >> >> >> And do we really need to store vdso_end in the context ? >> I think it should be possible to re-calculate it: the size of the VDSO >> should be (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO, >> and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO. > > Thanks Christophe for the advise. > > That is covering all the cases, and indeed is similar to the Michael's > proposal I missed last year. > > I'll send a patch fixing this issue following your proposal. It's probably not necessary anymore. I've sent patches [1], currently in akpm, the last one forbids splitting of vm_special_mapping. So, a user is able munmap() or mremap() vdso as a whole, but not partly. [1]: https://lore.kernel.org/linux-mm/20201013013416.390574-1-dima@arista.com/ Thanks, Dmitry
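For context, the approach in the series Dmitry links boils down to refusing to split special mappings such as the VDSO, so a partial munmap()/mremap() fails instead of leaving a half-unmapped VDSO behind. A very rough illustration follows; it is not the actual patch, and the hook name is taken from later mainline kernels, so details may differ from the series above.

/*
 * Illustration only: the special-mapping vm_ops gain a split-refusal
 * hook, so __split_vma() on the VDSO VMA fails instead of cutting it
 * in two.
 *
 *   static int special_mapping_split(struct vm_area_struct *vma,
 *                                    unsigned long addr)
 *   {
 *           return -EINVAL;
 *   }
 *
 *   ... wired up in mm/mmap.c as  .may_split = special_mapping_split,
 *       in special_mapping_vmops.
 */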
Le 03/11/2020 à 22:08, Dmitry Safonov a écrit : > Hi Laurent, Christophe, Michael, all, > > On 11/3/20 5:11 PM, Laurent Dufour wrote: >> Le 23/10/2020 à 14:28, Christophe Leroy a écrit : > [..] >>>>> That seems like it would work for CRIU and make sense in general? >>>> >>>> Sorry for the late answer, yes this would make more sense. >>>> >>>> Here is a patch doing that. >>>> >>> >>> In your patch, the test seems overkill: >>> >>> + if ((start <= vdso_base && vdso_end <= end) || /* 1 */ >>> + (vdso_base <= start && start < vdso_end) || /* 3,4 */ >>> + (vdso_base < end && end <= vdso_end)) /* 2,3 */ >>> + mm->context.vdso_base = mm->context.vdso_end = 0; >>> >>> What about >>> >>> if (start < vdso_end && vdso_start < end) >>> mm->context.vdso_base = mm->context.vdso_end = 0; >>> >>> This should cover all cases, or am I missing something ? >>> >>> >>> And do we really need to store vdso_end in the context ? >>> I think it should be possible to re-calculate it: the size of the VDSO >>> should be (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO, >>> and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO. >> >> Thanks Christophe for the advise. >> >> That is covering all the cases, and indeed is similar to the Michael's >> proposal I missed last year. >> >> I'll send a patch fixing this issue following your proposal. > > It's probably not necessary anymore. I've sent patches [1], currently in > akpm, the last one forbids splitting of vm_special_mapping. > So, a user is able munmap() or mremap() vdso as a whole, but not partly. Hi Dmitry, That's a good thing too, but I think my patch is still valid in the PowerPC code, fixing a bad check, even if some corner cases are handled earlier in the code. > [1]: > https://lore.kernel.org/linux-mm/20201013013416.390574-1-dima@arista.com/ > > Thanks, > Dmitry >
==== The long story: munmap() has a couple of pieces: 1. Find the affected VMA(s) 2. Split the start/end one(s) if neceesary 3. Pull the VMAs out of the rbtree 4. Actually zap the memory via unmap_region(), including freeing page tables (or queueing them to be freed). 5. Fixup some of the accounting (like fput()) and actually free the VMA itself. I decided to put the arch_unmap() call right afer #3. This was *just* before mmap_sem looked like it might get downgraded (it won't in this context), but it looked right. It wasn't. Richard Biener reported a test that shows this in dmesg: [1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551 [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576 What triggered this was the recursive do_munmap() called via arch_unmap(). It was freeing page tables that has not been properly zapped. But, the problem was bigger than this. For one, arch_unmap() can free VMAs. But, the calling __do_munmap() has variables that *point* to VMAs and obviously can't handle them just getting freed while the pointer is still valid. I tried a couple of things here. First, I tried to fix the page table freeing problem in isolation, but I then found the VMA issue. I also tried having the MPX code return a flag if it modified the rbtree which would force __do_munmap() to re-walk to restart. That spiralled out of control in complexity pretty fast. Just moving arch_unmap() and accepting that the bonkers failure case might eat some bounds tables seems like the simplest viable fix. Reported-by: Richard Biener <rguenther@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Lutomirski <luto@amacapital.net> Cc: x86@kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org --- b/arch/x86/include/asm/mmu_context.h | 6 +++--- b/arch/x86/include/asm/mpx.h | 5 ++--- b/arch/x86/mm/mpx.c | 10 ++++++---- b/include/asm-generic/mm_hooks.h | 1 - b/mm/mmap.c | 15 ++++++++------- 5 files changed, 19 insertions(+), 18 deletions(-) diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700 +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700 @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un return -EINVAL; len = PAGE_ALIGN(len); + end = start + len; if (len == 0) return -EINVAL; + /* + * arch_unmap() might do unmaps itself. It must be called + * and finish any rbtree manipulation before this code + * runs and also starts to manipulate the rbtree. + */ + arch_unmap(mm, start, end); + /* Find the first overlapping VMA */ vma = find_vma(mm, start); if (!vma) @@ -2742,7 +2750,6 @@ int __do_munmap(struct mm_struct *mm, un /* we have start < vma->vm_end */ /* if it doesn't overlap, we have nothing.. */ - end = start + len; if (vma->vm_start >= end) return 0; @@ -2812,12 +2819,6 @@ int __do_munmap(struct mm_struct *mm, un /* Detach vmas from rbtree */ detach_vmas_to_be_unmapped(mm, vma, prev, end); - /* - * mpx unmap needs to be called with mmap_sem held for write. - * It is safe to call it before unmap_region(). 
- */ - arch_unmap(mm, vma, start, end); - if (downgrade) downgrade_write(&mm->mmap_sem); diff -puN arch/x86/include/asm/mmu_context.h~mpx-rss-pass-no-vma arch/x86/include/asm/mmu_context.h --- a/arch/x86/include/asm/mmu_context.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.412411123 -0700 +++ b/arch/x86/include/asm/mmu_context.h 2019-04-01 06:56:53.423411123 -0700 @@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(str mpx_mm_init(mm); } -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long start, unsigned long end) +static inline void arch_unmap(struct mm_struct *mm, unsigned long start, + unsigned long end) { /* * mpx_notify_unmap() goes and reads a rarely-hot @@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_ * consistently wrong. */ if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX))) - mpx_notify_unmap(mm, vma, start, end); + mpx_notify_unmap(mm, start, end); } /* diff -puN include/asm-generic/mm_hooks.h~mpx-rss-pass-no-vma include/asm-generic/mm_hooks.h --- a/include/asm-generic/mm_hooks.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.414411123 -0700 +++ b/include/asm-generic/mm_hooks.h 2019-04-01 06:56:53.423411123 -0700 @@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct } static inline void arch_unmap(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long start, unsigned long end) { } diff -puN arch/x86/mm/mpx.c~mpx-rss-pass-no-vma arch/x86/mm/mpx.c --- a/arch/x86/mm/mpx.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.416411123 -0700 +++ b/arch/x86/mm/mpx.c 2019-04-01 06:56:53.423411123 -0700 @@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_st * the virtual address region start...end have already been split if * necessary, and the 'vma' is the first vma in this range (start -> end). */ -void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long start, unsigned long end) +void mpx_notify_unmap(struct mm_struct *mm, unsigned long start, + unsigned long end) { + struct vm_area_struct *vma; int ret; /* @@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct * * which should not occur normally. Being strict about it here * helps ensure that we do not have an exploitable stack overflow. */ - do { + vma = find_vma(mm, start); + while (vma && vma->vm_start < end) { if (vma->vm_flags & VM_MPX) return; vma = vma->vm_next; - } while (vma && vma->vm_start < end); + } ret = mpx_unmap_tables(mm, start, end); if (ret) diff -puN arch/x86/include/asm/mpx.h~mpx-rss-pass-no-vma arch/x86/include/asm/mpx.h --- a/arch/x86/include/asm/mpx.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.418411123 -0700 +++ b/arch/x86/include/asm/mpx.h 2019-04-01 06:56:53.424411123 -0700 @@ -78,8 +78,8 @@ static inline void mpx_mm_init(struct mm */ mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR; } -void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long start, unsigned long end); +void mpx_notify_unmap(struct mm_struct *mm, unsigned long start, + unsigned long end); unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len, unsigned long flags); @@ -100,7 +100,6 @@ static inline void mpx_mm_init(struct mm { } static inline void mpx_notify_unmap(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long start, unsigned long end) { }