Message ID | 20230721034643.616851-1-jannh@google.com (mailing list archive) |
---|---|
State | New |
Series | mm: Lock VMA in dup_anon_vma() before setting ->anon_vma |
On Thu, Jul 20, 2023 at 8:46 PM Jann Horn <jannh@google.com> wrote:
>
> When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
> VMA that is being expanded to cover the area previously occupied by another
> VMA. This currently happens while `dst` is not write-locked.
>
> This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
> the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
> page faults can happen on `dst` under the per-VMA lock.
> This is already icky in itself, since such page faults can now install
> pages into `dst` that are attached to an `anon_vma` that is not yet tied
> back to the `anon_vma` with an `anon_vma_chain`.
> But if `anon_vma_clone()` fails due to an out-of-memory error, things get
> much worse: `anon_vma_clone()` then reverts `dst->anon_vma` back to NULL,
> and `dst` remains completely unconnected to the `anon_vma`, even though we
> can have pages in the area covered by `dst` that point to the `anon_vma`.
>
> This means the `anon_vma` of such pages can be freed while the pages are
> still mapped into userspace, which leads to UAF when a helper like
> folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
>
> This theoretically is a security bug, but I believe it is really hard to
> actually trigger as an unprivileged user because it requires that you can
> make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
> pretty hard to prevent that.
>
> I think doing the vma_start_write() call inside dup_anon_vma() is the most
> straightforward fix for now.

Indeed, this is a valid fix because we end up modifying the 'dst' without
locking it. Locking in vma_merge()/vma_expand() happens inside
vma_prepare() but that's too late because dup_anon_vma() would already
happen.

>
> For a kernel-assisted reproducer, see the notes section of the patch mail.
>
> Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
> Cc: stable@vger.kernel.org
> Cc: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Jann Horn <jannh@google.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
> To reproduce, patch mm/rmap.c by adding "#include <linux/delay.h>" and
> changing anon_vma_chain_alloc() like this:
>
>  static inline struct anon_vma_chain *anon_vma_chain_alloc(gfp_t gfp)
>  {
> +       if (strcmp(current->comm, "FAILME") == 0) {
> +               // inject delay and error
> +               mdelay(2000);
> +               return NULL;
> +       }
>         return kmem_cache_alloc(anon_vma_chain_cachep, gfp);
>  }
>
> Then build with KASAN and run this reproducer:
>
>
> #define _GNU_SOURCE
> #include <pthread.h>
> #include <err.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/prctl.h>
>
> #define SYSCHK(x) ({          \
>   typeof(x) __res = (x);      \
>   if (__res == (typeof(x))-1L) \
>     err(1, "SYSCHK(" #x ")"); \
>   __res;                      \
> })
>
> static char *area;
> static volatile int fault_thread_done;
> static volatile int spin_launch;
>
> static void *fault_thread(void *dummy) {
>   while (!spin_launch) /*spin*/;
>   sleep(1);
>   area[0] = 1;
>   fault_thread_done = 1;
>   return NULL;
> }
>
> int main(void) {
>   fault_thread_done = 0;
>   pthread_t thread;
>   if (pthread_create(&thread, NULL, fault_thread, NULL))
>     errx(1, "pthread_create");
>
>   // allocator spam
>   int fd = SYSCHK(open("/etc/hostname", O_RDONLY));
>   char *vmas[10000];
>   for (int i=0; i<5000; i++) {
>     vmas[i] = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0));
>     *vmas[i] = 1;
>   }
>
>   // create a 3-page area, no anon_vma at this point, with guard vma behind it to prevent merging with neighboring anon_vmas
>   area = SYSCHK(mmap((void*)0x10000, 0x4000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
>   SYSCHK(mmap(area+0x3000, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED, fd, 0));
>   // turn it into 3 VMAs
>   SYSCHK(mprotect(area+0x1000, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC));
>
>   // create an anon_vma for the tail VMA
>   area[0x2000] = 1;
>
>   // more allocator spam
>   for (int i=5000; i<10000; i++) {
>     vmas[i] = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0));
>     *vmas[i] = 1;
>   }
>
>   printf("with anon_vma on tail VMA:\n\n");
>   system("cat /proc/$PPID/smaps | head -n55");
>   printf("\n\n");
>
>   spin_launch=1;
>   // mprotect() will try to merge the VMAs but bail out due to the injected
>   // allocator failure
>   SYSCHK(prctl(PR_SET_NAME, "FAILME"));
>   SYSCHK(mprotect(area+0x1000, 0x1000, PROT_READ|PROT_WRITE));
>   SYSCHK(prctl(PR_SET_NAME, "normal"));
>
>   printf("after merge from mprotect:\n\n");
>   if (!fault_thread_done)
>     errx(1, "fault thread not done yet???");
>   system("cat /proc/$PPID/smaps | head -n55");
>   printf("\n\n");
>
>   // release the anon_vma
>   SYSCHK(munmap(area+0x1000, 0x2000));
>
>   // release spam
>   for (int i=0; i<10000; i++)
>     SYSCHK(munmap(vmas[i], 0x1000));
>
>   // wait for RCU
>   sleep(2);
>
>   // trigger UAF?
>   printf("trying to trigger uaf...\n");
>   SYSCHK(madvise(area, 0x1000, 21/*MADV_PAGEOUT*/));
> }
>
>
> You should get an ASAN splat like:
>
> BUG: KASAN: use-after-free in folio_lock_anon_vma_read+0x9d/0x2f0
> Read of size 8 at addr ffff8880053a2660 by task normal/549
>
> CPU: 1 PID: 549 Comm: normal Not tainted 6.5.0-rc2-00073-ge599e16c16a1-dirty #292
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x36/0x50
>  print_report+0xcf/0x660
> [...]
>  kasan_report+0xc7/0x100
> [...]
>  folio_lock_anon_vma_read+0x9d/0x2f0
>  rmap_walk_anon+0x282/0x350
> [...]
>  folio_referenced+0x277/0x2a0
> [...]
>  shrink_folio_list+0xc9f/0x15c0
> [...]
>  reclaim_folio_list+0xdc/0x1f0
> [...]
>  reclaim_pages+0x211/0x280
> [...]
>  madvise_cold_or_pageout_pte_range+0x2ea/0x6a0
> [...]
>  walk_pgd_range+0x6c5/0xb90
> [...]
>  __walk_page_range+0x27f/0x290
> [...]
>  walk_page_range+0x1fd/0x230
> [...]
>  madvise_pageout+0x1cd/0x2d0
> [...]
>  do_madvise+0xb58/0x1280
> [...]
>  __x64_sys_madvise+0x62/0x70
>  do_syscall_64+0x3b/0x90
> [...]
>
>
>  mm/mmap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3eda23c9ebe7..3937479d0e07 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -615,6 +615,7 @@ static inline int dup_anon_vma(struct vm_area_struct *dst,
>          * anon pages imported.
>          */
>         if (src->anon_vma && !dst->anon_vma) {
> +               vma_start_write(dst);
>                 dst->anon_vma = src->anon_vma;
>                 return anon_vma_clone(dst, src);
>         }
>
> base-commit: e599e16c16a16be9907fb00608212df56d08d57b
> --
> 2.41.0.487.g6d72f3e995-goog
>
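For readers following the ordering argument without a kernel tree at hand, here is a minimal userspace sketch of the failure sequence described in the commit message. It is a single-threaded toy model, not kernel code: every `toy_*` name is hypothetical, the "concurrent" page fault is simulated by a direct call at the racy point, and the write lock is reduced to a boolean. It only illustrates why taking the write lock before publishing `->anon_vma` keeps a fault from recording an `anon_vma` that the VMA is later detached from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; these are NOT the real definitions. */
struct toy_anon_vma { int id; };

struct toy_vma {
	bool write_locked;                  /* models the per-VMA write lock */
	struct toy_anon_vma *anon_vma;      /* models vma->anon_vma */
	struct toy_anon_vma *page_anon_vma; /* anon_vma recorded by a faulted-in page */
};

/* Models a page fault handled under the per-VMA lock.
 * Returns true if a page was installed. */
static bool toy_fault_under_vma_lock(struct toy_vma *vma)
{
	if (vma->write_locked)
		return false; /* assumed: fault falls back to the mmap_lock path */
	if (!vma->anon_vma)
		return false; /* no anon_vma yet; also a fallback, not modelled */
	vma->page_anon_vma = vma->anon_vma;
	return true;
}

/* Models the `src->anon_vma && !dst->anon_vma` branch of dup_anon_vma().
 * lock_first selects the patched ordering; clone_fails models the
 * out-of-memory failure of anon_vma_clone(). */
static int toy_dup_anon_vma(struct toy_vma *dst, struct toy_vma *src,
			    bool lock_first, bool clone_fails)
{
	if (src->anon_vma && !dst->anon_vma) {
		if (lock_first)
			dst->write_locked = true;   /* stands in for vma_start_write(dst) */
		dst->anon_vma = src->anon_vma;
		(void)toy_fault_under_vma_lock(dst); /* simulated concurrent fault */
		if (clone_fails) {
			dst->anon_vma = NULL;       /* the revert on failure */
			return -1;
		}
	}
	return 0;
}

/* True if a page references an anon_vma that the VMA is not attached to. */
static bool toy_has_dangling_page(const struct toy_vma *vma)
{
	return vma->page_anon_vma && vma->page_anon_vma != vma->anon_vma;
}

/* Runs one failed merge and reports whether a dangling page was left behind. */
static bool toy_merge_leaves_dangling_page(bool lock_first)
{
	struct toy_anon_vma av = { .id = 1 };
	struct toy_vma src = { .anon_vma = &av };
	struct toy_vma dst = { 0 };

	assert(toy_dup_anon_vma(&dst, &src, lock_first, true) == -1);
	return toy_has_dangling_page(&dst);
}
```

Calling `toy_merge_leaves_dangling_page(false)` models the unpatched ordering and leaves a page pointing at an `anon_vma` the VMA no longer references; with `true` the simulated fault is turned away before it can record anything. The model never drops the write lock; in the kernel, as far as I understand it, that happens when the mmap_lock write side is released.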