Message ID | 20250219112519.92853-1-21cnbao@gmail.com (mailing list archive)
---|---
State | New
Series | [RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> userfaultfd_move() checks whether the PTE entry is present or a
> swap entry.
>
> - If the PTE entry is present, move_present_pte() handles folio
>   migration by setting:
>
>   src_folio->index = linear_page_index(dst_vma, dst_addr);
>
> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>   the PTE to the new dst_addr.
>
> This approach is incorrect because even if the PTE is a swap
> entry, it can still reference a folio that remains in the swap
> cache.
>
> If do_swap_page() is triggered, it may locate the folio in the
> swap cache. However, during add_rmap operations, a kernel panic
> can occur due to:
> page_pgoff(folio, page) != linear_page_index(vma, address)

Thanks for the report and reproducer!

>
> $./a.out > /dev/null
> [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> [ 13.337716] memcg:ffff00000405f000
> [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> [ 13.340190] ------------[ cut here ]------------
> [ 13.340316] kernel BUG at mm/rmap.c:1380!
> [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [ 13.340969] Modules linked in:
> [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> [ 13.341470] Hardware name: linux,dummy-virt (DT)
> [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> [ 13.342018] sp : ffff80008752bb20
> [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> [ 13.343876] Call trace:
> [ 13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> [ 13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> [ 13.344333]  do_swap_page+0x1060/0x1400
> [ 13.344417]  __handle_mm_fault+0x61c/0xbc8
> [ 13.344504]  handle_mm_fault+0xd8/0x2e8
> [ 13.344586]  do_page_fault+0x20c/0x770
> [ 13.344673]  do_translation_fault+0xb4/0xf0
> [ 13.344759]  do_mem_abort+0x48/0xa0
> [ 13.344842]  el0_da+0x58/0x130
> [ 13.344914]  el0t_64_sync_handler+0xc4/0x138
> [ 13.345002]  el0t_64_sync+0x1ac/0x1b0
> [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> [ 13.345504] ---[ end trace 0000000000000000 ]---
> [ 13.345715] note: a.out[107] exited with irqs disabled
> [ 13.345954] note: a.out[107] exited with preempt_count 2
>
> Fully fixing it would be quite complex, requiring similar handling
> of folios as done in move_present_pte.

How complex would that be? Is it a matter of adding
folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
folio->index = linear_page_index like in move_present_pte() or
something more?

> For now, a quick solution
> is to return -EBUSY.
> I'd like to see others' opinions on whether a full fix is worth
> pursuing.
>
> For anyone interested in reproducing it, the a.out test program is
> as below,
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <sys/syscall.h>
> #include <linux/userfaultfd.h>
> #include <fcntl.h>
> #include <pthread.h>
> #include <unistd.h>
> #include <poll.h>
> #include <errno.h>
>
> #define PAGE_SIZE 4096
> #define REGION_SIZE (512 * 1024)
>
> #ifndef UFFDIO_MOVE
> struct uffdio_move {
>         __u64 dst;
>         __u64 src;
>         __u64 len;
> #define UFFDIO_MOVE_MODE_DONTWAKE ((__u64)1<<0)
> #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
>         __u64 mode;
>         __s64 move;
> };
> #define _UFFDIO_MOVE (0x05)
> #define UFFDIO_MOVE _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> #endif
>
> void *src, *dst;
> int uffd;
>
> void *madvise_thread(void *arg) {
>         if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
>                 perror("madvise MADV_PAGEOUT");
>         }
>         return NULL;
> }
>
> void *fault_handler_thread(void *arg) {
>         struct uffd_msg msg;
>         struct uffdio_move move;
>         struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
>
>         pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
>         pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
>
>         while (1) {
>                 if (poll(&pollfd, 1, -1) == -1) {
>                         perror("poll");
>                         exit(EXIT_FAILURE);
>                 }
>
>                 if (read(uffd, &msg, sizeof(msg)) <= 0) {
>                         perror("read");
>                         exit(EXIT_FAILURE);
>                 }
>
>                 if (msg.event != UFFD_EVENT_PAGEFAULT) {
>                         fprintf(stderr, "Unexpected event\n");
>                         exit(EXIT_FAILURE);
>                 }
>
>                 move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
>                 move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
>                 move.len = PAGE_SIZE;
>                 move.mode = 0;
>
>                 if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
>                         perror("UFFDIO_MOVE");
>                         exit(EXIT_FAILURE);
>                 }
>         }
>         return NULL;
> }
>
> int main() {
> again:
>         pthread_t thr, madv_thr;
>         struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
>         struct uffdio_register uffdio_register;
>
>         src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (src == MAP_FAILED) {
>                 perror("mmap src");
>                 exit(EXIT_FAILURE);
>         }
>         memset(src, 1, REGION_SIZE);
>
>         dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (dst == MAP_FAILED) {
>                 perror("mmap dst");
>                 exit(EXIT_FAILURE);
>         }
>
>         uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>         if (uffd == -1) {
>                 perror("userfaultfd");
>                 exit(EXIT_FAILURE);
>         }
>
>         if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
>                 perror("UFFDIO_API");
>                 exit(EXIT_FAILURE);
>         }
>
>         uffdio_register.range.start = (unsigned long)dst;
>         uffdio_register.range.len = REGION_SIZE;
>         uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
>
>         if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
>                 perror("UFFDIO_REGISTER");
>                 exit(EXIT_FAILURE);
>         }
>
>         if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
>                 perror("pthread_create madvise_thread");
>                 exit(EXIT_FAILURE);
>         }
>
>         if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
>                 perror("pthread_create fault_handler_thread");
>                 exit(EXIT_FAILURE);
>         }
>
>         for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
>                 char val = ((char *)dst)[i];
>                 printf("Accessing dst at offset %zu, value: %d\n", i, val);
>         }
>
>         pthread_join(madv_thr, NULL);
>         pthread_cancel(thr);
>         pthread_join(thr, NULL);
>
>         munmap(src, REGION_SIZE);
>         munmap(dst, REGION_SIZE);
>         close(uffd);
>         goto again;
>         return 0;
> }
>
> As long as you enable mTHP (which likely increases the residency
> time of swapcache), you can reproduce the issue within a few
> seconds. But I guess the same race condition also exists with
> small folios.
>
> Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Brian Geffon <bgeffon@google.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> Cc: Nicolas Geoffray <ngeoffray@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/userfaultfd.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 867898c4e30b..34cf1c8c725d 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> +#include "swap.h"
>
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
>  {
> +       struct folio *folio;
> +       swp_entry_t entry;
> +
>         if (!pte_swp_exclusive(orig_src_pte))
>                 return -EBUSY;
>

Would be helpful to add a comment explaining that this is the case
when the folio is in the swap cache.

> +       entry = pte_to_swp_entry(orig_src_pte);
> +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> +       if (!IS_ERR(folio)) {
> +               folio_put(folio);
> +               return -EBUSY;
> +       }
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> --
> 2.39.3 (Apple Git-146)
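For context, this is roughly how the new hunk might read with the comment
Suren asks for; the comment wording below is an assumption about intent,
while the code itself mirrors the quoted diff:

	entry = pte_to_swp_entry(orig_src_pte);
	/*
	 * Even an exclusive swap PTE may still be backed by a folio in
	 * the swap cache. That folio's ->index still reflects src_addr,
	 * so blindly copying the PTE to dst_addr would let a later
	 * do_swap_page() map a page whose page_pgoff() disagrees with
	 * linear_page_index(), tripping the VM_BUG_ON in rmap. Back off
	 * and let userspace retry.
	 */
	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		folio_put(folio);
		return -EBUSY;
	}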
On 19.02.25 19:26, Suren Baghdasaryan wrote:
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> [... patch description, BUG report and reproducer snipped; see above ...]
>>
>> Fully fixing it would be quite complex, requiring similar handling
>> of folios as done in move_present_pte.
>
> How complex would that be? Is it a matter of adding
> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> folio->index = linear_page_index like in move_present_pte() or
> something more?

If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
be pinned and we may be able to move it I think.

So all that's required is to check pte_swp_exclusive() and the folio size.

... in theory :) Not sure about the swap details.
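To make the order-0 idea concrete: a rough, untested sketch of what moving
an exclusive, small swapcache folio might involve, modeled on the
move_present_pte() steps Suren lists. It assumes dst_vma/dst_addr are
plumbed into move_swap_pte() (they are not in the current signature) and
it glosses over the anon_vma locking that the present-page path performs:

	entry = pte_to_swp_entry(orig_src_pte);
	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		/* Large folios would need a split first; punt on those. */
		if (folio_test_large(folio) || !folio_trylock(folio)) {
			folio_put(folio);
			return -EBUSY;
		}
		/*
		 * pte_swp_exclusive() plus order-0 means no GUP pin can
		 * exist, so retarget the rmap the way move_present_pte()
		 * does before the PTE is copied over.
		 */
		folio_move_anon_rmap(folio, dst_vma);
		folio->index = linear_page_index(dst_vma, dst_addr);
		folio_unlock(folio);
		folio_put(folio);
	}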
On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> [... patch description and BUG report snipped; see above ...]
>
> Fully fixing it would be quite complex, requiring similar handling
> of folios as done in move_present_pte. For now, a quick solution
> is to return -EBUSY.
> I'd like to see others' opinions on whether a full fix is worth
> pursuing.

Thanks a lot for finding this.

As a user of MOVE ioctl (in Android GC) I strongly urge you to fix
this properly. Because this is not going to be a rare occurrence in
the case of Android. And when -EBUSY is returned, all that userspace
can do is touch the page, which also does not guarantee that a
subsequent retry of the ioctl will succeed.
> [... reproducer, tags and diff snipped; see above ...]
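For reference, the "touch the page" workaround Lokesh mentions would look
roughly like this on the userspace side -- a hedged sketch only: the read
fault brings the page back from swap so the next attempt sees a present
PTE, but nothing prevents it from being paged out again before the ioctl
runs, hence no guarantee of success:

	while (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
		if (errno != EBUSY) {
			perror("UFFDIO_MOVE");
			break;
		}
		/* Fault the source page back in, then retry the move. */
		(void)*(volatile char *)(unsigned long)move.src;
	}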
On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.25 19:26, Suren Baghdasaryan wrote:
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> [... patch description and BUG report snipped; see above ...]
> >>
> >> Fully fixing it would be quite complex, requiring similar handling
> >> of folios as done in move_present_pte.
> >
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
>
> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
> be pinned and we may be able to move it I think.
>
> So all that's required is to check pte_swp_exclusive() and the folio size.
>
> ... in theory :) Not sure about the swap details.

Looking some more into it, I think we would have to perform all the
folio and anon_vma locking and pinning that we do for present pages in
move_pages_pte(). If that's correct then maybe treating swapcache pages
like a present page inside move_pages_pte() would be simpler?

> --
> Cheers,
>
> David / dhildenb
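A sketch of the control flow Suren is proposing, purely illustrative and
untested; the structure of move_pages_pte() is simplified here and the
fall-through into the present-folio handling (label name included) is an
assumption:

	if (!pte_present(orig_src_pte)) {
		entry = pte_to_swp_entry(orig_src_pte);
		folio = filemap_get_folio(swap_address_space(entry),
					  swap_cache_index(entry));
		if (!IS_ERR(folio)) {
			/*
			 * Swapcache-backed: take the same folio lock,
			 * anon_vma lock and folio_maybe_dma_pinned()
			 * checks as the present-PTE path, fix up
			 * folio->mapping/->index, then move the swap PTE.
			 */
			src_folio = folio;
			goto handle_like_present;	/* hypothetical label */
		}
		/* No swapcache folio: plain swap PTE move, as today. */
	}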
On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > [... patch description and BUG report snipped; see above ...]
> >
> > Fully fixing it would be quite complex, requiring similar handling
> > of folios as done in move_present_pte.
>
> How complex would that be? Is it a matter of adding
> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> folio->index = linear_page_index like in move_present_pte() or
> something more?

My main concern is still with large folios that require a split_folio()
during move_pages(), as the entire folio shares the same index and
anon_vma. However, userfaultfd_move() moves pages individually, making
a split necessary.

In split_huge_page_to_list_to_order(), there is a:

        if (folio_test_writeback(folio))
                return -EBUSY;

This is likely true for swapcache, right? Yet even for
move_present_pte(), it simply returns -EBUSY:

move_pages_pte()
{
        /* at this point we have src_folio locked */
        if (folio_test_large(src_folio)) {
                /* split_folio() can block */
                pte_unmap(&orig_src_pte);
                pte_unmap(&orig_dst_pte);
                src_pte = dst_pte = NULL;
                err = split_folio(src_folio);
                if (err)
                        goto out;
                /* have to reacquire the folio after it got split */
                folio_unlock(src_folio);
                folio_put(src_folio);
                src_folio = NULL;
                goto retry;
        }
}

Do we need a folio_wait_writeback() before calling split_folio()?
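If the answer is yes, the change Barry is asking about might be as small
as the following -- an untested assumption layered on the quoted snippet;
it relies on src_folio being locked at that point, which the existing
comment says it is:

	if (folio_test_large(src_folio)) {
		/* split_folio() can block */
		pte_unmap(&orig_src_pte);
		pte_unmap(&orig_dst_pte);
		src_pte = dst_pte = NULL;
		/*
		 * Assumed addition: a swapcache folio may still be under
		 * writeback, and split_huge_page_to_list_to_order()
		 * refuses those with -EBUSY, so wait for writeback to
		 * finish before attempting the split.
		 */
		folio_wait_writeback(src_folio);
		err = split_folio(src_folio);
		/* ... rest unchanged ... */
	}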
By the way, I have also reported that userfaultfd_move() has a
fundamental conflict with TAO (Cc'ed Yu Zhao), which has been part of
the Android common kernel. In this scenario, folios in the virtual zone
won’t be split in split_folio(). Instead, the large folio migrates into
nr_pages small folios. Thus, the best-case scenario would be:

mTHP -> migrate to small folios in split_folio() -> move small folios
to dst_addr

While this works, it negates the performance benefits of
userfaultfd_move(), as it introduces two PTE operations (migration in
split_folio() and move in userfaultfd_move() while retrying), nr_pages
memory allocations, and still requires one memcpy(). This could end up
performing even worse than userfaultfd_copy(), I guess. The worst-case
scenario would be failing to allocate small folios in split_folio();
then userfaultfd_move() might return -ENOMEM?

Given these issues, I strongly recommend that ART hold off on
upgrading to userfaultfd_move() until these problems are fully
understood and resolved. Otherwise, we’re in for a rough ride!

> [... remainder of the quoted patch and review comments snipped; see above ...]

Thanks
Barry
On Thu, Feb 20, 2025 at 7:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > [... patch description and BUG report snipped; see above ...]
> >
> > Fully fixing it would be quite complex, requiring similar handling
> > of folios as done in move_present_pte. For now, a quick solution
> > is to return -EBUSY.
> > I'd like to see others' opinions on whether a full fix is worth
> > pursuing.
>
> Thanks a lot for finding this.
>
> As a user of MOVE ioctl (in Android GC) I strongly urge you to fix
> this properly. Because this is not going to be a rare occurrence in
> the case of Android. And when -EBUSY is returned, all that userspace
> can do is touch the page, which also does not guarantee that a
> subsequent retry of the ioctl will succeed.

Not trying to push this idea, but I’m curious if it's feasible:

If UFFDIO_MOVE fails, could userspace fall back to UFFDIO_COPY?
I’m still trying to wrap my head around a few things, particularly
what exactly UFFDIO_MOVE is doing with mTHP, as I mentioned in my
reply to Suren and you in another email:

https://lore.kernel.org/linux-mm/CAGsJ_4yx1=jaQmDG_9rMqHFFkoXqMJw941eYvtby28OqDq+S7g@mail.gmail.com/

> [... reproducer, tags and diff snipped; see above ...]

Thanks
Barry
On Wed, Feb 19, 2025 at 12:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 7:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > [...]
> > >
> > > Fully fixing it would be quite complex, requiring similar handling
> > > of folios as done in move_present_pte. For now, a quick solution
> > > is to return -EBUSY.
> > > I'd like to see others' opinions on whether a full fix is worth
> > > pursuing.
> >
> > Thanks a lot for finding this.
> >
> > As a user of MOVE ioctl (in Android GC) I strongly urge you to fix
> > this properly. Because this is not going to be a rare occurrence in
> > the case of Android. And when -EBUSY is returned, all that userspace
> > can do is touch the page, which also does not guarantee that a
> > subsequent retry of the ioctl will succeed.
>
> Not trying to push this idea, but I’m curious if it's feasible:
>
> If UFFDIO_MOVE fails, could userspace fall back to UFFDIO_COPY?

It's possible! But it wouldn't be rare to find such pages and falling
back to COPY so many times would mellow down the benefits of using MOVE
quite a bit.
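For what it's worth, such a fallback is a small change to the
reproducer's fault handler. An untested sketch (it reuses the uffd, src
and dst globals and the PAGE_SIZE constant from the test program;
struct uffdio_copy comes from linux/userfaultfd.h):

	/* Try UFFDIO_MOVE first; degrade to UFFDIO_COPY on EBUSY. */
	static void resolve_fault(unsigned long fault_addr)
	{
		unsigned long aligned = fault_addr & ~(PAGE_SIZE - 1);
		struct uffdio_move move = {
			.dst = aligned,
			.src = (unsigned long)src + (aligned - (unsigned long)dst),
			.len = PAGE_SIZE,
			.mode = 0,
		};

		if (ioctl(uffd, UFFDIO_MOVE, &move) != -1)
			return;
		if (errno != EBUSY) {
			perror("UFFDIO_MOVE");
			exit(EXIT_FAILURE);
		}

		/* Source page is busy (e.g. still in the swap cache): copy it. */
		struct uffdio_copy copy = {
			.dst = move.dst,
			.src = move.src,
			.len = PAGE_SIZE,
			.mode = 0,
		};
		if (ioctl(uffd, UFFDIO_COPY, &copy) == -1) {
			perror("UFFDIO_COPY");
			exit(EXIT_FAILURE);
		}
	}

Note that UFFDIO_COPY leaves the source page in place, so the caller
still has to unmap or reclaim it separately — on top of the extra copy
itself, which is the overhead being weighed here.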
> I’m still trying to wrap my head around a few things, particularly
> what exactly UFFDIO_MOVE is doing with mTHP, as I mentioned in my
> reply to Suren and you in another email:
>
> https://lore.kernel.org/linux-mm/CAGsJ_4yx1=jaQmDG_9rMqHFFkoXqMJw941eYvtby28OqDq+S7g@mail.gmail.com/
>
> > > [...]
>
> Thanks
> Barry
On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
>
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
>
> However, in split_huge_page_to_list_to_order(), there is a:
>
>         if (folio_test_writeback(folio))
>                 return -EBUSY;
>
> This is likely true for swapcache, right?

I don't see why? When they get moved to the swap cache, yes, they're
immediately written back, but after being swapped back in, they stay in
the swap cache, so they don't have to be moved back to the swap cache.
Right?
On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > [...]
> > >
> > > Fully fixing it would be quite complex, requiring similar handling
> > > of folios as done in move_present_pte.
> >
> > Thanks for the report and reproducer!
> >
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
>
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
>
> However, in split_huge_page_to_list_to_order(), there is a:
>
>         if (folio_test_writeback(folio))
>                 return -EBUSY;
>
> This is likely true for swapcache, right? However, even for move_present_pte(),
> it simply returns -EBUSY:
>
> move_pages_pte()
> {
>         /* at this point we have src_folio locked */
>         if (folio_test_large(src_folio)) {
>                 /* split_folio() can block */
>                 pte_unmap(&orig_src_pte);
>                 pte_unmap(&orig_dst_pte);
>                 src_pte = dst_pte = NULL;
>                 err = split_folio(src_folio);
>                 if (err)
>                         goto out;
>
>                 /* have to reacquire the folio after it got split */
>                 folio_unlock(src_folio);
>                 folio_put(src_folio);
>                 src_folio = NULL;
>                 goto retry;
>         }
> }
>
> Do we need a folio_wait_writeback() before calling split_folio()?
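For concreteness, the folio_wait_writeback() idea in that last question
would slot into the retry block quoted above roughly as follows — an
untested sketch, not a proposal (folio_wait_writeback() requires the
folio lock, which is held at this point):

	/* at this point we have src_folio locked */
	if (folio_test_large(src_folio)) {
		/* split_folio() can block */
		pte_unmap(&orig_src_pte);
		pte_unmap(&orig_dst_pte);
		src_pte = dst_pte = NULL;
		/*
		 * Sketch: wait out any in-flight writeback first, so that
		 * split_huge_page_to_list_to_order() does not bail out
		 * with -EBUSY on a folio still under writeback.
		 */
		folio_wait_writeback(src_folio);
		err = split_folio(src_folio);
		if (err)
			goto out;

		/* have to reacquire the folio after it got split */
		folio_unlock(src_folio);
		folio_put(src_folio);
		src_folio = NULL;
		goto retry;
	}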
> By the way, I have also reported that userfaultfd_move() has a fundamental
> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> kernel. In this scenario, folios in the virtual zone won’t be split in
> split_folio(). Instead, the large folio migrates into nr_pages small folios.
>
> Thus, the best-case scenario would be:
>
> mTHP -> migrate to small folios in split_folio() -> move small folios to
> dst_addr
>
> While this works, it negates the performance benefits of
> userfaultfd_move(), as it introduces two PTE operations (migration in
> split_folio() and move in userfaultfd_move() on retry), nr_pages memory
> allocations, and still requires one memcpy(). This could end up
> performing even worse than userfaultfd_copy(), I guess.
>
> The worst-case scenario would be failing to allocate small folios in
> split_folio(), then userfaultfd_move() might return -ENOMEM?
>
> Given these issues, I strongly recommend that ART hold off on upgrading
> to userfaultfd_move() until these problems are fully understood and
> resolved. Otherwise, we’re in for a rough ride!

At the moment, ART GC doesn't take mTHP into consideration. We don't
try to be careful in userspace to be large-page aligned or anything.
Also, the MOVE ioctl implementation works either on huge-pages or on
normal pages. IIUC, it can't handle mTHP large pages as a whole. But
that's true for other userfaultfd ioctls as well. If we were to
continue using COPY, it's not that it's in any way more friendly to
mTHP than MOVE. In fact, that's one of the reasons I'm considering
making the ART heap NO_HUGEPAGE to avoid the need for folio-split
entirely.

Furthermore, there are a few cases in which the COPY ioctl's overhead
just doesn't make sense for ART GC. So starting to use the MOVE ioctl
is the right thing to do.

What we need eventually to gain mTHP benefits is both the MOVE ioctl
supporting large-page migration and GC code in userspace working with
mTHP in mind.

> > > [...]
> > >
> > >         if (!pte_swp_exclusive(orig_src_pte))
> > >                 return -EBUSY;
> >
> > Would be helpful to add a comment explaining that this is the case
> > when the folio is in the swap cache.
> >
> > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > +       if (!IS_ERR(folio)) {
> > > +               folio_put(folio);
> > > +               return -EBUSY;
> > > +       }
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > > [...]
>
> Thanks
> Barry
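For reference, the hunk with comments along the lines Suren asks for
might read as follows — the wording is illustrative only, not from any
posted patch:

	/*
	 * A non-exclusive swap entry may still be shared with another
	 * mapping; we don't handle the rmap side of that here, so bail.
	 */
	if (!pte_swp_exclusive(orig_src_pte))
		return -EBUSY;

	/*
	 * The swap entry can still have a folio in the swap cache. Moving
	 * only the PTE would leave folio->index pointing at the source
	 * address, and a later swap-in via do_swap_page() would then hit
	 * VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index())
	 * in rmap. Until that is handled like move_present_pte() handles
	 * it, back off and let userspace retry.
	 */
	entry = pte_to_swp_entry(orig_src_pte);
	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		folio_put(folio);
		return -EBUSY;
	}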
On Thu, Feb 20, 2025 at 9:57 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > How complex would that be? Is it a matter of adding
> > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > folio->index = linear_page_index like in move_present_pte() or
> > > something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >         if (folio_test_writeback(folio))
> >                 return -EBUSY;
> >
> > This is likely true for swapcache, right?
>
> I don't see why? When they get moved to the swap cache, yes, they're
> immediately written back, but after being swapped back in, they stay in
> the swap cache, so they don't have to be moved back to the swap cache.
> Right?

I don’t quite understand your question. The issue we’re discussing is
that the folio is in the swap cache. Right now, we’re encountering a
kernel crash because we haven’t fixed the folio’s index. If we want to
address that, we need to perform a split_folio() for mTHP.

Since we’re already dealing with the swap cache, we’re likely in a
situation where we’re doing writeback (pageout), considering Android
uses sync zram. So, if swapcache is true, writeback is probably true
as well.

The race occurs after we call add_to_swap() and try_to_unmap(), and
before we complete the writeback of the page. (Swapcache will be
cleared for the sync device once the writeback is finished.)

Thanks
Barry
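In code terms: on a SWP_SYNCHRONOUS_IO device like zram, a swap PTE
whose folio is still in the swap cache is normally also still under
writeback, so the two tests below tend to be true together —
illustrative fragment only, assuming folio is the swap-cache folio for
the entry:

	/*
	 * On a sync swap device the folio typically sits in the swap
	 * cache only between pageout() starting and writeback
	 * completing, after which the swap cache entry is dropped.
	 */
	if (folio_test_swapcache(folio) && folio_test_writeback(folio)) {
		/* split_huge_page_to_list_to_order() would return -EBUSY here */
	}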
On Thu, Feb 20, 2025 at 10:03 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > [...]
> >
> > Given these issues, I strongly recommend that ART hold off on upgrading
> > to userfaultfd_move() until these problems are fully understood and
> > resolved. Otherwise, we’re in for a rough ride!
>
> At the moment, ART GC doesn't take mTHP into consideration. We don't
> try to be careful in userspace to be large-page aligned or anything.
> Also, the MOVE ioctl implementation works either on huge-pages or on
> normal pages. IIUC, it can't handle mTHP large pages as a whole. But
> that's true for other userfaultfd ioctls as well. If we were to
> continue using COPY, it's not that it's in any way more friendly to
> mTHP than MOVE. In fact, that's one of the reasons I'm considering
> making the ART heap NO_HUGEPAGE to avoid the need for folio-split
> entirely.

Disabling mTHP is one way to avoid potential bugs. However, as long as
UFFDIO_MOVE is available, we can’t prevent others, aside from ART GC,
from using it, right? So, we still need to address these issues with mTHP.

If a trend-following Android app discovers the UFFDIO_MOVE API, it might
use it, and it may not necessarily know to disable hugepages. Doesn’t that
pose a risk?

> Furthermore, there are a few cases in which the COPY ioctl's overhead
> just doesn't make sense for ART GC. So starting to use the MOVE ioctl
> is the right thing to do.
>
> What we need eventually to gain mTHP benefits is both the MOVE ioctl
> supporting large-page migration and GC code in userspace working with
> mTHP in mind.
>
> [...]

Thanks
Barry
On Wed, Feb 19, 2025 at 1:26 PM Barry Song <21cnbao@gmail.com> wrote: > > On Thu, Feb 20, 2025 at 10:03 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a > > > > > swap entry. > > > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio > > > > > migration by setting: > > > > > > > > > > src_folio->index = linear_page_index(dst_vma, dst_addr); > > > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies > > > > > the PTE to the new dst_addr. > > > > > > > > > > This approach is incorrect because even if the PTE is a swap > > > > > entry, it can still reference a folio that remains in the swap > > > > > cache. > > > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the > > > > > swap cache. However, during add_rmap operations, a kernel panic > > > > > can occur due to: > > > > > page_pgoff(folio, page) != linear_page_index(vma, address) > > > > > > > > Thanks for the report and reproducer! > > > > > > > > > > > > > > $./a.out > /dev/null > > > > > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c > > > > > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 > > > > > [ 13.337716] memcg:ffff00000405f000 > > > > > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) > > > > > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 > > > > > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 > > > > > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) > > > > > [ 13.340190] ------------[ cut here ]------------ > > > > > [ 13.340316] kernel BUG at mm/rmap.c:1380! 
> > > > > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > > > > > [ 13.340969] Modules linked in: > > > > > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 > > > > > [ 13.341470] Hardware name: linux,dummy-virt (DT) > > > > > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > > > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 > > > > > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 > > > > > [ 13.342018] sp : ffff80008752bb20 > > > > > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 > > > > > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 > > > > > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 > > > > > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff > > > > > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f > > > > > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 > > > > > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 > > > > > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 > > > > > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 > > > > > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f > > > > > [ 13.343876] Call trace: > > > > > [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) > > > > > [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 > > > > > [ 13.344333] do_swap_page+0x1060/0x1400 > > > > > [ 13.344417] __handle_mm_fault+0x61c/0xbc8 > > > > > [ 13.344504] handle_mm_fault+0xd8/0x2e8 > > > > > [ 13.344586] do_page_fault+0x20c/0x770 > > > > > [ 13.344673] do_translation_fault+0xb4/0xf0 > > > > > [ 13.344759] do_mem_abort+0x48/0xa0 > > > > > [ 13.344842] el0_da+0x58/0x130 > > > > > [ 13.344914] el0t_64_sync_handler+0xc4/0x138 > > > > > [ 13.345002] el0t_64_sync+0x1ac/0x1b0 > > > > > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) > > > > > [ 13.345504] ---[ end trace 0000000000000000 ]--- > > > > > [ 13.345715] note: a.out[107] exited with irqs disabled > > > > > [ 13.345954] note: a.out[107] exited with preempt_count 2 > > > > > > > > > > Fully fixing it would be quite complex, requiring similar handling > > > > > of folios as done in move_present_pte. > > > > > > > > How complex would that be? Is it a matter of adding > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and > > > > folio->index = linear_page_index like in move_present_pte() or > > > > something more? > > > > > > My main concern is still with large folios that require a split_folio() > > > during move_pages(), as the entire folio shares the same index and > > > anon_vma. However, userfaultfd_move() moves pages individually, > > > making a split necessary. > > > > > > However, in split_huge_page_to_list_to_order(), there is a: > > > > > > if (folio_test_writeback(folio)) > > > return -EBUSY; > > > > > > This is likely true for swapcache, right? 
However, even for move_present_pte(), > > > it simply returns -EBUSY: > > > > > > move_pages_pte() > > > { > > > /* at this point we have src_folio locked */ > > > if (folio_test_large(src_folio)) { > > > /* split_folio() can block */ > > > pte_unmap(&orig_src_pte); > > > pte_unmap(&orig_dst_pte); > > > src_pte = dst_pte = NULL; > > > err = split_folio(src_folio); > > > if (err) > > > goto out; > > > > > > /* have to reacquire the folio after it got split */ > > > folio_unlock(src_folio); > > > folio_put(src_folio); > > > src_folio = NULL; > > > goto retry; > > > } > > > } > > > > > > Do we need a folio_wait_writeback() before calling split_folio()? > > > > > > By the way, I have also reported that userfaultfd_move() has a fundamental > > > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common > > > kernel. In this scenario, folios in the virtual zone won’t be split in > > > split_folio(). Instead, the large folio migrates into nr_pages small folios. > > > > > > Thus, the best-case scenario would be: > > > > > > mTHP -> migrate to small folios in split_folio() -> move small folios to > > > dst_addr > > > > > > While this works, it negates the performance benefits of > > > userfaultfd_move(), as it introduces two PTE operations (migration in > > > split_folio() and move in userfaultfd_move() while retry), nr_pages memory > > > allocations, and still requires one memcpy(). This could end up > > > performing even worse than userfaultfd_copy(), I guess. > > > > > > The worst-case scenario would be failing to allocate small folios in > > > split_folio(), then userfaultfd_move() might return -ENOMEM? > > > > > > Given these issues, I strongly recommend that ART hold off on upgrading > > > to userfaultfd_move() until these problems are fully understood and > > > resolved. Otherwise, we’re in for a rough ride! > > > > At the moment, ART GC doesn't work taking mTHP into consideration. We > > don't try to be careful in userspace to be large-page aligned or > > anything. Also, the MOVE ioctl implementation works either on > > huge-pages or on normal pages. IIUC, it can't handle mTHP large pages > > as a whole. But that's true for other userfaultfd ioctls as well. If > > we were to continue using COPY, it's not that it's in any way more > > friendly to mTHP than MOVE. In fact, that's one of the reasons I'm > > considering making the ART heap NO_HUGEPAGE to avoid the need for > > folio-split entirely. > > Disabling mTHP is one way to avoid potential bugs. However, as long as > UFFDIO_MOVE is available, we can’t prevent others, aside from ART GC, > from using it, right? So, we still need to address these issues with mTHP. > > If a trend-following Android app discovers the UFFDIO_MOVE API, it might > use it, and it may not necessarily know to disable hugepages. Doesn’t that > pose a risk? > I absolutely agree that these issues need to be addressed. Particularly the correctness bugs must be resolved at the earliest possible. I was just trying to answer your question as to why we want to use it, now that it is available, instead of continuing with COPY ioctl. As and when MOVE ioctl will start handling mTHP efficiently, I will make the required changes in the userspace to leverage mTHP benefits. > > > > Furthermore, there are few cases in which COPY ioctl's overhead just > > doesn't make sense for ART GC. So starting to use MOVE ioctl is the > > right thing to do. 
> > > > What we need eventually to gain mTHP benefits is both MOVE ioctl to > > support large-page migration as well as GC code in userspace to work > > with mTHP in mind. > > > > > > > > > > > > For now, a quick solution > > > > > is to return -EBUSY. > > > > > I'd like to see others' opinions on whether a full fix is worth > > > > > pursuing. > > > > > > > > > > For anyone interested in reproducing it, the a.out test program is > > > > > as below, > > > > > > > > > > #define _GNU_SOURCE > > > > > #include <stdio.h> > > > > > #include <stdlib.h> > > > > > #include <string.h> > > > > > #include <sys/mman.h> > > > > > #include <sys/ioctl.h> > > > > > #include <sys/syscall.h> > > > > > #include <linux/userfaultfd.h> > > > > > #include <fcntl.h> > > > > > #include <pthread.h> > > > > > #include <unistd.h> > > > > > #include <poll.h> > > > > > #include <errno.h> > > > > > > > > > > #define PAGE_SIZE 4096 > > > > > #define REGION_SIZE (512 * 1024) > > > > > > > > > > #ifndef UFFDIO_MOVE > > > > > struct uffdio_move { > > > > > __u64 dst; > > > > > __u64 src; > > > > > __u64 len; > > > > > #define UFFDIO_MOVE_MODE_DONTWAKE ((__u64)1<<0) > > > > > #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1) > > > > > __u64 mode; > > > > > __s64 move; > > > > > }; > > > > > #define _UFFDIO_MOVE (0x05) > > > > > #define UFFDIO_MOVE _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move) > > > > > #endif > > > > > > > > > > void *src, *dst; > > > > > int uffd; > > > > > > > > > > void *madvise_thread(void *arg) { > > > > > if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) { > > > > > perror("madvise MADV_PAGEOUT"); > > > > > } > > > > > return NULL; > > > > > } > > > > > > > > > > void *fault_handler_thread(void *arg) { > > > > > struct uffd_msg msg; > > > > > struct uffdio_move move; > > > > > struct pollfd pollfd = { .fd = uffd, .events = POLLIN }; > > > > > > > > > > pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL); > > > > > pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL); > > > > > > > > > > while (1) { > > > > > if (poll(&pollfd, 1, -1) == -1) { > > > > > perror("poll"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > if (read(uffd, &msg, sizeof(msg)) <= 0) { > > > > > perror("read"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > if (msg.event != UFFD_EVENT_PAGEFAULT) { > > > > > fprintf(stderr, "Unexpected event\n"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst); > > > > > move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1); > > > > > move.len = PAGE_SIZE; > > > > > move.mode = 0; > > > > > > > > > > if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) { > > > > > perror("UFFDIO_MOVE"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > } > > > > > return NULL; > > > > > } > > > > > > > > > > int main() { > > > > > again: > > > > > pthread_t thr, madv_thr; > > > > > struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 }; > > > > > struct uffdio_register uffdio_register; > > > > > > > > > > src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > > > > if (src == MAP_FAILED) { > > > > > perror("mmap src"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > memset(src, 1, REGION_SIZE); > > > > > > > > > > dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > > > > if (dst == MAP_FAILED) { > > > > > perror("mmap dst"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > uffd = 
syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK); > > > > > if (uffd == -1) { > > > > > perror("userfaultfd"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) { > > > > > perror("UFFDIO_API"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > uffdio_register.range.start = (unsigned long)dst; > > > > > uffdio_register.range.len = REGION_SIZE; > > > > > uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; > > > > > > > > > > if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) { > > > > > perror("UFFDIO_REGISTER"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) { > > > > > perror("pthread_create madvise_thread"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) { > > > > > perror("pthread_create fault_handler_thread"); > > > > > exit(EXIT_FAILURE); > > > > > } > > > > > > > > > > for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) { > > > > > char val = ((char *)dst)[i]; > > > > > printf("Accessing dst at offset %zu, value: %d\n", i, val); > > > > > } > > > > > > > > > > pthread_join(madv_thr, NULL); > > > > > pthread_cancel(thr); > > > > > pthread_join(thr, NULL); > > > > > > > > > > munmap(src, REGION_SIZE); > > > > > munmap(dst, REGION_SIZE); > > > > > close(uffd); > > > > > goto again; > > > > > return 0; > > > > > } > > > > > > > > > > As long as you enable mTHP (which likely increases the residency > > > > > time of swapcache), you can reproduce the issue within a few > > > > > seconds. But I guess the same race condition also exists with > > > > > small folios. > > > > > > > > > > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI") > > > > > Cc: Andrea Arcangeli <aarcange@redhat.com> > > > > > Cc: Suren Baghdasaryan <surenb@google.com> > > > > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > > > > Cc: Axel Rasmussen <axelrasmussen@google.com> > > > > > Cc: Brian Geffon <bgeffon@google.com> > > > > > Cc: Christian Brauner <brauner@kernel.org> > > > > > Cc: David Hildenbrand <david@redhat.com> > > > > > Cc: Hugh Dickins <hughd@google.com> > > > > > Cc: Jann Horn <jannh@google.com> > > > > > Cc: Kalesh Singh <kaleshsingh@google.com> > > > > > Cc: Liam R. 
> > > > > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > > > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > > > > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > > Cc: Shuah Khan <shuah@kernel.org>
> > > > > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > ---
> > > > >  mm/userfaultfd.c | 11 +++++++++++
> > > > >  1 file changed, 11 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index 867898c4e30b..34cf1c8c725d 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -18,6 +18,7 @@
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > +#include "swap.h"
> > > > >
> > > > >  static __always_inline
> > > > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > > > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> > > > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > > > >  {
> > > > > +       struct folio *folio;
> > > > > +       swp_entry_t entry;
> > > > > +
> > > > >         if (!pte_swp_exclusive(orig_src_pte))
> > > > >                 return -EBUSY;
> > > >
> > > > Would be helpful to add a comment explaining that this is the case
> > > > when the folio is in the swap cache.
> > > >
> > > > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > > > +       if (!IS_ERR(folio)) {
> > > > > +               folio_put(folio);
> > > > > +               return -EBUSY;
> > > > > +       }
> > > > > +
> > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > --
> > > > > 2.39.3 (Apple Git-146)

Thanks
Barry
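The checks being discussed, with the requested comment spelled out, might read as follows; this is an illustrative sketch only, not the final wording of the patch:

	/* A non-exclusive swap entry may be shared, e.g. after fork(); bail out. */
	if (!pte_swp_exclusive(orig_src_pte))
		return -EBUSY;

	entry = pte_to_swp_entry(orig_src_pte);
	/*
	 * The swap entry may still point at a folio in the swap cache;
	 * moving only the PTE would leave folio->index and the anon_vma
	 * stale, so back off and let userspace retry.
	 */
	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		folio_put(folio);
		return -EBUSY;
	}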
On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > [... commit message and panic log snipped; quoted in full above ...]
> > > Fully fixing it would be quite complex, requiring similar handling
> > > of folios as done in move_present_pte.
> >
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
>
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
>
> However, in split_huge_page_to_list_to_order(), there is a:
>
> if (folio_test_writeback(folio))
>         return -EBUSY;
>
> This is likely true for swapcache, right? However, even for move_present_pte(),
> it simply returns -EBUSY:
>
> move_pages_pte()
> {
>         /* at this point we have src_folio locked */
>         if (folio_test_large(src_folio)) {
>                 /* split_folio() can block */
>                 pte_unmap(&orig_src_pte);
>                 pte_unmap(&orig_dst_pte);
>                 src_pte = dst_pte = NULL;
>                 err = split_folio(src_folio);
>                 if (err)
>                         goto out;
>
>                 /* have to reacquire the folio after it got split */
>                 folio_unlock(src_folio);
>                 folio_put(src_folio);
>                 src_folio = NULL;
>                 goto retry;
>         }
> }
>
> Do we need a folio_wait_writeback() before calling split_folio()?

Maybe no need in the first version to fix the immediate bug?
It's also not always the case that we hit writeback here. IIUC, writeback
only happens for a short window when the folio was just added into
swapcache. MOVE can happen much later after that, anytime before a swapin.
My understanding is that's also what Matthew wanted to point out. It may
be better to justify that in a separate change with some performance
measurements.

Thanks,
On Thu, Feb 20, 2025 at 12:25:19AM +1300, Barry Song wrote:
> @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
>  {
> +       struct folio *folio;
> +       swp_entry_t entry;
> +
>         if (!pte_swp_exclusive(orig_src_pte))
>                 return -EBUSY;
>
> +       entry = pte_to_swp_entry(orig_src_pte);
> +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));

[Besides what's being discussed elsewhere..]

swap_cache_get_folio() says:

 * Caller must lock the swap device or hold a reference to keep it valid.

Do we need get_swap_device() too here to avoid swapoff race?

> +       if (!IS_ERR(folio)) {
> +               folio_put(folio);
> +               return -EBUSY;
> +       }
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> --
> 2.39.3 (Apple Git-146)
>
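A guard against the swapoff race Peter raises could look roughly like the following, assuming the lookup stays in move_swap_pte(); the get_swap_device()/put_swap_device() pair pins the swap device for the duration of the lookup:

	struct swap_info_struct *si;

	entry = pte_to_swp_entry(orig_src_pte);
	/* Keep the swap device alive across the swapcache lookup */
	si = get_swap_device(entry);
	if (unlikely(!si))
		return -EAGAIN;	/* raced with swapoff; let the caller retry */

	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	put_swap_device(si);
	if (!IS_ERR(folio)) {
		folio_put(folio);
		return -EBUSY;
	}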
On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > [... commit message and panic log snipped; quoted in full above ...]
> > > > Fully fixing it would be quite complex, requiring similar handling
> > > > of folios as done in move_present_pte.
> > >
> > > How complex would that be? Is it a matter of adding
> > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > folio->index = linear_page_index like in move_present_pte() or
> > > something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> > if (folio_test_writeback(folio))
> >         return -EBUSY;
> >
> > This is likely true for swapcache, right?
> > However, even for move_present_pte(), it simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> >         /* at this point we have src_folio locked */
> >         if (folio_test_large(src_folio)) {
> >                 /* split_folio() can block */
> >                 pte_unmap(&orig_src_pte);
> >                 pte_unmap(&orig_dst_pte);
> >                 src_pte = dst_pte = NULL;
> >                 err = split_folio(src_folio);
> >                 if (err)
> >                         goto out;
> >
> >                 /* have to reacquire the folio after it got split */
> >                 folio_unlock(src_folio);
> >                 folio_put(src_folio);
> >                 src_folio = NULL;
> >                 goto retry;
> >         }
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
>
> Maybe no need in the first version to fix the immediate bug?
>
> It's also not always the case that we hit writeback here. IIUC, writeback
> only happens for a short window when the folio was just added into
> swapcache. MOVE can happen much later after that, anytime before a swapin.
> My understanding is that's also what Matthew wanted to point out. It may
> be better to justify that in a separate change with some performance
> measurements.

The bug we're discussing occurs precisely within the short window you
mentioned.

1. add_to_swap: The folio is added to swapcache.
2. try_to_unmap: PTEs are converted to swap entries.
3. pageout
4. Swapcache is cleared.

The issue happens between steps 2 and 4, where the PTE is not present, but
the folio is still in swapcache - the current code does move_swap_pte() but
does not fix up folio->index within swapcache.

My point is that if we want a proper fix for mTHP, we'd better handle
writeback. Otherwise, this isn't much different from directly returning
-EBUSY as proposed in this RFC.

For small folios, there's no split_folio issue, making it relatively
simpler. Lokesh mentioned plans to madvise NOHUGEPAGE in ART, so fixing
small folios is likely the first priority.

> Thanks,
>
> --
> Peter Xu

Thanks
Barry
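For concreteness, the writeback handling being asked about would amount to something like this in the split path of move_pages_pte(); whether the extra wait is worth it is exactly the open question here (untested sketch):

	/* at this point we have src_folio locked */
	if (folio_test_large(src_folio)) {
		/* split_folio() can block */
		pte_unmap(&orig_src_pte);
		pte_unmap(&orig_dst_pte);
		src_pte = dst_pte = NULL;
		/*
		 * split_folio() fails with -EBUSY while writeback is in
		 * flight, so wait for it to finish before splitting.
		 */
		folio_wait_writeback(src_folio);
		err = split_folio(src_folio);
		if (err)
			goto out;
	}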
On Wed, Feb 19, 2025 at 3:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > [... commit message and panic log snipped; quoted in full above ...]
> > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > of folios as done in move_present_pte.
> > > >
> > > > How complex would that be? Is it a matter of adding
> > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > folio->index = linear_page_index like in move_present_pte() or
> > > > something more?
> > >
> > > My main concern is still with large folios that require a split_folio()
> > > during move_pages(), as the entire folio shares the same index and
> > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > making a split necessary.
> > >
> > > [...]
> > >
> > > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > Maybe no need in the first version to fix the immediate bug?
> > It's also not always the case that we hit writeback here. [...]
>
> The bug we're discussing occurs precisely within the short window you
> mentioned.
>
> 1. add_to_swap: The folio is added to swapcache.
> 2. try_to_unmap: PTEs are converted to swap entries.
> 3. pageout
> 4. Swapcache is cleared.
>
> The issue happens between steps 2 and 4, where the PTE is not present, but
> the folio is still in swapcache - the current code does move_swap_pte() but
> does not fix up folio->index within swapcache.
>
> My point is that if we want a proper fix for mTHP, we'd better handle
> writeback. Otherwise, this isn't much different from directly returning
> -EBUSY as proposed in this RFC.
>
> For small folios, there's no split_folio issue, making it relatively
> simpler. Lokesh mentioned plans to madvise NOHUGEPAGE in ART, so fixing
> small folios is likely the first priority.

Fixing for the non-mTHP case first sounds good to me. For a large folio
in swap cache maybe you can return -EBUSY for now?

But for that, I believe the cleanest and simplest would be to restructure
move_pages_pte() such that the folio is retrieved from vm_normal_folio()
if src_pte is present and from filemap_get_folio() otherwise, and then
incorporate the check you have in this RFC there. This way the entire
locking dance logic in there can be reused for the swap-cache case as
well.
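The restructuring suggested here could take roughly this shape (a sketch only, with error handling and the later locking elided to show the idea):

	struct folio *folio;

	if (pte_present(orig_src_pte)) {
		folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
	} else {
		entry = pte_to_swp_entry(orig_src_pte);
		folio = filemap_get_folio(swap_address_space(entry),
					  swap_cache_index(entry));
		if (IS_ERR(folio))
			folio = NULL;	/* plain swap PTE, no swapcache folio */
	}
	/* from here on, the existing folio pin/lock dance can be shared */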
On Thu, Feb 20, 2025 at 12:19 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 3:04 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > [... earlier quotes and panic log snipped ...]
> > > > My main concern is still with large folios that require a split_folio()
> > > > during move_pages(), as the entire folio shares the same index and
> > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > making a split necessary.
> > [...]
> >
> > For small folios, there's no split_folio issue, making it relatively
> > simpler. Lokesh mentioned plans to madvise NOHUGEPAGE in ART, so fixing
> > small folios is likely the first priority.
>
> Fixing for the non-mTHP case first sounds good to me. For a large folio
> in swap cache maybe you can return -EBUSY for now?
>
> But for that, I believe the cleanest and simplest would be to restructure
> move_pages_pte() such that the folio is retrieved from vm_normal_folio()
> if src_pte is present and from filemap_get_folio() otherwise, and then
> incorporate the check you have in this RFC there. This way the entire
> locking dance logic in there can be reused for the swap-cache case as
> well.

Yep, let me give it a try in v2.
On Thu, Feb 20, 2025 at 11:31 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 12:25:19AM +1300, Barry Song wrote:
> > [... quoted hunk snipped; see Peter's reply above ...]
>
> [Besides what's being discussed elsewhere..]
>
> swap_cache_get_folio() says:
>
>  * Caller must lock the swap device or hold a reference to keep it valid.
>
> Do we need get_swap_device() too here to avoid swapoff race?

Yep, thanks! Let me fix it in v2.

Thanks
Barry
On 19.02.25 19:58, Suren Baghdasaryan wrote:
> On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.02.25 19:26, Suren Baghdasaryan wrote:
>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> [... commit message and panic log snipped; quoted in full above ...]
>>>> Fully fixing it would be quite complex, requiring similar handling
>>>> of folios as done in move_present_pte.
>>>
>>> How complex would that be? Is it a matter of adding
>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>> folio->index = linear_page_index like in move_present_pte() or
>>> something more?
>>
>> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
>> be pinned and we may be able to move it I think.
>>
>> So all that's required is to check pte_swp_exclusive() and the folio size.
>>
>> ... in theory :) Not sure about the swap details.
>
> Looking some more into it, I think we would have to perform all the
> folio and anon_vma locking and pinning that we do for present pages in
> move_pages_pte(). If that's correct then maybe treating swapcache
> pages like a present page inside move_pages_pte() would be simpler?

I'd be more in favor of not doing that. Maybe there are parts we can
move out into helper functions instead, so we can reuse them?
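Read one way, the factoring being suggested might look like the sketch below; the helper name and shape are purely hypothetical, and note that the real code keeps the reference and blocks on folio_lock() after unmapping the PTEs rather than dropping it:

	/*
	 * Hypothetical helper: pin the source folio under the src PTL,
	 * then try to lock it, shared by the present-PTE and swapcache
	 * paths.
	 */
	static struct folio *uffd_move_pin_and_trylock_folio(struct folio *folio,
							     spinlock_t *src_ptl)
	{
		/* the PTL guarantees the folio stays alive while we grab a ref */
		folio_get(folio);
		spin_unlock(src_ptl);
		if (!folio_trylock(folio)) {
			folio_put(folio);
			return NULL;	/* caller unmaps the PTEs and retries */
		}
		return folio;
	}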
On 19.02.25 21:37, Barry Song wrote:
> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>
>>> [... commit message and panic log snipped; quoted in full above ...]
>>> Fully fixing it would be quite complex, requiring similar handling
>>> of folios as done in move_present_pte.
>>
>> How complex would that be? Is it a matter of adding
>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>> folio->index = linear_page_index like in move_present_pte() or
>> something more?
>
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
>
> [...]
>
> Do we need a folio_wait_writeback() before calling split_folio()?
> By the way, I have also reported that userfaultfd_move() has a fundamental
> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> kernel. In this scenario, folios in the virtual zone won't be split in
> split_folio(). Instead, the large folio migrates into nr_pages small folios.
>
> Thus, the best-case scenario would be:
>
> mTHP -> migrate to small folios in split_folio() -> move small folios to
> dst_addr
>
> While this works, it negates the performance benefits of
> userfaultfd_move(), as it introduces two PTE operations (migration in
> split_folio() and the move in userfaultfd_move() on retry), nr_pages
> memory allocations, and still requires one memcpy(). This could end up
> performing even worse than userfaultfd_copy(), I guess.
>
> The worst-case scenario would be failing to allocate small folios in
> split_folio(), then userfaultfd_move() might return -ENOMEM?

Although that's an Android problem and not an upstream problem, I'll note
that there are other reasons why the split / move might fail, and user
space either must retry or fall back to a COPY.

Regarding mTHP, we could move the whole folio if the userspace-provided
range allows for batching over multiple PTEs (nr_ptes), they are in a
single VMA, and folio_mapcount() == nr_ptes. There are corner cases to
handle, such as, I assume, a moved mTHP suddenly crossing two page tables;
those are harder to handle when not moving individual PTEs, where that
cannot happen.
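Expressed as code, the batching condition described above would be roughly the following; src_folio and nr_ptes (the number of PTEs covered by the userspace-provided range within one VMA) are assumptions of this sketch:

	/* Move a large folio as a whole instead of splitting it, when safe */
	if (folio_test_large(src_folio) &&
	    nr_ptes == folio_nr_pages(src_folio) &&
	    folio_mapcount(src_folio) == nr_ptes) {
		/* all mappings belong to this range: whole-folio move */
	} else {
		/* fall back to split_folio() + per-PTE moves */
	}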
On Thu, Feb 20, 2025 at 9:40 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.25 19:58, Suren Baghdasaryan wrote:
> > On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.02.25 19:26, Suren Baghdasaryan wrote:
> >>> [... earlier quotes and panic log snipped ...]
> >> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
> >> be pinned and we may be able to move it I think.
> >>
> >> So all that's required is to check pte_swp_exclusive() and the folio size.
> >>
> >> ... in theory :) Not sure about the swap details.
> >
> > Looking some more into it, I think we would have to perform all the
> > folio and anon_vma locking and pinning that we do for present pages in
> > move_pages_pte(). If that's correct then maybe treating swapcache
> > pages like a present page inside move_pages_pte() would be simpler?
>
> I'd be more in favor of not doing that. Maybe there are parts we can
> move out into helper functions instead, so we can reuse them?

I actually have a v2 ready. Maybe we can discuss whether some of the code
can be extracted as a helper based on the below, before I send it formally?
I'd say there are many parts that can be shared with the present-PTE case,
but there are two major differences:
1. Page exclusivity: the swapcache case doesn't require it
   (try_to_unmap_one() has already removed the Exclusive flag);
2. src_anon_vma and its lock: the swapcache case doesn't require them
   (the folio is no longer mapped).

Subject: [PATCH v2 Discussing with David] mm: Fix kernel crash when userfaultfd_move encounters swapcache

userfaultfd_move() checks whether the PTE entry is present or a
swap entry.

- If the PTE entry is present, move_present_pte() handles folio
  migration by setting:

  src_folio->index = linear_page_index(dst_vma, dst_addr);

- If the PTE entry is a swap entry, move_swap_pte() simply copies
  the PTE to the new dst_addr.

This approach is incorrect because, even if the PTE is a swap entry,
it can still reference a folio that remains in the swap cache. This
exposes a race condition between steps 2 and 4 below:

1. add_to_swap: The folio is added to the swapcache.
2. try_to_unmap: PTEs are converted to swap entries.
3. pageout: The folio is written back.
4. Swapcache is cleared.

If userfaultfd_move() happens in the window between step 2 and step 4,
after the swap PTE is moved to the destination, accessing the
destination triggers do_swap_page(), which may locate the folio in the
swap cache. However, during add_rmap operations, a kernel panic can
occur due to:

page_pgoff(folio, page) != linear_page_index(vma, address)

This happens because move_swap_pte() has never updated the index to
match dst_vma and dst_addr.

$./a.out > /dev/null
[... panic log identical to the one in the original report above ...]
This patch also checks the swapcache when handling swap entries. If a
match is found in the swapcache, it processes it similarly to a present
PTE. However, there are some differences. For example, the folio is no
longer exclusive because folio_try_share_anon_rmap_pte() is performed
during unmapping. Furthermore, in the case of swapcache, the folio has
already been unmapped, eliminating the risk of concurrent rmap walks and
removing the need to acquire src_folio's anon_vma or lock.

Note that for large folios, in the swapcache handling path, we still
frequently encounter -EBUSY returns because split_folio() returns -EBUSY
when the folio is under writeback. That is not an urgent fix, so a
following patch will address it.

Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R.
Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- mm/userfaultfd.c | 228 +++++++++++++++++++++++++++-------------------- 1 file changed, 133 insertions(+), 95 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 867898c4e30b..e5718835a964 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -18,6 +18,7 @@ #include <asm/tlbflush.h> #include <asm/tlb.h> #include "internal.h" +#include "swap.h" static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) @@ -1025,7 +1026,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte, pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd)); } -static int move_present_pte(struct mm_struct *mm, +static int move_pte_and_folio(struct mm_struct *mm, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long dst_addr, unsigned long src_addr, @@ -1046,7 +1047,7 @@ static int move_present_pte(struct mm_struct *mm, } if (folio_test_large(src_folio) || folio_maybe_dma_pinned(src_folio) || - !PageAnonExclusive(&src_folio->page)) { + (pte_present(orig_src_pte) && !PageAnonExclusive(&src_folio->page))) { err = -EBUSY; goto out; } @@ -1062,10 +1063,13 @@ static int move_present_pte(struct mm_struct *mm, folio_move_anon_rmap(src_folio, dst_vma); src_folio->index = linear_page_index(dst_vma, dst_addr); - orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot); - /* Follow mremap() behavior and treat the entry dirty after the move */ - orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma); - + if (pte_present(orig_src_pte)) { + orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot); + /* Follow mremap() behavior and treat the entry dirty after the move */ + orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma); + } else { /* swap entry */ + orig_dst_pte = orig_src_pte; + } set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte); out: double_pt_unlock(dst_ptl, src_ptl); @@ -1079,9 +1083,6 @@ static int move_swap_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t dst_pmdval, spinlock_t *dst_ptl, spinlock_t *src_ptl) { - if (!pte_swp_exclusive(orig_src_pte)) - return -EBUSY; - double_pt_lock(dst_ptl, src_ptl); if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte, @@ -1137,6 +1138,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, __u64 mode) { swp_entry_t entry; + struct swap_info_struct *si = NULL; pte_t orig_src_pte, orig_dst_pte; pte_t src_folio_pte; spinlock_t *src_ptl, *dst_ptl; @@ -1220,122 +1222,156 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto out; } - if (pte_present(orig_src_pte)) { - if (is_zero_pfn(pte_pfn(orig_src_pte))) { - err = move_zeropage_pte(mm, dst_vma, src_vma, - dst_addr, src_addr, dst_pte, src_pte, - orig_dst_pte, orig_src_pte, - dst_pmd, dst_pmdval, dst_ptl, src_ptl); + if (!pte_present(orig_src_pte)) { + entry = pte_to_swp_entry(orig_src_pte); + if (is_migration_entry(entry)) { + pte_unmap(&orig_src_pte); + pte_unmap(&orig_dst_pte); + src_pte = dst_pte = NULL; + migration_entry_wait(mm, src_pmd, src_addr); + err = -EAGAIN; + 
goto out; + } + + if (non_swap_entry(entry)) { + err = -EFAULT; + goto out; + } + + if (!pte_swp_exclusive(orig_src_pte)) { + err = -EBUSY; + goto out; + } + /* Prevent swapoff from happening to us. */ + if (!si) + si = get_swap_device(entry); + if (unlikely(!si)) { + err = -EAGAIN; goto out; } + } + + if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte))) { + err = move_zeropage_pte(mm, dst_vma, src_vma, + dst_addr, src_addr, dst_pte, src_pte, + orig_dst_pte, orig_src_pte, + dst_pmd, dst_pmdval, dst_ptl, src_ptl); + goto out; + } + + /* + * Pin and lock both source folio and anon_vma. Since we are in + * RCU read section, we can't block, so on contention have to + * unmap the ptes, obtain the lock and retry. + */ + if (!src_folio) { + struct folio *folio; /* - * Pin and lock both source folio and anon_vma. Since we are in - * RCU read section, we can't block, so on contention have to - * unmap the ptes, obtain the lock and retry. + * Pin the page while holding the lock to be sure the + * page isn't freed under us */ - if (!src_folio) { - struct folio *folio; + spin_lock(src_ptl); + if (!pte_same(orig_src_pte, ptep_get(src_pte))) { + spin_unlock(src_ptl); + err = -EAGAIN; + goto out; + } - /* - * Pin the page while holding the lock to be sure the - * page isn't freed under us - */ - spin_lock(src_ptl); - if (!pte_same(orig_src_pte, ptep_get(src_pte))) { + if (pte_present(orig_src_pte)) { + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte); + if (!folio) { spin_unlock(src_ptl); - err = -EAGAIN; + err = -EBUSY; goto out; } - - folio = vm_normal_folio(src_vma, src_addr, orig_src_pte); - if (!folio || !PageAnonExclusive(&folio->page)) { + if (!PageAnonExclusive(&folio->page)) { spin_unlock(src_ptl); err = -EBUSY; goto out; } - folio_get(folio); - src_folio = folio; - src_folio_pte = orig_src_pte; - spin_unlock(src_ptl); - - if (!folio_trylock(src_folio)) { - pte_unmap(&orig_src_pte); - pte_unmap(&orig_dst_pte); - src_pte = dst_pte = NULL; - /* now we can block and wait */ - folio_lock(src_folio); - goto retry; - } - - if (WARN_ON_ONCE(!folio_test_anon(src_folio))) { - err = -EBUSY; + } else { + /* + * Check if swapcache exists. + * If it does, we need to move the folio + * even if the PTE is a swap entry. + */ + folio = filemap_get_folio(swap_address_space(entry), + swap_cache_index(entry)); + if (IS_ERR(folio)) { + spin_unlock(src_ptl); + err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte, + orig_dst_pte, orig_src_pte, dst_pmd, + dst_pmdval, dst_ptl, src_ptl); goto out; } } - /* at this point we have src_folio locked */ - if (folio_test_large(src_folio)) { - /* split_folio() can block */ + src_folio = folio; + src_folio_pte = orig_src_pte; + spin_unlock(src_ptl); + + if (!folio_trylock(src_folio)) { pte_unmap(&orig_src_pte); pte_unmap(&orig_dst_pte); src_pte = dst_pte = NULL; - err = split_folio(src_folio); - if (err) - goto out; - /* have to reacquire the folio after it got split */ - folio_unlock(src_folio); - folio_put(src_folio); - src_folio = NULL; + /* now we can block and wait */ + folio_lock(src_folio); goto retry; } - if (!src_anon_vma) { - /* - * folio_referenced walks the anon_vma chain - * without the folio lock. Serialize against it with - * the anon_vma lock, the folio lock is not enough. 
- */ - src_anon_vma = folio_get_anon_vma(src_folio); - if (!src_anon_vma) { - /* page was unmapped from under us */ - err = -EAGAIN; - goto out; - } - if (!anon_vma_trylock_write(src_anon_vma)) { - pte_unmap(&orig_src_pte); - pte_unmap(&orig_dst_pte); - src_pte = dst_pte = NULL; - /* now we can block and wait */ - anon_vma_lock_write(src_anon_vma); - goto retry; - } + if (WARN_ON_ONCE(!folio_test_anon(src_folio))) { + err = -EBUSY; + goto out; } + } - err = move_present_pte(mm, dst_vma, src_vma, - dst_addr, src_addr, dst_pte, src_pte, - orig_dst_pte, orig_src_pte, dst_pmd, - dst_pmdval, dst_ptl, src_ptl, src_folio); - } else { - entry = pte_to_swp_entry(orig_src_pte); - if (non_swap_entry(entry)) { - if (is_migration_entry(entry)) { - pte_unmap(&orig_src_pte); - pte_unmap(&orig_dst_pte); - src_pte = dst_pte = NULL; - migration_entry_wait(mm, src_pmd, src_addr); - err = -EAGAIN; - } else - err = -EFAULT; + /* at this point we have src_folio locked */ + if (folio_test_large(src_folio)) { + /* split_folio() can block */ + pte_unmap(&orig_src_pte); + pte_unmap(&orig_dst_pte); + src_pte = dst_pte = NULL; + err = split_folio(src_folio); + if (err) goto out; - } + /* have to reacquire the folio after it got split */ + folio_unlock(src_folio); + folio_put(src_folio); + src_folio = NULL; + goto retry; + } - err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte, - orig_dst_pte, orig_src_pte, dst_pmd, - dst_pmdval, dst_ptl, src_ptl); + if (!src_anon_vma && pte_present(orig_src_pte)) { + /* + * folio_referenced walks the anon_vma chain + * without the folio lock. Serialize against it with + * the anon_vma lock, the folio lock is not enough. + * In the swapcache case, the folio has been unmapped, + * so there is no concurrent rmap walk. + */ + src_anon_vma = folio_get_anon_vma(src_folio); + if (!src_anon_vma) { + /* page was unmapped from under us */ + err = -EAGAIN; + goto out; + } + if (!anon_vma_trylock_write(src_anon_vma)) { + pte_unmap(&orig_src_pte); + pte_unmap(&orig_dst_pte); + src_pte = dst_pte = NULL; + /* now we can block and wait */ + anon_vma_lock_write(src_anon_vma); + goto retry; + } } + err = move_pte_and_folio(mm, dst_vma, src_vma, + dst_addr, src_addr, dst_pte, src_pte, + orig_dst_pte, orig_src_pte, dst_pmd, + dst_pmdval, dst_ptl, src_ptl, src_folio); + out: if (src_anon_vma) { anon_vma_unlock_write(src_anon_vma); @@ -1351,6 +1387,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pte_unmap(src_pte); mmu_notifier_invalidate_range_end(&range); + if (si) + put_swap_device(si); return err; }
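A note for readers following the thread: the patch above returns -EBUSY or
-EAGAIN on several paths, and the discussion below assumes that userspace
(e.g. ART's GC) reacts by retrying or falling back to a copy. A minimal
userspace sketch of that policy follows; the helper name and the bounded
retry policy are inventions for illustration, while UFFDIO_MOVE and
UFFDIO_COPY themselves are the real uAPI from <linux/userfaultfd.h>.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Illustrative sketch only: tries UFFDIO_MOVE, retries transient
 * failures a few times, then falls back to UFFDIO_COPY. Partial-move
 * bookkeeping (mv.move) is deliberately ignored for brevity.
 */
static int move_or_copy(int uffd, unsigned long dst, unsigned long src,
			unsigned long len)
{
	struct uffdio_move mv = { .dst = dst, .src = src, .len = len, .mode = 0 };
	int retries = 16;

	while (ioctl(uffd, UFFDIO_MOVE, &mv) == -1) {
		if (errno == EAGAIN && retries--)
			continue;	/* transient, e.g. raced with migration */
		if (errno == EBUSY || errno == EAGAIN) {
			/*
			 * The folio could not be moved (e.g. swapcache or a
			 * large folio that failed to split): copy instead,
			 * reading the bytes from the still-mapped source.
			 */
			struct uffdio_copy cp = { .dst = dst, .src = src,
						  .len = len, .mode = 0 };
			return ioctl(uffd, UFFDIO_COPY, &cp);
		}
		return -1;
	}
	return 0;
}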
On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.25 21:37, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>
> >>> [...]
> >>>
> >>> Fully fixing it would be quite complex, requiring similar handling
> >>> of folios as done in move_present_pte.
> >>
> >> How complex would that be? Is it a matter of adding
> >> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >> folio->index = linear_page_index like in move_present_pte() or
> >> something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >         if (folio_test_writeback(folio))
> >                 return -EBUSY;
> >
> > This is likely true for swapcache, right? However, even for
> > move_present_pte(), it simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> >         /* at this point we have src_folio locked */
> >         if (folio_test_large(src_folio)) {
> >                 /* split_folio() can block */
> >                 pte_unmap(&orig_src_pte);
> >                 pte_unmap(&orig_dst_pte);
> >                 src_pte = dst_pte = NULL;
> >                 err = split_folio(src_folio);
> >                 if (err)
> >                         goto out;
> >
> >                 /* have to reacquire the folio after it got split */
> >                 folio_unlock(src_folio);
> >                 folio_put(src_folio);
> >                 src_folio = NULL;
> >                 goto retry;
> >         }
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > By the way, I have also reported that userfaultfd_move() has a fundamental
> > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > kernel. In this scenario, folios in the virtual zone won't be split in
> > split_folio(). Instead, the large folio migrates into nr_pages small folios.
> > Thus, the best-case scenario would be:
> >
> > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > dst_addr
> >
> > While this works, it negates the performance benefits of
> > userfaultfd_move(), as it introduces two PTE operations (migration in
> > split_folio() and move in userfaultfd_move() while retrying), nr_pages
> > memory allocations, and still requires one memcpy(). This could end up
> > performing even worse than userfaultfd_copy(), I guess.
> > The worst-case scenario would be failing to allocate small folios in
> > split_folio(), so userfaultfd_move() might return -ENOMEM?
>
> Although that's an Android problem and not an upstream problem, I'll
> note that there are other reasons why the split / move might fail, and
> user space either must retry or fall back to a COPY.
>
> Regarding mTHP, we could move the whole folio if the user space-provided
> range allows for batching over multiple PTEs (nr_ptes), they are in a
> single VMA, and folio_mapcount() == nr_ptes.
>
> There are corner cases to handle, such as moving mTHPs such that they
> suddenly cross two page tables, I assume, that are harder to handle when
> not moving individual PTEs, where that cannot happen.

This is a useful suggestion. I've heard that Lokesh is also interested in
modifying ART to perform moves at the mTHP granularity, which would require
kernel modifications as well. It's likely the direction we'll take after
fixing the current urgent bugs. The current split_folio() really isn't ideal.

The corner cases you mentioned are definitely worth considering. However,
once we can perform batch UFFDIO_MOVE, I believe that in most cases the
conflict between userfaultfd_move() and TAO will be resolved? For those
corner cases, ART will still need to be fully aware that falling back to
copy or retrying is necessary.

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
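A side note on the folio_wait_writeback() question raised in the quoted
discussion above: one minimal way it could slot into the quoted
move_pages_pte() snippet, assuming (as the snippet states) that the folio
lock is already held at that point. This is a sketch of the idea, not a
tested patch:

	/* at this point we have src_folio locked */
	if (folio_test_large(src_folio)) {
		pte_unmap(&orig_src_pte);
		pte_unmap(&orig_dst_pte);
		src_pte = dst_pte = NULL;
		/*
		 * Both calls can block; the folio lock is held, which
		 * folio_wait_writeback() requires.
		 */
		folio_wait_writeback(src_folio);
		err = split_folio(src_folio);
		if (err)
			goto out;
		/* have to reacquire the folio after it got split */
		folio_unlock(src_folio);
		folio_put(src_folio);
		src_folio = NULL;
		goto retry;
	}

As the later messages show, waiting for writeback alone is not sufficient
for the swapcache case, since split_folio() also bails out on unmapped
folios.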
On 20.02.25 10:31, Barry Song wrote:
> On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> Regarding mTHP, we could move the whole folio if the user space-provided
>> range allows for batching over multiple PTEs (nr_ptes), they are in a
>> single VMA, and folio_mapcount() == nr_ptes.
>>
>> There are corner cases to handle, such as moving mTHPs such that they
>> suddenly cross two page tables, I assume, that are harder to handle when
>> not moving individual PTEs, where that cannot happen.
>
> This is a useful suggestion. I've heard that Lokesh is also interested in
> modifying ART to perform moves at the mTHP granularity, which would require
> kernel modifications as well. It's likely the direction we'll take after
> fixing the current urgent bugs. The current split_folio() really isn't ideal.
>
> The corner cases you mentioned are definitely worth considering. However,
> once we can perform batch UFFDIO_MOVE, I believe that in most cases the
> conflict between userfaultfd_move() and TAO will be resolved?

Well, as soon as you would have varying mTHP sizes, you'd still run into
the split with TAO. Maybe that doesn't apply with Android today, but I
can just guess that performing sub-mTHP moving would still be required
for GC at some point.
On 20.02.25 10:21, Barry Song wrote:
> On Thu, Feb 20, 2025 at 9:40 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.02.25 19:58, Suren Baghdasaryan wrote:
>>> On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.02.25 19:26, Suren Baghdasaryan wrote:
>>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> Fully fixing it would be quite complex, requiring similar handling
>>>>>> of folios as done in move_present_pte.
>>>>>
>>>>> How complex would that be? Is it a matter of adding
>>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>>>> folio->index = linear_page_index like in move_present_pte() or
>>>>> something more?
>>>>
>>>> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
>>>> be pinned and we may be able to move it I think.
>>>>
>>>> So all that's required is to check pte_swp_exclusive() and the folio size.
>>>>
>>>> ... in theory :) Not sure about the swap details.
>>>
>>> Looking some more into it, I think we would have to perform all the
>>> folio and anon_vma locking and pinning that we do for present pages in
>>> move_pages_pte(). If that's correct then maybe treating swapcache
>>> pages like a present page inside move_pages_pte() would be simpler?
>>
>> I'd be more in favor of not doing that. Maybe there are parts we can
>> move out into helper functions instead, so we can reuse them?
>
> I actually have a v2 ready. Maybe we can discuss if some of the code can be
> extracted as a helper based on the below before I send it formally?
>
> I'd say there are many parts that can be shared with present PTE, but there
> are two major differences:
>
> 1. Page exclusivity – swapcache doesn't require it (try_to_unmap_one has
>    removed the Exclusive flag)
> 2. src_anon_vma and its lock – swapcache doesn't require it (the folio is
>    not mapped)

That's a lot of complicated code you have there (not your fault, it's
complicated stuff ... ) :)

Some of it might be compressed/simplified by the use of "else if".

I'll try to take a closer look later (will have to apply it to see the
context better). Just one independent comment because I stumbled over
this recently:

[...]

> @@ -1062,10 +1063,13 @@ static int move_present_pte(struct mm_struct *mm,
>  	folio_move_anon_rmap(src_folio, dst_vma);
>  	src_folio->index = linear_page_index(dst_vma, dst_addr);
>
> -	orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> -	/* Follow mremap() behavior and treat the entry dirty after the move */
> -	orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
> -
> +	if (pte_present(orig_src_pte)) {
> +		orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> +		/* Follow mremap() behavior and treat the entry dirty after the move */
> +		orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);

I'll note that the comment and mkdirty is misleading/wrong. It's
softdirty that we care about only. But that is something independent
of this change.

For swp PTEs, we maybe also would want to set softdirty. See
move_soft_dirty_pte() on what is actually done on the mremap path.
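For reference, the mremap-path behavior David points to looks like the
following; this is move_soft_dirty_pte() as found in mm/mremap.c, quoted
here for convenience and worth double-checking against the current tree:

static pte_t move_soft_dirty_pte(pte_t pte)
{
	/*
	 * Set soft dirty bit so we can notice
	 * in userspace the ptes were moved.
	 */
#ifdef CONFIG_MEM_SOFT_DIRTY
	if (pte_present(pte))
		pte = pte_mksoft_dirty(pte);
	else if (is_swap_pte(pte))
		pte = pte_swp_mksoft_dirty(pte);
#endif
	return pte;
}

Note how it handles both present PTEs and swap PTEs, which is exactly the
distinction the v2 patch has to make in move_pte_and_folio().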
On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.02.25 10:31, Barry Song wrote:
> > [...]
> >
> > The corner cases you mentioned are definitely worth considering. However,
> > once we can perform batch UFFDIO_MOVE, I believe that in most cases the
> > conflict between userfaultfd_move() and TAO will be resolved?
>
> Well, as soon as you would have varying mTHP sizes, you'd still run into
> the split with TAO. Maybe that doesn't apply with Android today, but I
> can just guess that performing sub-mTHP moving would still be required
> for GC at some point.

With patch v2[1], as discussed in my previous email, I have observed that
small folios consistently succeed without crashing. Similarly, mTHP no
longer crashes; however, it still returns -EBUSY during the raced time
window, even after adding folio_wait_writeback(). While I previously
mentioned that writeback prevents the mTHP from being split, it is not
the only factor. split_folio() still returns -EBUSY because
folio_get_anon_vma() returns NULL when the folio is not mapped:

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
				     unsigned int new_order)
{
	anon_vma = folio_get_anon_vma(folio);
	if (!anon_vma) {
		ret = -EBUSY;
		goto out;
	}

	end = -1;
	mapping = NULL;
	anon_vma_lock_write(anon_vma);
}

Even if the mTHP is not from TAO's virtual zone, userfaultfd_move() will
still fail when performing sub-mTHP moving in the swapcache case due to:

struct anon_vma *folio_get_anon_vma(const struct folio *folio)
{
	...
	if (!folio_mapped(folio))
		goto out;
	...
}

We likely need to modify split_folio() to support splitting unmapped anon
folios within the swapcache, or introduce a new function like
split_unmapped_anon_folio()? Otherwise, userspace will have to fall back
to UFFDIO_COPY or retry.

As it stands, I see no way for sub-mTHP to survive the move with the
current code within the existing race window. For mTHP, there is
essentially no difference from returning -EBUSY immediately upon detecting
that it is within the swapcache, as proposed in v1.

[1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
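To make the split_unmapped_anon_folio() idea above concrete, here is a
purely hypothetical interface sketch. No such function exists in any tree,
and the actual split machinery is elided; only the entry conditions that
differ from split_folio() are shown:

/*
 * HYPOTHETICAL interface, for discussion only: split an anon folio that
 * sits unmapped in the swapcache, held locked by the caller.
 */
static int split_unmapped_anon_folio(struct folio *folio)
{
	VM_WARN_ON_ONCE(folio_mapped(folio));
	VM_WARN_ON_ONCE(!folio_test_locked(folio));

	/* still required: a folio under writeback cannot be split */
	if (folio_test_writeback(folio))
		return -EBUSY;

	/*
	 * An unmapped anon folio has no rmap walkers to serialize
	 * against, so no anon_vma lock would be taken here. The hard
	 * part, not shown, is that the remaining split machinery
	 * currently assumes anon_vma != NULL.
	 */
	return -EOPNOTSUPP;	/* placeholder: split body elided */
}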
On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote:

Wait, this attribution belongs to the next message; here is Lokesh's reply:

On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > [...]
> >
> > Well, as soon as you would have varying mTHP sizes, you'd still run into
> > the split with TAO. Maybe that doesn't apply with Android today, but I
> > can just guess that performing sub-mTHP moving would still be required
> > for GC at some point.
>
> With patch v2[1], as discussed in my previous email, I have observed that
> small folios consistently succeed without crashing. Similarly, mTHP no
> longer crashes; however, it still returns -EBUSY during the raced time
> window, even after adding folio_wait_writeback().
>
> [...]
>
> Even if the mTHP is not from TAO's virtual zone, userfaultfd_move() will
> still fail when performing sub-mTHP moving in the swapcache case due to:

Just to clarify my doubt: what do you mean by sub-mTHP? Also, when you
say 'small folio' above, do you mean single-page folios? Am I
understanding correctly that your patch correctly handles the case of
moving a single swapcache page?

>
> struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> {
> 	...
> 	if (!folio_mapped(folio))
> 		goto out;
> 	...
> }
>
> We likely need to modify split_folio() to support splitting unmapped anon
> folios within the swapcache, or introduce a new function like
> split_unmapped_anon_folio()? Otherwise, userspace will have to fall back
> to UFFDIO_COPY or retry.
>
> As it stands, I see no way for sub-mTHP to survive the move with the
> current code within the existing race window. For mTHP, there is
> essentially no difference from returning -EBUSY immediately upon detecting
> that it is within the swapcache, as proposed in v1.
>
> [1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/
On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > [...]
> >
> > With patch v2[1], as discussed in my previous email, I have observed that
> > small folios consistently succeed without crashing.
Similarly, mTHP no > > longer crashes; however, it still returns -EBUSY during the raced time > > window, even after adding folio_wait_writeback. While I previously > > mentioned that folio_writeback prevents mTHP from splitting, this is not > > the only factor. The split_folio() function still returns -EBUSY because > > folio_get_anon_vma(folio) returns NULL when the folio is not mapped. > > > > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > > unsigned int new_order) > > { > > anon_vma = folio_get_anon_vma(folio); > > if (!anon_vma) { > > ret = -EBUSY; > > goto out; > > } > > > > end = -1; > > mapping = NULL; > > anon_vma_lock_write(anon_vma); > > } > > > > Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still > > fail when performing sub-mTHP moving in the swap cache case due to: > > Just to clarify my doubt. What do you mean by sub-mTHP? Also when you > say 'small folio' above, do you mean single-page folios? This means any moving size smaller than the size of mTHP, or moving a partial mTHP. > > Am I understanding correctly that your patch correctly handles moving > single swap-cache page case? Yes, the crash is fixed for both small and large folios, and for small folios, moving is consistently successful (even for the swapcache case). The only issue is that sub-mTHP moving constantly fails for the swapcache case because split_folio() fails even after waiting for writeback, as split_folio() can only split mapped folios - which is not the case for swapcache folios, since try_to_unmap_one() has already been done. So I'd say for mTHP, returning -EBUSY as early as possible is the better choice, to avoid wasting time only to return -EBUSY anyway, unless we want to rework split_folio() itself. > > > > struct anon_vma *folio_get_anon_vma(const struct folio *folio) > > { > > ... > > if (!folio_mapped(folio)) > > goto out; > > ... > > } > > > > We likely need to modify split_folio() to support splitting unmapped anon > > folios within the swap cache or introduce a new function like > > split_unmapped_anon_folio()? Otherwise, userspace will have to fall back > > to UFFDIO_COPY or retry. > > > > As it stands, I see no way for sub-mTHP to survive moving with the current > > code and within the existing raced window. For mTHP, there is essentially > > no difference between returning -EBUSY immediately upon detecting that it > > is within the swap cache, as proposed in v1. > > > > [1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/ > > > > > > > > -- > > > Cheers, > > > > > > David / dhildenb > > > > > Thanks Barry
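A minimal sketch of the early -EBUSY idea Barry describes above, assuming a move_pages_pte()-style context; the helper name can_move_swap_pte() is invented for illustration, and swap-device pinning (get_swap_device()) and PTL re-validation are elided:

/*
 * Minimal sketch, not the actual patch: detect a swap PTE whose folio
 * is still in the swap cache and bail out early for large folios,
 * since split_folio() cannot split an unmapped anon folio anyway.
 */
static int can_move_swap_pte(pte_t orig_src_pte)
{
	swp_entry_t entry = pte_to_swp_entry(orig_src_pte);
	struct folio *folio;

	if (non_swap_entry(entry))
		return -EBUSY;	/* migration entries etc. need their own handling */

	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	if (IS_ERR(folio))
		return 0;	/* not in the swap cache: a plain PTE copy is safe */

	if (folio_test_large(folio)) {
		/* unmapped large folio: split_folio() would fail anyway */
		folio_put(folio);
		return -EBUSY;
	}

	folio_put(folio);
	return 0;	/* single-page swapcache folio: can be fixed up */
}

In move_pages_pte(), such a check would run after the swap PTE is read but before it is copied to dst_addr, so the -EBUSY surfaces to userspace, which can then retry or fall back to UFFDIO_COPY.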
On 20.02.25 23:26, Barry Song wrote: > On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote: >> >> On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote: >>> >>> On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 20.02.25 10:31, Barry Song wrote: >>>>> On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote: >>>>>> >>>>>> On 19.02.25 21:37, Barry Song wrote: >>>>>>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote: >>>>>>>> >>>>>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote: >>>>>>>>> >>>>>>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>>>>>> >>>>>>>>> userfaultfd_move() checks whether the PTE entry is present or a >>>>>>>>> swap entry. >>>>>>>>> >>>>>>>>> - If the PTE entry is present, move_present_pte() handles folio >>>>>>>>> migration by setting: >>>>>>>>> >>>>>>>>> src_folio->index = linear_page_index(dst_vma, dst_addr); >>>>>>>>> >>>>>>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies >>>>>>>>> the PTE to the new dst_addr. >>>>>>>>> >>>>>>>>> This approach is incorrect because even if the PTE is a swap >>>>>>>>> entry, it can still reference a folio that remains in the swap >>>>>>>>> cache. >>>>>>>>> >>>>>>>>> If do_swap_page() is triggered, it may locate the folio in the >>>>>>>>> swap cache. However, during add_rmap operations, a kernel panic >>>>>>>>> can occur due to: >>>>>>>>> page_pgoff(folio, page) != linear_page_index(vma, address) >>>>>>>> >>>>>>>> Thanks for the report and reproducer! >>>>>>>> >>>>>>>>> >>>>>>>>> $./a.out > /dev/null >>>>>>>>> [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c >>>>>>>>> [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 >>>>>>>>> [ 13.337716] memcg:ffff00000405f000 >>>>>>>>> [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) >>>>>>>>> [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 >>>>>>>>> [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 >>>>>>>>> [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 >>>>>>>>> [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 >>>>>>>>> [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 >>>>>>>>> [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 >>>>>>>>> [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) >>>>>>>>> [ 13.340190] ------------[ cut here ]------------ >>>>>>>>> [ 13.340316] kernel BUG at mm/rmap.c:1380! 
>>>>>>>>> [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP >>>>>>>>> [ 13.340969] Modules linked in: >>>>>>>>> [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 >>>>>>>>> [ 13.341470] Hardware name: linux,dummy-virt (DT) >>>>>>>>> [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>>>>>> [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 >>>>>>>>> [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 >>>>>>>>> [ 13.342018] sp : ffff80008752bb20 >>>>>>>>> [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 >>>>>>>>> [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 >>>>>>>>> [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 >>>>>>>>> [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff >>>>>>>>> [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f >>>>>>>>> [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 >>>>>>>>> [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 >>>>>>>>> [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 >>>>>>>>> [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 >>>>>>>>> [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f >>>>>>>>> [ 13.343876] Call trace: >>>>>>>>> [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) >>>>>>>>> [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 >>>>>>>>> [ 13.344333] do_swap_page+0x1060/0x1400 >>>>>>>>> [ 13.344417] __handle_mm_fault+0x61c/0xbc8 >>>>>>>>> [ 13.344504] handle_mm_fault+0xd8/0x2e8 >>>>>>>>> [ 13.344586] do_page_fault+0x20c/0x770 >>>>>>>>> [ 13.344673] do_translation_fault+0xb4/0xf0 >>>>>>>>> [ 13.344759] do_mem_abort+0x48/0xa0 >>>>>>>>> [ 13.344842] el0_da+0x58/0x130 >>>>>>>>> [ 13.344914] el0t_64_sync_handler+0xc4/0x138 >>>>>>>>> [ 13.345002] el0t_64_sync+0x1ac/0x1b0 >>>>>>>>> [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) >>>>>>>>> [ 13.345504] ---[ end trace 0000000000000000 ]--- >>>>>>>>> [ 13.345715] note: a.out[107] exited with irqs disabled >>>>>>>>> [ 13.345954] note: a.out[107] exited with preempt_count 2 >>>>>>>>> >>>>>>>>> Fully fixing it would be quite complex, requiring similar handling >>>>>>>>> of folios as done in move_present_pte. >>>>>>>> >>>>>>>> How complex would that be? Is it a matter of adding >>>>>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and >>>>>>>> folio->index = linear_page_index like in move_present_pte() or >>>>>>>> something more? >>>>>>> >>>>>>> My main concern is still with large folios that require a split_folio() >>>>>>> during move_pages(), as the entire folio shares the same index and >>>>>>> anon_vma. However, userfaultfd_move() moves pages individually, >>>>>>> making a split necessary. >>>>>>> >>>>>>> However, in split_huge_page_to_list_to_order(), there is a: >>>>>>> >>>>>>> if (folio_test_writeback(folio)) >>>>>>> return -EBUSY; >>>>>>> >>>>>>> This is likely true for swapcache, right? 
However, even for move_present_pte(), >>>>>>> it simply returns -EBUSY: >>>>>>> >>>>>>> move_pages_pte() >>>>>>> { >>>>>>> /* at this point we have src_folio locked */ >>>>>>> if (folio_test_large(src_folio)) { >>>>>>> /* split_folio() can block */ >>>>>>> pte_unmap(&orig_src_pte); >>>>>>> pte_unmap(&orig_dst_pte); >>>>>>> src_pte = dst_pte = NULL; >>>>>>> err = split_folio(src_folio); >>>>>>> if (err) >>>>>>> goto out; >>>>>>> >>>>>>> /* have to reacquire the folio after it got split */ >>>>>>> folio_unlock(src_folio); >>>>>>> folio_put(src_folio); >>>>>>> src_folio = NULL; >>>>>>> goto retry; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> Do we need a folio_wait_writeback() before calling split_folio()? >>>>>>> >>>>>>> By the way, I have also reported that userfaultfd_move() has a fundamental >>>>>>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common >>>>>>> kernel. In this scenario, folios in the virtual zone won’t be split in >>>>>>> split_folio(). Instead, the large folio migrates into nr_pages small folios. >>>>>> > > Thus, the best-case scenario would be: >>>>>>> >>>>>>> mTHP -> migrate to small folios in split_folio() -> move small folios to >>>>>>> dst_addr >>>>>>> >>>>>>> While this works, it negates the performance benefits of >>>>>>> userfaultfd_move(), as it introduces two PTE operations (migration in >>>>>>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory >>>>>>> allocations, and still requires one memcpy(). This could end up >>>>>>> performing even worse than userfaultfd_copy(), I guess. >>>>>> > > The worst-case scenario would be failing to allocate small folios in >>>>>>> split_folio(), then userfaultfd_move() might return -ENOMEM? >>>>>> >>>>>> Although that's an Android problem and not an upstream problem, I'll >>>>>> note that there are other reasons why the split / move might fail, and >>>>>> user space either must retry or fallback to a COPY. >>>>>> >>>>>> Regarding mTHP, we could move the whole folio if the user space-provided >>>>>> range allows for batching over multiple PTEs (nr_ptes), they are in a >>>>>> single VMA, and folio_mapcount() == nr_ptes. >>>>>> >>>>>> There are corner cases to handle, such as moving mTHPs such that they >>>>>> suddenly cross two page tables I assume, that are harder to handle when >>>>>> not moving individual PTEs where that cannot happen. >>>>> >>>>> This is a useful suggestion. I’ve heard that Lokesh is also interested in >>>>> modifying ART to perform moves at the mTHP granularity, which would require >>>>> kernel modifications as well. It’s likely the direction we’ll take after >>>>> fixing the current urgent bugs. The current split_folio() really isn’t ideal. >>>>> >>>>> The corner cases you mentioned are definitely worth considering. However, >>>>> once we can perform batch UFFDIO_MOVE, I believe that in most cases, >>>>> the conflict between userfaultfd_move() and TAO will be resolved ? >>>> >>>> Well, as soon as you would have varying mTHP sizes, you'd still run into >>>> the split with TAO. Maybe that doesn't apply with Android today, but I >>>> can just guess that performing sub-mTHP moving would still be required >>>> for GC at some point. >>> >>> With patch v2[1], as discussed in my previous email, I have observed that >>> small folios consistently succeed without crashing. Similarly, mTHP no >>> longer crashes; however, it still returns -EBUSY during the raced time >>> window, even after adding folio_wait_writeback. 
While I previously >>> mentioned that folio_writeback prevents mTHP from splitting, this is not >>> the only factor. The split_folio() function still returns -EBUSY because >>> folio_get_anon_vma(folio) returns NULL when the folio is not mapped. >>> >>> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, >>> unsigned int new_order) >>> { >>> anon_vma = folio_get_anon_vma(folio); >>> if (!anon_vma) { >>> ret = -EBUSY; >>> goto out; >>> } >>> >>> end = -1; >>> mapping = NULL; >>> anon_vma_lock_write(anon_vma); >>> } >>> >>> Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still >>> fail when performing sub-mTHP moving in the swap cache case due to: >> >> Just to clarify my doubt. What do you mean by sub-mTHP? Also when you >> say 'small folio' above, do you mean single-page folios? > > This means any moving size smaller than the size of mTHP, or moving > a partial mTHP. > >> >> Am I understanding correctly that your patch correctly handles moving >> single swap-cache page case? > > Yes, the crash is fixed for both small and large folios, and for small > folios, moving is consistently successful(even for the swapcache case). > The only issue is that sub-mTHP moving constantly fails for the swapcache > case because split_folio() fails, even after waiting for writeback as > split_folio() > can only split mapped folios - which is false for swapcache since > try_to_unmap_one() has been done. I mean, we (as the caller of split_folio()) have the VMA + anon_vma in our hands. Do we only have to bypass that mapping check, or is there something else that would block us?
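To make the question above concrete, here is a purely hypothetical sketch; the function below does not exist in the kernel (the name is borrowed from Barry's suggestion), and the actual freeze/split machinery is elided. It only shows how a caller that already holds the VMA, and hence vma->anon_vma, could hand the anon_vma to the split path instead of going through folio_get_anon_vma():

/*
 * Hypothetical interface, for illustration only: split an unmapped anon
 * folio in the swap cache using a caller-supplied anon_vma, bypassing
 * the folio_get_anon_vma() lookup that fails once the folio is no
 * longer mapped.
 */
int split_unmapped_anon_folio(struct folio *folio, struct anon_vma *anon_vma)
{
	VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);

	if (folio_test_writeback(folio))
		return -EBUSY;

	anon_vma_lock_write(anon_vma);
	/* ... freeze references and split to order 0, as the regular path does ... */
	anon_vma_unlock_write(anon_vma);

	return 0;
}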
On Thu, Feb 20, 2025 at 2:27 PM Barry Song <21cnbao@gmail.com> wrote: > > On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > > > On 20.02.25 10:31, Barry Song wrote: > > > > > On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote: > > > > >> > > > > >> On 19.02.25 21:37, Barry Song wrote: > > > > >>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > >>>> > > > > >>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote: > > > > >>>>> > > > > >>>>> From: Barry Song <v-songbaohua@oppo.com> > > > > >>>>> > > > > >>>>> userfaultfd_move() checks whether the PTE entry is present or a > > > > >>>>> swap entry. > > > > >>>>> > > > > >>>>> - If the PTE entry is present, move_present_pte() handles folio > > > > >>>>> migration by setting: > > > > >>>>> > > > > >>>>> src_folio->index = linear_page_index(dst_vma, dst_addr); > > > > >>>>> > > > > >>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies > > > > >>>>> the PTE to the new dst_addr. > > > > >>>>> > > > > >>>>> This approach is incorrect because even if the PTE is a swap > > > > >>>>> entry, it can still reference a folio that remains in the swap > > > > >>>>> cache. > > > > >>>>> > > > > >>>>> If do_swap_page() is triggered, it may locate the folio in the > > > > >>>>> swap cache. However, during add_rmap operations, a kernel panic > > > > >>>>> can occur due to: > > > > >>>>> page_pgoff(folio, page) != linear_page_index(vma, address) > > > > >>>> > > > > >>>> Thanks for the report and reproducer! > > > > >>>> > > > > >>>>> > > > > >>>>> $./a.out > /dev/null > > > > >>>>> [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c > > > > >>>>> [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 > > > > >>>>> [ 13.337716] memcg:ffff00000405f000 > > > > >>>>> [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) > > > > >>>>> [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > >>>>> [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > >>>>> [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > >>>>> [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > >>>>> [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 > > > > >>>>> [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 > > > > >>>>> [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) > > > > >>>>> [ 13.340190] ------------[ cut here ]------------ > > > > >>>>> [ 13.340316] kernel BUG at mm/rmap.c:1380! 
> > > > >>>>> [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > > > > >>>>> [ 13.340969] Modules linked in: > > > > >>>>> [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 > > > > >>>>> [ 13.341470] Hardware name: linux,dummy-virt (DT) > > > > >>>>> [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > > >>>>> [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 > > > > >>>>> [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 > > > > >>>>> [ 13.342018] sp : ffff80008752bb20 > > > > >>>>> [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 > > > > >>>>> [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 > > > > >>>>> [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 > > > > >>>>> [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff > > > > >>>>> [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f > > > > >>>>> [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 > > > > >>>>> [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 > > > > >>>>> [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 > > > > >>>>> [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 > > > > >>>>> [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f > > > > >>>>> [ 13.343876] Call trace: > > > > >>>>> [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) > > > > >>>>> [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 > > > > >>>>> [ 13.344333] do_swap_page+0x1060/0x1400 > > > > >>>>> [ 13.344417] __handle_mm_fault+0x61c/0xbc8 > > > > >>>>> [ 13.344504] handle_mm_fault+0xd8/0x2e8 > > > > >>>>> [ 13.344586] do_page_fault+0x20c/0x770 > > > > >>>>> [ 13.344673] do_translation_fault+0xb4/0xf0 > > > > >>>>> [ 13.344759] do_mem_abort+0x48/0xa0 > > > > >>>>> [ 13.344842] el0_da+0x58/0x130 > > > > >>>>> [ 13.344914] el0t_64_sync_handler+0xc4/0x138 > > > > >>>>> [ 13.345002] el0t_64_sync+0x1ac/0x1b0 > > > > >>>>> [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) > > > > >>>>> [ 13.345504] ---[ end trace 0000000000000000 ]--- > > > > >>>>> [ 13.345715] note: a.out[107] exited with irqs disabled > > > > >>>>> [ 13.345954] note: a.out[107] exited with preempt_count 2 > > > > >>>>> > > > > >>>>> Fully fixing it would be quite complex, requiring similar handling > > > > >>>>> of folios as done in move_present_pte. > > > > >>>> > > > > >>>> How complex would that be? Is it a matter of adding > > > > >>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and > > > > >>>> folio->index = linear_page_index like in move_present_pte() or > > > > >>>> something more? > > > > >>> > > > > >>> My main concern is still with large folios that require a split_folio() > > > > >>> during move_pages(), as the entire folio shares the same index and > > > > >>> anon_vma. However, userfaultfd_move() moves pages individually, > > > > >>> making a split necessary. > > > > >>> > > > > >>> However, in split_huge_page_to_list_to_order(), there is a: > > > > >>> > > > > >>> if (folio_test_writeback(folio)) > > > > >>> return -EBUSY; > > > > >>> > > > > >>> This is likely true for swapcache, right? 
However, even for move_present_pte(), > > > > >>> it simply returns -EBUSY: > > > > >>> > > > > >>> move_pages_pte() > > > > >>> { > > > > >>> /* at this point we have src_folio locked */ > > > > >>> if (folio_test_large(src_folio)) { > > > > >>> /* split_folio() can block */ > > > > >>> pte_unmap(&orig_src_pte); > > > > >>> pte_unmap(&orig_dst_pte); > > > > >>> src_pte = dst_pte = NULL; > > > > >>> err = split_folio(src_folio); > > > > >>> if (err) > > > > >>> goto out; > > > > >>> > > > > >>> /* have to reacquire the folio after it got split */ > > > > >>> folio_unlock(src_folio); > > > > >>> folio_put(src_folio); > > > > >>> src_folio = NULL; > > > > >>> goto retry; > > > > >>> } > > > > >>> } > > > > >>> > > > > >>> Do we need a folio_wait_writeback() before calling split_folio()? > > > > >>> > > > > >>> By the way, I have also reported that userfaultfd_move() has a fundamental > > > > >>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common > > > > >>> kernel. In this scenario, folios in the virtual zone won’t be split in > > > > >>> split_folio(). Instead, the large folio migrates into nr_pages small folios. > > > > >> > > Thus, the best-case scenario would be: > > > > >>> > > > > >>> mTHP -> migrate to small folios in split_folio() -> move small folios to > > > > >>> dst_addr > > > > >>> > > > > >>> While this works, it negates the performance benefits of > > > > >>> userfaultfd_move(), as it introduces two PTE operations (migration in > > > > >>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory > > > > >>> allocations, and still requires one memcpy(). This could end up > > > > >>> performing even worse than userfaultfd_copy(), I guess. > > > > >> > > The worst-case scenario would be failing to allocate small folios in > > > > >>> split_folio(), then userfaultfd_move() might return -ENOMEM? > > > > >> > > > > >> Although that's an Android problem and not an upstream problem, I'll > > > > >> note that there are other reasons why the split / move might fail, and > > > > >> user space either must retry or fallback to a COPY. > > > > >> > > > > >> Regarding mTHP, we could move the whole folio if the user space-provided > > > > >> range allows for batching over multiple PTEs (nr_ptes), they are in a > > > > >> single VMA, and folio_mapcount() == nr_ptes. > > > > >> > > > > >> There are corner cases to handle, such as moving mTHPs such that they > > > > >> suddenly cross two page tables I assume, that are harder to handle when > > > > >> not moving individual PTEs where that cannot happen. > > > > > > > > > > This is a useful suggestion. I’ve heard that Lokesh is also interested in > > > > > modifying ART to perform moves at the mTHP granularity, which would require > > > > > kernel modifications as well. It’s likely the direction we’ll take after > > > > > fixing the current urgent bugs. The current split_folio() really isn’t ideal. > > > > > > > > > > The corner cases you mentioned are definitely worth considering. However, > > > > > once we can perform batch UFFDIO_MOVE, I believe that in most cases, > > > > > the conflict between userfaultfd_move() and TAO will be resolved ? > > > > > > > > Well, as soon as you would have varying mTHP sizes, you'd still run into > > > > the split with TAO. Maybe that doesn't apply with Android today, but I > > > > can just guess that performing sub-mTHP moving would still be required > > > > for GC at some point. 
> > > > > > With patch v2[1], as discussed in my previous email, I have observed that > > > small folios consistently succeed without crashing. Similarly, mTHP no > > > longer crashes; however, it still returns -EBUSY during the raced time > > > window, even after adding folio_wait_writeback. While I previously > > > mentioned that folio_writeback prevents mTHP from splitting, this is not > > > the only factor. The split_folio() function still returns -EBUSY because > > > folio_get_anon_vma(folio) returns NULL when the folio is not mapped. > > > > > > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, > > > unsigned int new_order) > > > { > > > anon_vma = folio_get_anon_vma(folio); > > > if (!anon_vma) { > > > ret = -EBUSY; > > > goto out; > > > } > > > > > > end = -1; > > > mapping = NULL; > > > anon_vma_lock_write(anon_vma); > > > } > > > > > > Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still > > > fail when performing sub-mTHP moving in the swap cache case due to: > > > > Just to clarify my doubt. What do you mean by sub-mTHP? Also when you > > say 'small folio' above, do you mean single-page folios? > > This means any moving size smaller than the size of mTHP, or moving > a partial mTHP. > > > > > Am I understanding correctly that your patch correctly handles moving > > single swap-cache page case? > > Yes, the crash is fixed for both small and large folios, and for small > folios, moving is consistently successful(even for the swapcache case). > The only issue is that sub-mTHP moving constantly fails for the swapcache > case because split_folio() fails, even after waiting for writeback as > split_folio() > can only split mapped folios - which is false for swapcache since > try_to_unmap_one() has been done. > > So I'd say for mTHP, returning -EBUSY as early as possible is the > better choice to avoid wasting much time and eventually returning > -EBUSY anyway unless we want to modify split_folio() things. > Great! In this case, can we please fix the kernel panic bug as soon as possible. Until that is fixed, the ioctl is practically unusable. > > > > > > struct anon_vma *folio_get_anon_vma(const struct folio *folio) > > > { > > > ... > > > if (!folio_mapped(folio)) > > > goto out; > > > ... > > > } > > > > > > We likely need to modify split_folio() to support splitting unmapped anon > > > folios within the swap cache or introduce a new function like > > > split_unmapped_anon_folio()? Otherwise, userspace will have to fall back > > > to UFFDIO_COPY or retry. > > > > > > As it stands, I see no way for sub-mTHP to survive moving with the current > > > code and within the existing raced window. For mTHP, there is essentially > > > no difference between returning -EBUSY immediately upon detecting that it > > > is within the swap cache, as proposed in v1. > > > > > > [1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/ > > > > > > > > > > > -- > > > > Cheers, > > > > > > > > David / dhildenb > > > > > > > > Thanks > Barry
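For reference, the small-folio handling under discussion maps onto something like the following sketch, which mirrors the steps Suren listed earlier (a folio_maybe_dma_pinned() check, folio_move_anon_rmap(), and a folio->index update). The helper name is invented, and PTE re-validation under the PTL, folio lock ordering, and swap-device references are elided; it assumes src_folio came from a swap-cache lookup and is locked:

/*
 * Sketch of the single-page fix: mirror what move_present_pte() does so
 * that a later do_swap_page() finds consistent rmap metadata for the
 * swapcache folio.
 */
static int move_swapcache_folio(struct vm_area_struct *dst_vma,
				unsigned long dst_addr,
				struct folio *src_folio)
{
	if (folio_test_large(src_folio))
		return -EBUSY;		/* unmapped folio: split_folio() fails */

	if (folio_maybe_dma_pinned(src_folio))
		return -EBUSY;

	folio_move_anon_rmap(src_folio, dst_vma);
	src_folio->index = linear_page_index(dst_vma, dst_addr);
	return 0;
}

With single-page folios only, this path would always take the success branch, matching Barry's observation that small folios consistently survive the move.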
On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote: > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote: > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a > > > > > swap entry. > > > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio > > > > > migration by setting: > > > > > > > > > > src_folio->index = linear_page_index(dst_vma, dst_addr); > > > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies > > > > > the PTE to the new dst_addr. > > > > > > > > > > This approach is incorrect because even if the PTE is a swap > > > > > entry, it can still reference a folio that remains in the swap > > > > > cache. > > > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the > > > > > swap cache. However, during add_rmap operations, a kernel panic > > > > > can occur due to: > > > > > page_pgoff(folio, page) != linear_page_index(vma, address) > > > > > > > > Thanks for the report and reproducer! > > > > > > > > > > > > > > $./a.out > /dev/null > > > > > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c > > > > > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 > > > > > [ 13.337716] memcg:ffff00000405f000 > > > > > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) > > > > > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 > > > > > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 > > > > > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) > > > > > [ 13.340190] ------------[ cut here ]------------ > > > > > [ 13.340316] kernel BUG at mm/rmap.c:1380! 
> > > > > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > > > > > [ 13.340969] Modules linked in: > > > > > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 > > > > > [ 13.341470] Hardware name: linux,dummy-virt (DT) > > > > > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > > > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 > > > > > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 > > > > > [ 13.342018] sp : ffff80008752bb20 > > > > > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 > > > > > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 > > > > > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 > > > > > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff > > > > > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f > > > > > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 > > > > > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 > > > > > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 > > > > > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 > > > > > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f > > > > > [ 13.343876] Call trace: > > > > > [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) > > > > > [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 > > > > > [ 13.344333] do_swap_page+0x1060/0x1400 > > > > > [ 13.344417] __handle_mm_fault+0x61c/0xbc8 > > > > > [ 13.344504] handle_mm_fault+0xd8/0x2e8 > > > > > [ 13.344586] do_page_fault+0x20c/0x770 > > > > > [ 13.344673] do_translation_fault+0xb4/0xf0 > > > > > [ 13.344759] do_mem_abort+0x48/0xa0 > > > > > [ 13.344842] el0_da+0x58/0x130 > > > > > [ 13.344914] el0t_64_sync_handler+0xc4/0x138 > > > > > [ 13.345002] el0t_64_sync+0x1ac/0x1b0 > > > > > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) > > > > > [ 13.345504] ---[ end trace 0000000000000000 ]--- > > > > > [ 13.345715] note: a.out[107] exited with irqs disabled > > > > > [ 13.345954] note: a.out[107] exited with preempt_count 2 > > > > > > > > > > Fully fixing it would be quite complex, requiring similar handling > > > > > of folios as done in move_present_pte. > > > > > > > > How complex would that be? Is it a matter of adding > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and > > > > folio->index = linear_page_index like in move_present_pte() or > > > > something more? > > > > > > My main concern is still with large folios that require a split_folio() > > > during move_pages(), as the entire folio shares the same index and > > > anon_vma. However, userfaultfd_move() moves pages individually, > > > making a split necessary. > > > > > > However, in split_huge_page_to_list_to_order(), there is a: > > > > > > if (folio_test_writeback(folio)) > > > return -EBUSY; > > > > > > This is likely true for swapcache, right? 
However, even for move_present_pte(), > > > it simply returns -EBUSY: > > > > > > move_pages_pte() > > > { > > > /* at this point we have src_folio locked */ > > > if (folio_test_large(src_folio)) { > > > /* split_folio() can block */ > > > pte_unmap(&orig_src_pte); > > > pte_unmap(&orig_dst_pte); > > > src_pte = dst_pte = NULL; > > > err = split_folio(src_folio); > > > if (err) > > > goto out; > > > > > > /* have to reacquire the folio after it got split */ > > > folio_unlock(src_folio); > > > folio_put(src_folio); > > > src_folio = NULL; > > > goto retry; > > > } > > > } > > > > > > Do we need a folio_wait_writeback() before calling split_folio()? > > > > Maybe no need in the first version to fix the immediate bug? > > > > It's also not always the case to hit writeback here. IIUC, writeback only > > happens for a short window when the folio was just added into swapcache. > > MOVE can happen much later after that anytime before a swapin. My > > understanding is that's also what Matthew wanted to point out. It may be > > better justified of that in a separate change with some performance > > measurements. > > The bug we’re discussing occurs precisely within the short window you > mentioned. > > 1. add_to_swap: The folio is added to swapcache. > 2. try_to_unmap: PTEs are converted to swap entries. > 3. pageout > 4. Swapcache is cleared. Hmm, I see. I was expecting step 4 to be "writeback is cleared"... or at least that should be step 3.5, as IIUC "writeback" needs to be cleared before the "swapcache" bit is cleared. > > The issue happens between steps 2 and 4, where the PTE is not present, but > the folio is still in swapcache - the current code does move_swap_pte() but does > not fixup folio->index within swapcache. One thing I'm still not clear on here is why this is only a race condition, rather than something more severe. I mean, folio->index is definitely wrong, so as long as the page is still in swapcache, we should be able to move the swp entry over to the dest addr of UFFDIO_MOVE, read on the dest addr, and then it'll see the page in swapcache with the wrong folio->index already and trigger. I wrote a quick test like that, and it actually won't trigger... I had a closer look at the code; I think it's because do_swap_page() has the logic to detect folio->index matching first, and allocate a new folio if it doesn't match, in ksm_might_need_to_copy(). IIUC that was for ksm... but it looks like it's functioning here too. ksm_might_need_to_copy: if (folio_test_ksm(folio)) { if (folio_stable_node(folio) && !(ksm_run & KSM_RUN_UNMERGE)) return folio; /* no need to copy it */ } else if (!anon_vma) { return folio; /* no need to copy it */ } else if (folio->index == linear_page_index(vma, addr) && <---------- [1] anon_vma->root == vma->anon_vma->root) { return folio; /* still no need to copy it */ } ... new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2] ... So I believe what I hit is that at [1] it sees the index doesn't match, and then it decides to allocate a new folio. In this case, it won't hit your BUG, because it'll be "folio != swapcache" later, so it'll set up folio->index for the new folio rather than run the sanity check. Do you know how your case got triggered, bypassing the check at [1] above, which should already compare folio->index?
Lokesh > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely > the first priority. Agreed.
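For reference, the userspace side of that NOHUGEPAGE plan is a single madvise() call; a minimal illustration (the function name is invented, error handling omitted):

#include <sys/mman.h>

/*
 * Userspace illustration only: mark the GC heap MADV_NOHUGEPAGE so that
 * UFFDIO_MOVE only ever encounters single-page folios.
 */
static void gc_heap_disable_thp(void *heap, size_t len)
{
	madvise(heap, len, MADV_NOHUGEPAGE);
}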
On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> 2. src_anon_vma and its lock – swapcache doesn’t require it (folio is not mapped)
Could you help explain what guarantees that the rmap walk cannot happen on a
swapcache page?
I'm not familiar with this path, though at least I see DAMON can start an
rmap walk on PageAnon with almost no locking... some explanation would be
appreciated.
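Some context for the rmap-walk question: the anon rmap walk obtains its anon_vma through folio_lock_anon_vma_read() in mm/rmap.c, which bails out on unmapped folios much like the folio_get_anon_vma() snippet quoted earlier in the thread. A simplified sketch of that guard (abridged, not verbatim kernel code):

struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
					  struct rmap_walk_control *rwc)
{
	struct anon_vma *anon_vma = NULL;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	if (!folio_mapped(folio))
		goto out;	/* unmapped: no anon rmap walk */

	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	/* ... lock anon_vma->root->rwsem and recheck folio_mapped() ... */
out:
	rcu_read_unlock();
	return anon_vma;
}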
On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote: > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote: > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a > > > > > > swap entry. > > > > > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio > > > > > > migration by setting: > > > > > > > > > > > > src_folio->index = linear_page_index(dst_vma, dst_addr); > > > > > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies > > > > > > the PTE to the new dst_addr. > > > > > > > > > > > > This approach is incorrect because even if the PTE is a swap > > > > > > entry, it can still reference a folio that remains in the swap > > > > > > cache. > > > > > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the > > > > > > swap cache. However, during add_rmap operations, a kernel panic > > > > > > can occur due to: > > > > > > page_pgoff(folio, page) != linear_page_index(vma, address) > > > > > > > > > > Thanks for the report and reproducer! > > > > > > > > > > > > > > > > > $./a.out > /dev/null > > > > > > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c > > > > > > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 > > > > > > [ 13.337716] memcg:ffff00000405f000 > > > > > > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) > > > > > > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 > > > > > > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 > > > > > > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) > > > > > > [ 13.340190] ------------[ cut here ]------------ > > > > > > [ 13.340316] kernel BUG at mm/rmap.c:1380! 
> > > > > > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > > > > > > [ 13.340969] Modules linked in: > > > > > > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 > > > > > > [ 13.341470] Hardware name: linux,dummy-virt (DT) > > > > > > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > > > > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 > > > > > > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 > > > > > > [ 13.342018] sp : ffff80008752bb20 > > > > > > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 > > > > > > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 > > > > > > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 > > > > > > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff > > > > > > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f > > > > > > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 > > > > > > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 > > > > > > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 > > > > > > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 > > > > > > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f > > > > > > [ 13.343876] Call trace: > > > > > > [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) > > > > > > [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 > > > > > > [ 13.344333] do_swap_page+0x1060/0x1400 > > > > > > [ 13.344417] __handle_mm_fault+0x61c/0xbc8 > > > > > > [ 13.344504] handle_mm_fault+0xd8/0x2e8 > > > > > > [ 13.344586] do_page_fault+0x20c/0x770 > > > > > > [ 13.344673] do_translation_fault+0xb4/0xf0 > > > > > > [ 13.344759] do_mem_abort+0x48/0xa0 > > > > > > [ 13.344842] el0_da+0x58/0x130 > > > > > > [ 13.344914] el0t_64_sync_handler+0xc4/0x138 > > > > > > [ 13.345002] el0t_64_sync+0x1ac/0x1b0 > > > > > > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) > > > > > > [ 13.345504] ---[ end trace 0000000000000000 ]--- > > > > > > [ 13.345715] note: a.out[107] exited with irqs disabled > > > > > > [ 13.345954] note: a.out[107] exited with preempt_count 2 > > > > > > > > > > > > Fully fixing it would be quite complex, requiring similar handling > > > > > > of folios as done in move_present_pte. > > > > > > > > > > How complex would that be? Is it a matter of adding > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and > > > > > folio->index = linear_page_index like in move_present_pte() or > > > > > something more? > > > > > > > > My main concern is still with large folios that require a split_folio() > > > > during move_pages(), as the entire folio shares the same index and > > > > anon_vma. However, userfaultfd_move() moves pages individually, > > > > making a split necessary. > > > > > > > > However, in split_huge_page_to_list_to_order(), there is a: > > > > > > > > if (folio_test_writeback(folio)) > > > > return -EBUSY; > > > > > > > > This is likely true for swapcache, right? 
However, even for move_present_pte(), > > > > it simply returns -EBUSY: > > > > > > > > move_pages_pte() > > > > { > > > > /* at this point we have src_folio locked */ > > > > if (folio_test_large(src_folio)) { > > > > /* split_folio() can block */ > > > > pte_unmap(&orig_src_pte); > > > > pte_unmap(&orig_dst_pte); > > > > src_pte = dst_pte = NULL; > > > > err = split_folio(src_folio); > > > > if (err) > > > > goto out; > > > > > > > > /* have to reacquire the folio after it got split */ > > > > folio_unlock(src_folio); > > > > folio_put(src_folio); > > > > src_folio = NULL; > > > > goto retry; > > > > } > > > > } > > > > > > > > Do we need a folio_wait_writeback() before calling split_folio()? > > > > > > Maybe no need in the first version to fix the immediate bug? > > > > > > It's also not always the case to hit writeback here. IIUC, writeback only > > > happens for a short window when the folio was just added into swapcache. > > > MOVE can happen much later after that anytime before a swapin. My > > > understanding is that's also what Matthew wanted to point out. It may be > > > better justified of that in a separate change with some performance > > > measurements. > > > > The bug we’re discussing occurs precisely within the short window you > > mentioned. > > > > 1. add_to_swap: The folio is added to swapcache. > > 2. try_to_unmap: PTEs are converted to swap entries. > > 3. pageout > > 4. Swapcache is cleared. > > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at > least that should be step 3.5, as IIUC "writeback" needs to be cleared > before "swapcache" bit being cleared. > > > > > The issue happens between steps 2 and 4, where the PTE is not present, but > > the folio is still in swapcache - the current code does move_swap_pte() but does > > not fixup folio->index within swapcache. > > One thing I'm still not clear here is why it's a race condition, rather > than more severe than that. I mean, folio->index is definitely wrong, then > as long as the page still in swapcache, we should be able to move the swp > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see > the page in swapcache with the wrong folio->index already and trigger. > > I wrote a quick test like that, it actually won't trigger.. > > I had a closer look in the code, I think it's because do_swap_page() has > the logic to detect folio->index matching first, and allocate a new folio > if it doesn't match in ksm_might_need_to_copy(). IIUC that was for > ksm.. but it looks like it's functioning too here. > > ksm_might_need_to_copy: > if (folio_test_ksm(folio)) { > if (folio_stable_node(folio) && > !(ksm_run & KSM_RUN_UNMERGE)) > return folio; /* no need to copy it */ > } else if (!anon_vma) { > return folio; /* no need to copy it */ > } else if (folio->index == linear_page_index(vma, addr) && <---------- [1] > anon_vma->root == vma->anon_vma->root) { > return folio; /* still no need to copy it */ > } > ... > > new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2] > ... > > So I believe what I hit is at [1] it sees index doesn't match, then it > decided to allocate a new folio. In this case, it won't hit your BUG > because it'll be "folio != swapcache" later, so it'll setup the > folio->index for the new one, rather than the sanity check. > > Do you know how your case got triggered, being able to bypass the above [1] > which should check folio->index already? 
To understand the change I tried applying the proposed patch to both mm-unstable and Linus' ToT and got conflicts for both trees. Barry, which baseline are you using? > > > > > My point is that if we want a proper fix for mTHP, we'd better handle writeback. > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed > > in this RFC. > > > > For small folios, there’s no split_folio issue, making it relatively > > simpler. Lokesh > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely > > the first priority. > > Agreed. > > -- > Peter Xu >
On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote: > > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote: > > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a > > > > > > > swap entry. > > > > > > > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio > > > > > > > migration by setting: > > > > > > > > > > > > > > src_folio->index = linear_page_index(dst_vma, dst_addr); > > > > > > > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies > > > > > > > the PTE to the new dst_addr. > > > > > > > > > > > > > > This approach is incorrect because even if the PTE is a swap > > > > > > > entry, it can still reference a folio that remains in the swap > > > > > > > cache. > > > > > > > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the > > > > > > > swap cache. However, during add_rmap operations, a kernel panic > > > > > > > can occur due to: > > > > > > > page_pgoff(folio, page) != linear_page_index(vma, address) > > > > > > > > > > > > Thanks for the report and reproducer! > > > > > > > > > > > > > > > > > > > > $./a.out > /dev/null > > > > > > > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c > > > > > > > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 > > > > > > > [ 13.337716] memcg:ffff00000405f000 > > > > > > > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) > > > > > > > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > > > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > > > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 > > > > > > > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 > > > > > > > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 > > > > > > > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 > > > > > > > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) > > > > > > > [ 13.340190] ------------[ cut here ]------------ > > > > > > > [ 13.340316] kernel BUG at mm/rmap.c:1380! 
> > > > > > > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > > > > > > > [ 13.340969] Modules linked in: > > > > > > > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 > > > > > > > [ 13.341470] Hardware name: linux,dummy-virt (DT) > > > > > > > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > > > > > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 > > > > > > > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 > > > > > > > [ 13.342018] sp : ffff80008752bb20 > > > > > > > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 > > > > > > > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 > > > > > > > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 > > > > > > > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff > > > > > > > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f > > > > > > > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 > > > > > > > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 > > > > > > > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 > > > > > > > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 > > > > > > > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f > > > > > > > [ 13.343876] Call trace: > > > > > > > [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) > > > > > > > [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 > > > > > > > [ 13.344333] do_swap_page+0x1060/0x1400 > > > > > > > [ 13.344417] __handle_mm_fault+0x61c/0xbc8 > > > > > > > [ 13.344504] handle_mm_fault+0xd8/0x2e8 > > > > > > > [ 13.344586] do_page_fault+0x20c/0x770 > > > > > > > [ 13.344673] do_translation_fault+0xb4/0xf0 > > > > > > > [ 13.344759] do_mem_abort+0x48/0xa0 > > > > > > > [ 13.344842] el0_da+0x58/0x130 > > > > > > > [ 13.344914] el0t_64_sync_handler+0xc4/0x138 > > > > > > > [ 13.345002] el0t_64_sync+0x1ac/0x1b0 > > > > > > > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) > > > > > > > [ 13.345504] ---[ end trace 0000000000000000 ]--- > > > > > > > [ 13.345715] note: a.out[107] exited with irqs disabled > > > > > > > [ 13.345954] note: a.out[107] exited with preempt_count 2 > > > > > > > > > > > > > > Fully fixing it would be quite complex, requiring similar handling > > > > > > > of folios as done in move_present_pte. > > > > > > > > > > > > How complex would that be? Is it a matter of adding > > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and > > > > > > folio->index = linear_page_index like in move_present_pte() or > > > > > > something more? > > > > > > > > > > My main concern is still with large folios that require a split_folio() > > > > > during move_pages(), as the entire folio shares the same index and > > > > > anon_vma. However, userfaultfd_move() moves pages individually, > > > > > making a split necessary. > > > > > > > > > > However, in split_huge_page_to_list_to_order(), there is a: > > > > > > > > > > if (folio_test_writeback(folio)) > > > > > return -EBUSY; > > > > > > > > > > This is likely true for swapcache, right? 
> > > > > However, even for move_present_pte(), it simply returns -EBUSY:
> > > > >
> > > > > move_pages_pte()
> > > > > {
> > > > >         /* at this point we have src_folio locked */
> > > > >         if (folio_test_large(src_folio)) {
> > > > >                 /* split_folio() can block */
> > > > >                 pte_unmap(&orig_src_pte);
> > > > >                 pte_unmap(&orig_dst_pte);
> > > > >                 src_pte = dst_pte = NULL;
> > > > >                 err = split_folio(src_folio);
> > > > >                 if (err)
> > > > >                         goto out;
> > > > >
> > > > >                 /* have to reacquire the folio after it got split */
> > > > >                 folio_unlock(src_folio);
> > > > >                 folio_put(src_folio);
> > > > >                 src_folio = NULL;
> > > > >                 goto retry;
> > > > >         }
> > > > > }
> > > > >
> > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > >
> > > > Maybe no need in the first version to fix the immediate bug?
> > > >
> > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > happens for a short window when the folio was just added into swapcache.
> > > > MOVE can happen much later after that, anytime before a swapin. My
> > > > understanding is that's also what Matthew wanted to point out. It may be
> > > > better justified in a separate change with some performance
> > > > measurements.
> > >
> > > The bug we're discussing occurs precisely within the short window you
> > > mentioned.
> > >
> > > 1. add_to_swap: The folio is added to swapcache.
> > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > 3. pageout
> > > 4. Swapcache is cleared.
> >
> > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > before the "swapcache" bit is cleared.
> >
> > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > the folio is still in swapcache - the current code does move_swap_pte() but
> > > does not fixup folio->index within swapcache.
> >
> > One thing I'm still not clear on here is why it's a race condition, rather
> > than something more severe. I mean, folio->index is definitely wrong, so
> > as long as the page is still in swapcache, we should be able to move the swp
> > entry over to the dest addr of UFFDIO_MOVE, read on the dest addr, and then
> > it'll see the page in swapcache with the wrong folio->index already and trigger.
> >
> > I wrote a quick test like that, and it actually won't trigger..
> >
> > I had a closer look at the code; I think it's because do_swap_page() has
> > the logic to detect folio->index matching first, and allocate a new folio
> > if it doesn't match, in ksm_might_need_to_copy(). IIUC that was for
> > ksm.. but it looks like it's functioning here too.
> >
> > ksm_might_need_to_copy:
> >         if (folio_test_ksm(folio)) {
> >                 if (folio_stable_node(folio) &&
> >                     !(ksm_run & KSM_RUN_UNMERGE))
> >                         return folio;   /* no need to copy it */
> >         } else if (!anon_vma) {
> >                 return folio;           /* no need to copy it */
> >         } else if (folio->index == linear_page_index(vma, addr) &&   <---------- [1]
> >                    anon_vma->root == vma->anon_vma->root) {
> >                 return folio;           /* still no need to copy it */
> >         }
> >         ...
> >
> >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr);   <---- [2]
> >         ...
> >
> > So I believe what I hit is at [1] it sees the index doesn't match, then it
> > decided to allocate a new folio. In this case, it won't hit your BUG
> > because it'll be "folio != swapcache" later, so it'll set up the
> > folio->index for the new one, rather than hit the sanity check.
> > Do you know how your case got triggered, being able to bypass the above [1]
> > which should check folio->index already?
>
> To understand the change I tried applying the proposed patch to both
> mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> which baseline are you using?

Oops, never mind. My mistake. Copying from the email messed up tabs...
It applies cleanly.

> > >
> > > My point is that if we want a proper fix for mTHP, we'd better handle
> > > writeback. Otherwise, this isn't much different from directly returning
> > > -EBUSY as proposed in this RFC.
> > >
> > > For small folios, there's no split_folio issue, making it relatively
> > > simpler. Lokesh mentioned plans to madvise NOHUGEPAGE in ART, so fixing
> > > small folios is likely the first priority.
> >
> > Agreed.
> >
> > --
> > Peter Xu
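For readers less familiar with the API under discussion, the window Barry describes can be probed from userspace roughly as sketched below. This is only an illustration of the shape of the trigger, not the reproducer used in the report: the MADV_PAGEOUT call, sizes, and timing are assumptions, error handling is omitted, it needs a kernel with UFFDIO_MOVE (6.8+) and appropriate userfaultfd privileges, and whether the folio is still in the swapcache when UFFDIO_MOVE runs is entirely timing-dependent.

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 4096;
        long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        char *src, *dst;

        ioctl(uffd, UFFDIO_API, &api);

        src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* UFFDIO_MOVE requires the dst range to be registered */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)dst, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        memset(src, 0x5a, len);                 /* dirty anon page */

        /*
         * Ask reclaim to page it out: shortly afterwards the PTE is a
         * swap entry while the folio may still sit in the swapcache.
         */
        madvise(src, len, MADV_PAGEOUT);

        struct uffdio_move mv = {
                .dst = (unsigned long)dst,
                .src = (unsigned long)src,
                .len = len,
                .mode = 0,
        };
        ioctl(uffd, UFFDIO_MOVE, &mv);          /* may move a swap PTE */

        return dst[0];                          /* swapin fault at dst */
}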
On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > 2. src_anon_vma and its lock – swapcache doesn't require it (folio is not mapped)
>
> Could you help explain what guarantees that the rmap walk cannot happen on a
> swapcache page?
>
> I'm not familiar with this path, though at least I see damon can start a
> rmap walk on PageAnon almost with no locking.. some explanations would be
> appreciated.

I am observing the following check in folio_referenced(), which the anon_vma
lock was originally intended to protect:

        if (!pra.mapcount)
                return 0;

I assume all other rmap walks should do the same?

int folio_referenced(struct folio *folio, int is_locked,
                     struct mem_cgroup *memcg, unsigned long *vm_flags)
{
        bool we_locked = false;
        struct folio_referenced_arg pra = {
                .mapcount = folio_mapcount(folio),
                .memcg = memcg,
        };
        struct rmap_walk_control rwc = {
                .rmap_one = folio_referenced_one,
                .arg = (void *)&pra,
                .anon_lock = folio_lock_anon_vma_read,
                .try_lock = true,
                .invalid_vma = invalid_folio_referenced_vma,
        };

        *vm_flags = 0;
        if (!pra.mapcount)
                return 0;
        ...
}

By the way, since the folio has been under reclamation in this case and
isn't on the LRU, this should also prevent the rmap walk, right?

>
> --
> Peter Xu
>

Thanks
Barry
On Thu, Feb 20, 2025 at 3:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > [... full quote of the patch description, kernel BUG trace, and the
> > > writeback/split_folio discussion snipped; it is identical to the quote
> > > in the message above ...]
> > >
> > > So I believe what I hit is at [1] it sees the index doesn't match, then
> > > it decided to allocate a new folio.
> > > In this case, it won't hit your BUG
> > > because it'll be "folio != swapcache" later, so it'll set up the
> > > folio->index for the new one, rather than hit the sanity check.
> > >
> > > Do you know how your case got triggered, being able to bypass the above [1]
> > > which should check folio->index already?
> >
> > To understand the change I tried applying the proposed patch to both
> > mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> > which baseline are you using?
>
> Oops, never mind. My mistake. Copying from the email messed up tabs...
> It applies cleanly.

Overall the code seems correct to me; however, the new code has quite a
complex logical structure IMO.

The original simplified code structure is like this:

        if (pte_present(orig_src_pte)) {
                if (is_zero_pfn) {
                        move_zeropage_pte()
                        return
                }
                /* pin and lock src_folio */
                spin_lock(src_ptl)
                folio_get(folio)
                folio_trylock(folio)
                if (folio_test_large(src_folio))
                        split_folio(src_folio)
                anon_vma_trylock_write(src_anon_vma)
                move_present_pte()
        } else {
                if (non_swap_entry(entry))
                        if (is_migration_entry(entry))
                                /* handle migration entry */
                else
                        move_swap_pte()
        }

The new structure looks like this:

        if (!pte_present(orig_src_pte)) {
                if (is_migration_entry(entry)) {
                        /* handle migration entry */
                        return
                }
                if (!non_swap_entry() || !pte_swp_exclusive())
                        return
                si = get_swap_device(entry);
        }
        if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte))) {
                move_zeropage_pte()
                return
        }
        /* pin and lock src_folio */
        spin_lock(src_ptl)
        if (pte_present(orig_src_pte)) {
                folio_get(folio)
        } else {
                folio = filemap_get_folio(swap_entry)
                if (IS_ERR(folio)) {
                        move_swap_pte()
                        return
                }
        }
        folio_trylock(folio)
        if (folio_test_large(src_folio))
                split_folio(src_folio)
        if (pte_present(orig_src_pte))
                anon_vma_trylock_write(src_anon_vma)
        move_pte_and_folio()

This looks more complex and harder to follow. That might be the reason David
was not in favour of treating swapcache and present pages in the same path.
And now I would agree that refactoring some common parts and not breaking
the original structure might be cleaner.

> > > [... trailing quote snipped; identical to the message above ...]
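To make the refactoring suggestion concrete, the shared middle section could, in principle, be pulled into one helper while the present and swap paths keep their own top-level branches. The helper name and signature below are hypothetical, purely to illustrate the direction; the PTL/pte_unmap retry dance and error paths are elided:

        /*
         * Hypothetical helper, illustrative only: lock and, if needed,
         * split the already-pinned src folio, whether it came from a
         * present PTE (folio_get) or the swapcache (filemap_get_folio).
         * Returns 0 when the caller may proceed, -EAGAIN to retry.
         */
        static int uffd_lock_and_split_src_folio(struct folio *folio)
        {
                if (!folio_trylock(folio))
                        return -EAGAIN;         /* caller drops PTL, retries */
                if (folio_test_large(folio)) {
                        int err = split_folio(folio);

                        folio_unlock(folio);
                        return err ? err : -EAGAIN; /* retry on split folio */
                }
                return 0;
        }

With something like this, move_present_pte() and move_swap_pte() would each keep their original call sites, which seems closer to the "don't break the original structure" shape suggested above.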
On Fri, Feb 21, 2025 at 11:59 AM Peter Xu <peterx@redhat.com> wrote:
>
> [... full quote of the patch description, kernel BUG trace, and the
> preceding writeback/ksm discussion snipped; it is identical to the
> quotes in the messages above ...]
>
> So I believe what I hit is at [1] it sees the index doesn't match, then it
> decided to allocate a new folio. In this case, it won't hit your BUG
> because it'll be "folio != swapcache" later, so it'll set up the
> folio->index for the new one, rather than hit the sanity check.

You're absolutely right. The problem goes beyond just crashes; we're
also dealing with CoW when KSM is enabled.
As long as we disable KSM (which is true for Android), or when we are
dealing with a large folio, ksm_might_need_to_copy() will not allocate
a new copy:

struct folio *ksm_might_need_to_copy(struct folio *folio,
                        struct vm_area_struct *vma, unsigned long addr)
{
        struct page *page = folio_page(folio, 0);
        struct anon_vma *anon_vma = folio_anon_vma(folio);
        struct folio *new_folio;

        if (folio_test_large(folio))
                return folio;
        ....
}

Thanks for your great findings! For the KSM-enabled, small-folio case,
it's pretty funny how UFFDIO_MOVE finally turns into a new allocation
and copy, somehow automatically falling back to "UFFDIO_COPY" :-)
It's amusing, but debugging it is fun.

I'll add your findings to the changelog when I formally send v2, after
gathering all the code refinement suggestions and implementing the
improvements.

> Do you know how your case got triggered, being able to bypass the above [1]
> which should check folio->index already?
>
> [... remaining quote snipped; identical to the message above ...]

Thanks
Barry
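For readers following along, the index comparison at [1] reduces to the relation below, simplified from linear_page_index() in include/linux/pagemap.h (the hugetlb special case is omitted here); the comment spells out why a swapcache folio moved without an index fixup fails this test at the destination address:

        /* Simplified from include/linux/pagemap.h; hugetlb case omitted. */
        static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
                                                unsigned long address)
        {
                return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
        }

        /*
         * UFFDIO_MOVE changes the faulting address from src_addr to
         * dst_addr, so the pgoff computed above changes too. A swapcache
         * folio whose folio->index was set for src_addr therefore no
         * longer matches at dst_addr unless the move also updates it,
         * which move_present_pte() does and move_swap_pte() currently
         * does not.
         */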
On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > 2. src_anon_vma and its lock – swapcache doesn't require it (folio is not mapped)
> >
> > Could you help explain what guarantees that the rmap walk cannot happen on a
> > swapcache page?
> >
> > I'm not familiar with this path, though at least I see damon can start a
> > rmap walk on PageAnon almost with no locking.. some explanations would be
> > appreciated.
>
> I am observing the following check in folio_referenced(), which the anon_vma
> lock was originally intended to protect:
>
>         if (!pra.mapcount)
>                 return 0;
>
> I assume all other rmap walks should do the same?

Yes, normally there'll be a folio_mapcount() check, however..

> [... folio_referenced() listing snipped; quoted in full in the message
> above ...]
>
> By the way, since the folio has been under reclamation in this case and
> isn't on the LRU, this should also prevent the rmap walk, right?

.. I'm not sure whether it's always working. The thing is, anon doesn't
even require the folio lock to be held during (1) checking the mapcount and
(2) doing the rmap walk, in all similar cases as above. I see nothing that
blocks a concurrent thread from zapping that last mapcount:

        thread 1                            thread 2
        --------                            --------
        [whatever scanner]
        check folio_mapcount(), non-zero
                                            zap the last map..
                                            then mapcount==0
        rmap_walk()

Not sure if I missed something.

The other thing is, IIUC a swapcache page can also have a chance to be
faulted in, but only on a read, not a write. I actually had a feeling that
your reproducer triggered that exact path, causing a read swap-in, reusing
the swapcache page, and hitting the sanity check there somehow (even if, as
mentioned in the other reply, I don't yet know why the 1st check didn't
seem to work.. as we do check folio->index twice..).

Said that, I'm not sure if the above concern will happen in this specific
case, as UFFDIO_MOVE is pretty special, in that we check the exclusive bit
first in the swp entry so we know it's definitely not mapped elsewhere;
meanwhile, if we hold the pgtable lock, maybe it can't get mapped back..
it is just still tricky; at least we do some dances all over releasing and
retaking locks.

We could either justify that it's safe, or it may still be ok and simpler
if we could take the anon_vma write lock, making sure nobody will be able
to read the folio->index when it's prone to an update.

Thanks,
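A minimal sketch of the write-lock option floated at the end of that message, assuming the swapcache path gains the same fixup as move_present_pte(). This is illustrative and untested, not a proposal from the thread: src_folio, dst_vma, and dst_addr are assumed to be available from the move_pages_pte() context, src_folio is assumed locked, and the refcount/retry handling is elided:

        /*
         * Illustrative only: serialize the folio->index/mapping update
         * against concurrent rmap walkers with the anon_vma write lock
         * (blocking, unlike the anon_vma_trylock_write() used on the
         * present-PTE path).
         */
        src_anon_vma = folio_get_anon_vma(src_folio);
        if (!src_anon_vma)
                return -EAGAIN;
        anon_vma_lock_write(src_anon_vma);
        folio_move_anon_rmap(src_folio, dst_vma);       /* needs locked anon folio */
        src_folio->index = linear_page_index(dst_vma, dst_addr);
        anon_vma_unlock_write(src_anon_vma);
        put_anon_vma(src_anon_vma);

Whether the blocking lock is acceptable here, versus justifying that the exclusive swap entry plus the pgtable lock already keep walkers away, is exactly the open question in the message above.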
On Fri, Feb 21, 2025 at 02:36:27PM +1300, Barry Song wrote:
>
> [... full quote of the patch description, kernel BUG trace, and the
> ksm_might_need_to_copy() discussion snipped; it is identical to the
> quotes in the messages above ...]
>
> You're absolutely right.
> The problem goes beyond just crashes; we're
> also dealing with CoW when KSM is enabled. As long as we disable
> KSM (which is true for Android), or when we are dealing with a large folio,
> ksm_might_need_to_copy() will not allocate a new copy:

Ah! That explains it..

> struct folio *ksm_might_need_to_copy(struct folio *folio,
>                         struct vm_area_struct *vma, unsigned long addr)
> {
>         struct page *page = folio_page(folio, 0);
>         struct anon_vma *anon_vma = folio_anon_vma(folio);
>         struct folio *new_folio;
>
>         if (folio_test_large(folio))
>                 return folio;
>         ....
> }
>
> Thanks for your great findings! For the KSM-enabled, small-folio case,
> it's pretty funny how UFFDIO_MOVE finally turns into a new allocation
> and copy, somehow automatically falling back to "UFFDIO_COPY" :-)
> It's amusing, but debugging it is fun.
>
> I'll add your findings to the changelog when I formally send v2, after
> gathering all the code refinement suggestions and implementing the
> improvements.

Thanks, that'll be helpful. I wanted to try with a !KSM build, but it's
pretty late today (and it's a company-wide PTO tomorrow..). Just in case
it's useful, this is the reproducer I mentioned that didn't yet trigger
when with KSM:

https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-move-bug.c

I'm not sure whether it'll also reproduce there, but there's a chance it
is a simpler reproducer.

> [... remaining quoted discussion snipped ...]
>
> Thanks
> Barry
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 867898c4e30b..34cf1c8c725d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
+#include "swap.h"
 
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
@@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
 {
+	struct folio *folio;
+	swp_entry_t entry;
+
 	if (!pte_swp_exclusive(orig_src_pte))
 		return -EBUSY;
 
+	entry = pte_to_swp_entry(orig_src_pte);
+	folio = filemap_get_folio(swap_address_space(entry),
+				  swap_cache_index(entry));
+	if (!IS_ERR(folio)) {
+		folio_put(folio);
+		return -EBUSY;
+	}
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
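For comparison with the -EBUSY quick fix above, a rough, untested sketch of the fuller handling discussed in the thread might look like the following. This is not part of the posted patch: it assumes dst_vma and dst_addr get plumbed into move_swap_pte(), punts the large-folio and writeback cases the thread flags as hard, and elides the anon_vma locking question raised above:

	entry = pte_to_swp_entry(orig_src_pte);
	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		if (!folio_trylock(folio)) {
			folio_put(folio);
			return -EAGAIN;		/* let the caller retry */
		}
		/* Punt the cases flagged as hard for a first version. */
		if (folio_test_large(folio) || folio_test_writeback(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			return -EBUSY;
		}
		/* Same fixup as move_present_pte(); folio is locked here. */
		folio_move_anon_rmap(folio, dst_vma);
		folio->index = linear_page_index(dst_vma, dst_addr);
		folio_unlock(folio);
		folio_put(folio);
	}

The point of the sketch is only that the swapcache folio, when present, needs the same anon_vma/index fixup as a mapped folio before the swap PTE is copied to dst_addr; the real v2 would still have to settle the locking and split questions debated above.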