Message ID | 20230111133351.807024-1-jannh@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/khugepaged: Fix ->anon_vma race | expand |
On Wed, Jan 11, 2023 at 5:33 AM Jann Horn <jannh@google.com> wrote: > > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > it to be locked. retract_page_tables() bails out if an ->anon_vma is > attached, but does this check before holding the mmap lock (as the comment > above the check explains). > > If we racily merge an existing ->anon_vma (shared with a child process) > from a neighboring VMA, subsequent rmap traversals on pages belonging to > the child will be able to see the page tables that we are concurrently > removing while assuming that nothing else can access them. > > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there > really is no concurrent page table access. > > Reported-by: Zach O'Keefe <zokeefe@google.com> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > Cc: stable@vger.kernel.org > Signed-off-by: Jann Horn <jannh@google.com> > --- > zokeefe@ pointed out to me that the current code (after my last round of patches) > can hit a lockdep assert by racing, and after staring at it a bit I've I'm supposed the lockdep is the one in collapse_and_free_pmd(). It is better to have the splat included in the commit log. > convinced myself that this is a real, preexisting bug. > (I haven't written a reproducer for it though. One way to hit it might be > something along the lines of: > > - set up a process A with a private-file-mapping VMA V1 > - let A fork() to create process B, thereby copying V1 in A to V1' in B > - let B extend the end of V1' > - let B put some anon pages into the extended part of V1' I don't quite get why we need this step. A cow fault on A's V1 isn't enough to have anon_vma for V1? This should not prevent V1 and V2 from sharing anon_vma. Did I miss something? > - let A map a new private-file-mapping VMA V2 directly behind V1, without > an anon_vma > [race begins here] > - in A's thread 1: begin retract_page_tables() on V2, run through first > ->anon_vma check > - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it > merges the anon_vma of V1 (which implies V1 and V2 must be mapping the > same file at compatible offsets) > - in B: trigger rmap traversal on anon page in V1' > > mm/khugepaged.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 5cb401aa2b9d..0bfed37f3a3b 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > * has higher cost too. It would also probably require locking > * the anon_vma. > */ > - if (vma->anon_vma) { > + if (READ_ONCE(vma->anon_vma)) { > result = SCAN_PAGE_ANON; > goto next; > } > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > result = SCAN_PTE_MAPPED_HUGEPAGE; > if ((cc->is_khugepaged || is_target) && > mmap_write_trylock(mm)) { > + /* > + * Re-check whether we have an ->anon_vma, because > + * collapse_and_free_pmd() requires that either no > + * ->anon_vma exists or the anon_vma is locked. > + * We already checked ->anon_vma above, but that check > + * is racy because ->anon_vma can be populated under the > + * mmap lock in read mode. > + */ > + if (vma->anon_vma) { > + result = SCAN_PAGE_ANON; > + goto unlock_next; > + } > /* > * When a vma is registered with uffd-wp, we can't > * recycle the pmd pgtable because there can be pte > > base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25 > -- > 2.39.0.314.g84b9a713c41-goog >
On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote: > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > it to be locked. retract_page_tables() bails out if an ->anon_vma is > attached, but does this check before holding the mmap lock (as the comment > above the check explains). > > If we racily merge an existing ->anon_vma (shared with a child process) > from a neighboring VMA, subsequent rmap traversals on pages belonging to > the child will be able to see the page tables that we are concurrently > removing while assuming that nothing else can access them. > > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there > really is no concurrent page table access. > > Reported-by: Zach O'Keefe <zokeefe@google.com> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > Cc: stable@vger.kernel.org > Signed-off-by: Jann Horn <jannh@google.com> > --- > zokeefe@ pointed out to me that the current code (after my last round of patches) > can hit a lockdep assert by racing, and after staring at it a bit I've > convinced myself that this is a real, preexisting bug. > (I haven't written a reproducer for it though. One way to hit it might be > something along the lines of: > > - set up a process A with a private-file-mapping VMA V1 > - let A fork() to create process B, thereby copying V1 in A to V1' in B > - let B extend the end of V1' > - let B put some anon pages into the extended part of V1' At this point V1' gets it's own ->anon_vma, not connected to V1, right? > - let A map a new private-file-mapping VMA V2 directly behind V1, without > an anon_vma > [race begins here] > - in A's thread 1: begin retract_page_tables() on V2, run through first > ->anon_vma check > - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it > merges the anon_vma of V1 (which implies V1 and V2 must be mapping the > same file at compatible offsets) > - in B: trigger rmap traversal on anon page in V1' I don't follow the race. rmap on V1' will not reach V1. > mm/khugepaged.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 5cb401aa2b9d..0bfed37f3a3b 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > * has higher cost too. It would also probably require locking > * the anon_vma. > */ > - if (vma->anon_vma) { > + if (READ_ONCE(vma->anon_vma)) { > result = SCAN_PAGE_ANON; > goto next; > } This makes perfect sense. At least for readability. But I think false-negative should not lead to bad results. > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > result = SCAN_PTE_MAPPED_HUGEPAGE; > if ((cc->is_khugepaged || is_target) && > mmap_write_trylock(mm)) { > + /* > + * Re-check whether we have an ->anon_vma, because > + * collapse_and_free_pmd() requires that either no > + * ->anon_vma exists or the anon_vma is locked. > + * We already checked ->anon_vma above, but that check > + * is racy because ->anon_vma can be populated under the > + * mmap lock in read mode. > + */ > + if (vma->anon_vma) { > + result = SCAN_PAGE_ANON; > + goto unlock_next; > + } This is totally wrong direction. Or I don't understand the race. At this point we already paid nearly all price of of pagetable retraction. I don't see any correctness reason to stop here, except for the assert. I think lockdep assert in collapse_and_free_pmd() is wrong and has to be dropped. > /* > * When a vma is registered with uffd-wp, we can't > * recycle the pmd pgtable because there can be pte > > base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25 > -- > 2.39.0.314.g84b9a713c41-goog >
On Thu, Jan 12, 2023 at 12:56 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote: > > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > > it to be locked. retract_page_tables() bails out if an ->anon_vma is > > attached, but does this check before holding the mmap lock (as the comment > > above the check explains). > > > > If we racily merge an existing ->anon_vma (shared with a child process) > > from a neighboring VMA, subsequent rmap traversals on pages belonging to > > the child will be able to see the page tables that we are concurrently > > removing while assuming that nothing else can access them. > > > > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there > > really is no concurrent page table access. > > > > Reported-by: Zach O'Keefe <zokeefe@google.com> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Cc: stable@vger.kernel.org > > Signed-off-by: Jann Horn <jannh@google.com> > > --- > > zokeefe@ pointed out to me that the current code (after my last round of patches) > > can hit a lockdep assert by racing, and after staring at it a bit I've > > convinced myself that this is a real, preexisting bug. > > (I haven't written a reproducer for it though. One way to hit it might be > > something along the lines of: > > > > - set up a process A with a private-file-mapping VMA V1 > > - let A fork() to create process B, thereby copying V1 in A to V1' in B > > - let B extend the end of V1' > > - let B put some anon pages into the extended part of V1' > > At this point V1' gets it's own ->anon_vma, not connected to V1, right? This is what I got confused too. > > > - let A map a new private-file-mapping VMA V2 directly behind V1, without > > an anon_vma > > [race begins here] > > - in A's thread 1: begin retract_page_tables() on V2, run through first > > ->anon_vma check > > - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it > > merges the anon_vma of V1 (which implies V1 and V2 must be mapping the > > same file at compatible offsets) > > - in B: trigger rmap traversal on anon page in V1' > > I don't follow the race. rmap on V1' will not reach V1. > > > mm/khugepaged.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 5cb401aa2b9d..0bfed37f3a3b 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > * has higher cost too. It would also probably require locking > > * the anon_vma. > > */ > > - if (vma->anon_vma) { > > + if (READ_ONCE(vma->anon_vma)) { > > result = SCAN_PAGE_ANON; > > goto next; > > } > > This makes perfect sense. At least for readability. But I think > false-negative should not lead to bad results. > > > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > result = SCAN_PTE_MAPPED_HUGEPAGE; > > if ((cc->is_khugepaged || is_target) && > > mmap_write_trylock(mm)) { > > + /* > > + * Re-check whether we have an ->anon_vma, because > > + * collapse_and_free_pmd() requires that either no > > + * ->anon_vma exists or the anon_vma is locked. > > + * We already checked ->anon_vma above, but that check > > + * is racy because ->anon_vma can be populated under the > > + * mmap lock in read mode. > > + */ > > + if (vma->anon_vma) { > > + result = SCAN_PAGE_ANON; > > + goto unlock_next; > > + } > > This is totally wrong direction. Or I don't understand the race. > > At this point we already paid nearly all price of of pagetable retraction. > I don't see any correctness reason to stop here, except for the assert. Isn't it possible that collapse_and_free_pmd() clear the pmd which may point to a PTE which maps the COW'ed anon page if this race happens? > > I think lockdep assert in collapse_and_free_pmd() is wrong and has to be > dropped. > > > /* > > * When a vma is registered with uffd-wp, we can't > > * recycle the pmd pgtable because there can be pte > > > > base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25 > > -- > > 2.39.0.314.g84b9a713c41-goog > > > > -- > Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Jan 12, 2023 at 10:12:53AM -0800, Yang Shi wrote: > > This is totally wrong direction. Or I don't understand the race. > > > > At this point we already paid nearly all price of of pagetable retraction. > > I don't see any correctness reason to stop here, except for the assert. > > Isn't it possible that collapse_and_free_pmd() clear the pmd which may > point to a PTE which maps the COW'ed anon page if this race happens? No. At this point we have huge page in the place in the page cache and it is locked. COW fault would serialize on the page lock.
On Thu, Jan 12, 2023 at 4:10 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Thu, Jan 12, 2023 at 10:12:53AM -0800, Yang Shi wrote: > > > This is totally wrong direction. Or I don't understand the race. > > > > > > At this point we already paid nearly all price of of pagetable retraction. > > > I don't see any correctness reason to stop here, except for the assert. > > > > Isn't it possible that collapse_and_free_pmd() clear the pmd which may > > point to a PTE which maps the COW'ed anon page if this race happens? > > No. At this point we have huge page in the place in the page cache and it > is locked. COW fault would serialize on the page lock. Aha, yeah, I missed the page lock. So IIUC even though anon_vma is set there should be not any COW'ed anon pages since cow fault would be blocked by the page lock. If so I tend to agree with you the lockdep assert seems unnecessary. > > -- > Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Jan 12, 2023 at 9:56 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote: > > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > > it to be locked. retract_page_tables() bails out if an ->anon_vma is > > attached, but does this check before holding the mmap lock (as the comment > > above the check explains). > > > > If we racily merge an existing ->anon_vma (shared with a child process) > > from a neighboring VMA, subsequent rmap traversals on pages belonging to > > the child will be able to see the page tables that we are concurrently > > removing while assuming that nothing else can access them. > > > > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there > > really is no concurrent page table access. > > > > Reported-by: Zach O'Keefe <zokeefe@google.com> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Cc: stable@vger.kernel.org > > Signed-off-by: Jann Horn <jannh@google.com> > > --- > > zokeefe@ pointed out to me that the current code (after my last round of patches) > > can hit a lockdep assert by racing, and after staring at it a bit I've > > convinced myself that this is a real, preexisting bug. > > (I haven't written a reproducer for it though. One way to hit it might be > > something along the lines of: > > > > - set up a process A with a private-file-mapping VMA V1 > > - let A fork() to create process B, thereby copying V1 in A to V1' in B > > - let B extend the end of V1' > > - let B put some anon pages into the extended part of V1' > > At this point V1' gets it's own ->anon_vma, not connected to V1, right? Yes, true, it doesn't work in a child process because reusable_anon_vma() refuses to reuse the anon_vma of an adjacent VMA if that VMA is attached to more than one VMA. So to actually make it work, you have to rearrange things a bit such that this happens in the parent process, so that prepare_anon_vma() will reuse a neighboring anon_vma via reusable_anon_vma(). Sorry, my description of the race was slightly off. See the attached code for how to actually do it. Basically the trick is to use mremap() to take one private file VMA with an anon_vma and turn it into two private file VMAs ("mapping1" and "mapping2" in my code) with the same anon_vma that map the same file range. Put an anonymous page into mapping1. Then you can replace the first 2 MiB of "mapping2" with another VMA that is not mergeable into mapping2, is anon_vma-compatible with mapping2, has no anon_vma, and looks THP-collapsible. And then you can, while khugepaged is in the race window, attach the new VMA to the existing anon_vma and trigger an rmap walk on the anonymous page. > > - let A map a new private-file-mapping VMA V2 directly behind V1, without > > an anon_vma > > [race begins here] > > - in A's thread 1: begin retract_page_tables() on V2, run through first > > ->anon_vma check > > - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it > > merges the anon_vma of V1 (which implies V1 and V2 must be mapping the > > same file at compatible offsets) > > - in B: trigger rmap traversal on anon page in V1' > > I don't follow the race. rmap on V1' will not reach V1. > > > mm/khugepaged.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 5cb401aa2b9d..0bfed37f3a3b 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > * has higher cost too. It would also probably require locking > > * the anon_vma. > > */ > > - if (vma->anon_vma) { > > + if (READ_ONCE(vma->anon_vma)) { > > result = SCAN_PAGE_ANON; > > goto next; > > } > > This makes perfect sense. At least for readability. But I think > false-negative should not lead to bad results. > > > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > result = SCAN_PTE_MAPPED_HUGEPAGE; > > if ((cc->is_khugepaged || is_target) && > > mmap_write_trylock(mm)) { > > + /* > > + * Re-check whether we have an ->anon_vma, because > > + * collapse_and_free_pmd() requires that either no > > + * ->anon_vma exists or the anon_vma is locked. > > + * We already checked ->anon_vma above, but that check > > + * is racy because ->anon_vma can be populated under the > > + * mmap lock in read mode. > > + */ > > + if (vma->anon_vma) { > > + result = SCAN_PAGE_ANON; > > + goto unlock_next; > > + } > > This is totally wrong direction. Or I don't understand the race. > > At this point we already paid nearly all price of of pagetable retraction. > I don't see any correctness reason to stop here, except for the assert. This condition can only be hit with a narrow race, since vma->anon_vma was NULL a moment ago. So this branch will only run very rarely, which means it is unlikely to have an effect on performance. > I think lockdep assert in collapse_and_free_pmd() is wrong and has to be > dropped. No, that lockdep assert has to be there. Page table traversal is allowed under any one of the mmap lock, the anon_vma lock (if the VMA is associated with an anon_vma), and the mapping lock (if the VMA is associated with a mapping); and so to be able to remove page tables, we must hold all three of them. Even if we couldn't figure out a way to actually exploit a lack of locking here, we still shouldn't be skipping locking because of that. Unless we have clearly defined API contracts such as "you are allowed to traverse page tables if you hold one of these three locks", it is impossible to maintain code or to make it robust and secure. Even if you could show that the lack of locking can't lead to a crash in the current kernel, that still wouldn't be a good reason to skip the locking. Without something at least resembling an API contract, an innocent-looking change in another part of the kernel can turn things in a completely different part of the kernel into dangerous bugs, and it is impossible to change anything while feeling confident that you haven't broken something halfway across the kernel. (Or at least that's what it feels like to me when I try to edit kernel code.) Anyway, here are proper instructions for reproducing the bug. First build the kernel with the attached patch HACK-thp-anonvma-logging-and-delay.diff, which adds a bunch of debug logging and injects extra timing delays to make the race much easier to hit. Then build with CONFIG_KASAN=y and CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y and lockdep, and run that kernel. Run "echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled" to enable shmem THP. Compile the attached reproducer thp_newbug.c ("gcc -o thp_newbug thp_newbug.c"). Let the reproducer run (it'll take a few seconds to run because it has to wait for khugepaged and stuff). Then you should see a lockdep assert failure followed by an ASAN splat due to page table UAF in dmesg: [ 1076.064348][ T116] khugepaged: retract_page_tables: retract_page_tables(), cc->is_khugepaged=1, target_mm=ffff88817b4acc80 [ 1076.066331][ T116] khugepaged: retract_page_tables: running on vma at 0x1fe000 in mm=ffff88817b4acc80 [ 1076.068464][ T116] khugepaged: retract_page_tables: about to call collapse_and_free_pmd() [ 1076.071078][ T116] khugepaged: retract_page_tables: done with collapse_and_free_pmd() [ 1076.072360][ T116] khugepaged: retract_page_tables: sleeping before mmap lock... [ 1076.135883][ T797] map_pte: executing under TEST1 pvmw->vma->vm_mm==ffff88817b4acc80 pvmw->vma->vm_start=0x23ff000 [ 1076.147212][ T797] map_pte: ... begin delay... [ 1079.074924][ T116] khugepaged: retract_page_tables: done sleeping [ 1079.075874][ T116] khugepaged: retract_page_tables: about to call collapse_and_free_pmd() [ 1079.076964][ T116] ------------[ cut here ]------------ [ 1079.077663][ T116] WARNING: CPU: 14 PID: 116 at mm/khugepaged.c:1406 collapse_and_free_pmd+0x364/0x420 [ 1079.078940][ T116] CPU: 14 PID: 116 Comm: khugepaged Not tainted 6.2.0-rc3-00060-gc757fc92a3f7-dirty #257 [ 1079.080191][ T116] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 1079.081462][ T116] RIP: 0010:collapse_and_free_pmd+0x364/0x420 [ 1079.082290][ T116] Code: ae 7e e6 03 85 d2 0f 84 b3 fd ff ff 4c 89 ff e8 52 86 fe ff 49 8b 3f 31 f6 48 83 c7 78 e8 34 56 e3 01 85 c0 0f 85 95 fd ff ff <0f> 0b e9 8e fd ff ff e8 d0 13 e8 ff ba ff ff ff ff be 26 00 00 00 [ 1079.084735][ T116] RSP: 0018:ffffc900009b79b0 EFLAGS: 00010246 [ 1079.085527][ T116] RAX: 0000000000000000 RBX: ffff88817b4acc80 RCX: 0000000000000001 [ 1079.089956][ T116] RDX: 0000000000000000 RSI: ffffffffa5cab020 RDI: ffffffffa5e382e0 [ 1079.090967][ T116] RBP: 1ffff92000136f36 R08: 0000000000000000 R09: ffff88817b4acd6f [ 1079.091942][ T116] R10: ffffed102f6959ad R11: 0000000000000001 R12: ffff888117b0a3e8 [ 1079.092953][ T116] R13: 0000000002400000 R14: ffff888171ad1090 R15: ffff8881149b1100 [ 1079.094066][ T116] FS: 0000000000000000(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000 [ 1079.095387][ T116] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1079.096379][ T116] CR2: 000055c6faaf9578 CR3: 0000000111738005 CR4: 0000000000370ee0 [ 1079.097695][ T116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1079.098909][ T116] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1079.100117][ T116] Call Trace: [ 1079.100668][ T116] <TASK> [...] [ 1079.106003][ T116] collapse_file.cold+0x3a4/0x400 [...] [ 1079.110117][ T116] hpage_collapse_scan_file+0x72a/0xad0 [ 1079.110856][ T116] ? lock_release+0x22e/0x460 [ 1079.111483][ T116] ? __pfx_hpage_collapse_scan_file+0x10/0x10 [ 1079.112289][ T116] khugepaged+0x66b/0xbf0 [...] [ 1079.118911][ T116] kthread+0x17e/0x1b0 [ 1079.119440][ T116] ? __pfx_kthread+0x10/0x10 [ 1079.120012][ T116] ret_from_fork+0x2c/0x50 [ 1079.120593][ T116] </TASK> [ 1079.120971][ T116] irq event stamp: 12259 [ 1079.121520][ T116] hardirqs last enabled at (12271): [<ffffffffa388204e>] __up_console_sem+0x5e/0x70 [ 1079.122735][ T116] hardirqs last disabled at (12282): [<ffffffffa3882033>] __up_console_sem+0x43/0x70 [ 1079.123950][ T116] softirqs last enabled at (12018): [<ffffffffa37cd108>] __irq_exit_rcu+0xd8/0x140 [ 1079.125151][ T116] softirqs last disabled at (12013): [<ffffffffa37cd108>] __irq_exit_rcu+0xd8/0x140 [ 1079.126342][ T116] ---[ end trace 0000000000000000 ]--- [ 1079.140447][ T116] khugepaged: retract_page_tables: done with collapse_and_free_pmd() [ 1079.143002][ T116] khugepaged: retract_page_tables: skipping vma at 0x1400000 in mm=ffff88817b4acc80 because anon_vma [ 1079.146490][ T116] khugepaged: retract_page_tables: skipping vma at 0x1400000 in mm=ffff88817b4a9980 because anon_vma [ 1079.159601][ T116] khugepaged: retract_page_tables: retract_page_tables(), cc->is_khugepaged=1, target_mm=ffff88817b4acc80 [ 1079.161186][ T116] khugepaged: retract_page_tables: skipping vma at 0x1400000 in mm=ffff88817b4acc80 because anon_vma [ 1079.162648][ T116] khugepaged: retract_page_tables: skipping vma at 0x1400000 in mm=ffff88817b4a9980 because anon_vma [ 1079.164000][ T116] khugepaged: retract_page_tables: skipping vma at 0x2600000 in mm=ffff88817b4acc80 because anon_vma [ 1079.649601][ T797] map_pte: ... end delay [ 1079.650358][ T797] ================================================================== [ 1079.651257][ T797] BUG: KASAN: use-after-free in page_vma_mapped_walk+0x5f3/0xdf0 [ 1079.652120][ T797] Read of size 8 at addr ffff888171e54000 by task TEST1/797 [ 1079.652915][ T797] [ 1079.653179][ T797] CPU: 13 PID: 797 Comm: TEST1 Tainted: G W 6.2.0-rc3-00060-gc757fc92a3f7-dirty #257 [ 1079.654410][ T797] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 1079.655534][ T797] Call Trace: [ 1079.655907][ T797] <TASK> [ 1079.656268][ T797] dump_stack_lvl+0x5b/0x77 [ 1079.656812][ T797] print_report+0x168/0x458 [...] [ 1079.659003][ T797] kasan_report+0xbb/0xf0 [...] [ 1079.662829][ T797] page_vma_mapped_walk+0x5f3/0xdf0 [ 1079.663474][ T797] folio_referenced_one+0x1b3/0x3e0 [...] [ 1079.667670][ T797] rmap_walk_anon+0x137/0x2b0 [ 1079.668252][ T797] folio_referenced+0x298/0x2c0 [...] [ 1079.673519][ T797] shrink_folio_list+0xd30/0x1660 [...] [ 1079.679030][ T797] reclaim_folio_list.constprop.0+0xcf/0x200 [...] [ 1079.681772][ T797] reclaim_pages+0x154/0x1c0 [...] [ 1079.683727][ T797] madvise_cold_or_pageout_pte_range+0xf64/0x1900 [...] [ 1079.686125][ T797] walk_pgd_range+0x68a/0xcb0 [...] [ 1079.687393][ T797] __walk_page_range+0x25d/0x260 [...] [ 1079.689327][ T797] walk_page_range+0x253/0x2b0 [...] [ 1079.691922][ T797] madvise_pageout+0x1ab/0x2a0 [...] [ 1079.694355][ T797] do_madvise.part.0+0x3a3/0x1370 [...] [ 1079.701143][ T797] __x64_sys_madvise+0x7b/0xa0 [ 1079.701719][ T797] do_syscall_64+0x3a/0x90 [ 1079.702207][ T797] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1079.702863][ T797] RIP: 0033:0x7f18ae2dfd07 [ 1079.703352][ T797] Code: ff ff ff ff c3 48 8b 15 87 61 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 90 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 61 0c 00 f7 d8 64 89 01 48 [ 1079.705469][ T797] RSP: 002b:00007fffaaf31198 EFLAGS: 00000206 ORIG_RAX: 000000000000001c [ 1079.706382][ T797] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f18ae2dfd07 [ 1079.707244][ T797] RDX: 0000000000000015 RSI: 0000000000001000 RDI: 0000000001400000 [ 1079.710207][ T797] RBP: 00007fffaaf322c0 R08: 00007f18ae3a8800 R09: 00007f18ae3ad540 [ 1079.711092][ T797] R10: 00007f18ae2d6274 R11: 0000000000000206 R12: 00005577fac45220 [ 1079.712021][ T797] R13: 00007fffaaf323a0 R14: 0000000000000000 R15: 0000000000000000 [ 1079.712883][ T797] </TASK> [ 1079.713219][ T797] [ 1079.713495][ T797] The buggy address belongs to the physical page: [ 1079.714273][ T797] page:000000009614ae4b refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x171e54 [ 1079.715595][ T797] flags: 0x8000000000000000(zone=2) [ 1079.716275][ T797] raw: 8000000000000000 ffffea0004562e48 ffff8881f713d810 0000000000000000 [ 1079.717389][ T797] raw: 0000000000000000 ffff8881183f50c0 00000000ffffffff 0000000000000000 [ 1079.718508][ T797] page dumped because: kasan: bad access detected [ 1079.719358][ T797] page_owner tracks the page as freed [ 1079.720053][ T797] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 796, tgid 796 (thp_newbug), ts 1065542415232, free_ts 1079140429026 [ 1079.722276][ T797] get_page_from_freelist+0xb0a/0x1300 [ 1079.722872][ T797] __alloc_pages+0x17e/0x3c0 [ 1079.723376][ T797] pte_alloc_one+0x18/0xe0 [ 1079.723885][ T797] __handle_mm_fault+0x15fa/0x1660 [ 1079.724474][ T797] handle_mm_fault+0x18c/0x4e0 [ 1079.724995][ T797] do_user_addr_fault+0x232/0x7f0 [ 1079.725551][ T797] exc_page_fault+0x5d/0xe0 [ 1079.726045][ T797] asm_exc_page_fault+0x22/0x30 [ 1079.726577][ T797] page last free stack trace: [ 1079.727089][ T797] free_pcp_prepare+0x276/0x530 [ 1079.727680][ T797] free_unref_page+0x37/0x250 [ 1079.728193][ T797] collapse_and_free_pmd+0x2a3/0x420 [ 1079.728773][ T797] collapse_file.cold+0x3a4/0x400 [ 1079.729321][ T797] hpage_collapse_scan_file+0x72a/0xad0 [ 1079.729939][ T797] khugepaged+0x66b/0xbf0 [ 1079.730426][ T797] kthread+0x17e/0x1b0 [ 1079.730874][ T797] ret_from_fork+0x2c/0x50 [ 1079.731393][ T797] [ 1079.731674][ T797] Memory state around the buggy address: [ 1079.732284][ T797] ffff888171e53f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 1079.733154][ T797] ffff888171e53f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 1079.734032][ T797] >ffff888171e54000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 1079.734932][ T797] ^ [ 1079.735380][ T797] ffff888171e54080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 1079.736249][ T797] ffff888171e54100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 1079.737119][ T797] ================================================================== [ 1079.738129][ T797] Disabling lock debugging due to kernel taint [ 1079.738897][ T797] map_pte: executing under TEST1 pvmw->vma->vm_mm==ffff88817b4acc80 pvmw->vma->vm_start=0x1400000 [ 1079.744091][ T797] map_pte: ... continuing [ 1079.744621][ T797] map_pte: executing under TEST1 pvmw->vma->vm_mm==ffff88817b4a9980 pvmw->vma->vm_start=0x1400000 [ 1079.745802][ T797] map_pte: ... continuing Here's the corresponding reproducer output (including dumps of the anon_vma hierarchies attached to the VMAs, thanks to the extra debug info I added to procfs in the attached diff): user@vm:~/thp/thp_newbug$ ./thp_newbug Rss = 2048 kB [...] Rss = 2048 kB Rss = 2048 kB Rss = 2048 kB Rss = 2048 kB Rss = 0 kB khugepaged is about to write_trylock! associating anon_vma... anon_vma associated, telling child to hit rmap now... child hitting rmap... child done with rmap =========== PARENT =========== 001fe000-00400000 r--p 003fe000 00:01 13313 /memfd:memfd (deleted) Size: 2056 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 0 kB Pss: 0 kB Pss_Dirty: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 VmFlags: rd mr mw me dc hg 01400000-01800000 rw-p 00400000 00:01 13313 /memfd:memfd (deleted) Size: 4096 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 0 kB Pss: 0 kB Pss_Dirty: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 VmFlags: rd wr mr mw me ac AnonVMAs: # [ 0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4 num_children=2 num_active_vmas=3 parent=ffff8881149b1100 023ff000-02600000 rwxp 003ff000 00:01 13313 /memfd:memfd (deleted) Size: 2052 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 4 kB Pss: 4 kB Pss_Dirty: 4 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 4 kB Referenced: 4 kB Anonymous: 4 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 VmFlags: rd wr ex mr mw me dc ac AnonVMAs: # [ 0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4 num_children=2 num_active_vmas=3 parent=ffff8881149b1100 02600000-02800000 rw-p 00600000 00:01 13313 /memfd:memfd (deleted) Size: 2048 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 0 kB Pss: 0 kB Pss_Dirty: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 VmFlags: rd wr mr mw me dc ac AnonVMAs: # [ 0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4 num_children=2 num_active_vmas=3 parent=ffff8881149b1100 [...] =========== CHILD =========== 01400000-01800000 rw-p 00400000 00:01 13313 /memfd:memfd (deleted) Size: 4096 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 4 kB Pss: 4 kB Pss_Dirty: 4 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 4 kB Referenced: 0 kB Anonymous: 4 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 VmFlags: rd wr mr mw me ac AnonVMAs: [ 0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4 num_children=2 num_active_vmas=3 parent=ffff8881149b1100 # [ 1] ffff8881149b0cc0 root=ffff8881149b1100 refcount=1 anon_vmas=1 num_children=0 num_active_vmas=1 parent=ffff8881149b1100 [...] user@vm:~/thp/thp_newbug$
On Thu, Jan 12, 2023 at 2:06 AM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, Jan 11, 2023 at 5:33 AM Jann Horn <jannh@google.com> wrote: > > > > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > > it to be locked. retract_page_tables() bails out if an ->anon_vma is > > attached, but does this check before holding the mmap lock (as the comment > > above the check explains). > > > > If we racily merge an existing ->anon_vma (shared with a child process) > > from a neighboring VMA, subsequent rmap traversals on pages belonging to > > the child will be able to see the page tables that we are concurrently > > removing while assuming that nothing else can access them. > > > > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there > > really is no concurrent page table access. > > > > Reported-by: Zach O'Keefe <zokeefe@google.com> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Cc: stable@vger.kernel.org > > Signed-off-by: Jann Horn <jannh@google.com> > > --- > > zokeefe@ pointed out to me that the current code (after my last round of patches) > > can hit a lockdep assert by racing, and after staring at it a bit I've > > I'm supposed the lockdep is the one in collapse_and_free_pmd(). It is > better to have the splat included in the commit log. I pasted the splat in https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/ - which part do you think should go into the commit log? Just the "WARNING: CPU: 14 PID: 116 at mm/khugepaged.c:1406 collapse_and_free_pmd+0x364/0x420"? Or the whole ASAN splat below the lockdep complaint with all three backtraces? > > convinced myself that this is a real, preexisting bug. > > (I haven't written a reproducer for it though. One way to hit it might be > > something along the lines of: > > > > - set up a process A with a private-file-mapping VMA V1 > > - let A fork() to create process B, thereby copying V1 in A to V1' in B > > - let B extend the end of V1' > > - let B put some anon pages into the extended part of V1' > > I don't quite get why we need this step. A cow fault on A's V1 isn't > enough to have anon_vma for V1? This should not prevent V1 and V2 from > sharing anon_vma. Did I miss something? You're right, these steps don't work.
On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: > No, that lockdep assert has to be there. Page table traversal is > allowed under any one of the mmap lock, the anon_vma lock (if the VMA > is associated with an anon_vma), and the mapping lock (if the VMA is > associated with a mapping); and so to be able to remove page tables, > we must hold all three of them. Okay, that's fair. I agree with the patch now. Maybe adjust the commit message a bit? Anyway: Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> BTW, I've noticied that you recently added tlb_remove_table_sync_one(). I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not good enough to serialize against GUP fast?
On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: > > No, that lockdep assert has to be there. Page table traversal is > > allowed under any one of the mmap lock, the anon_vma lock (if the VMA > > is associated with an anon_vma), and the mapping lock (if the VMA is > > associated with a mapping); and so to be able to remove page tables, > > we must hold all three of them. > > Okay, that's fair. I agree with the patch now. Maybe adjust the commit > message a bit? Just to make sure we're on the same page: Are you suggesting that I add this text? "Page table traversal is allowed under any one of the mmap lock, the anon_vma lock (if the VMA is associated with an anon_vma), and the mapping lock (if the VMA is associated with a mapping); and so to be able to remove page tables, we must hold all three of them." Or something else? > Anyway: > > Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> Thanks! > BTW, I've noticied that you recently added tlb_remove_table_sync_one(). > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not > good enough to serialize against GUP fast? If that sent an IPI, it would be good enough; but pmdp_collapse_flush() is not guaranteed to send an IPI. It does a TLB flush, but on some architectures (including arm64 and also virtualized x86), a remote TLB flush can be done without an IPI. For example, arm64 has some fancy hardware support for remote TLB invalidation without IPIs ("broadcast TLB invalidation"), and virtualized x86 has (depending on the hypervisor) things like TLB shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or TLB shootdown signalling for preempted CPUs through shared memory (under KVM, see kvm_flush_tlb_multi).
On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote: > On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: > > > No, that lockdep assert has to be there. Page table traversal is > > > allowed under any one of the mmap lock, the anon_vma lock (if the VMA > > > is associated with an anon_vma), and the mapping lock (if the VMA is > > > associated with a mapping); and so to be able to remove page tables, > > > we must hold all three of them. > > > > Okay, that's fair. I agree with the patch now. Maybe adjust the commit > > message a bit? > > Just to make sure we're on the same page: Are you suggesting that I > add this text? > "Page table traversal is allowed under any one of the mmap lock, the > anon_vma lock (if the VMA is associated with an anon_vma), and the > mapping lock (if the VMA is associated with a mapping); and so to be > able to remove page tables, we must hold all three of them." > Or something else? Looks good to me. > > Anyway: > > > > Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> > > Thanks! > > > BTW, I've noticied that you recently added tlb_remove_table_sync_one(). > > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not > > good enough to serialize against GUP fast? > > If that sent an IPI, it would be good enough; but > pmdp_collapse_flush() is not guaranteed to send an IPI. > It does a TLB flush, but on some architectures (including arm64 and > also virtualized x86), a remote TLB flush can be done without an IPI. > For example, arm64 has some fancy hardware support for remote TLB > invalidation without IPIs ("broadcast TLB invalidation"), and > virtualized x86 has (depending on the hypervisor) things like TLB > shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or > TLB shootdown signalling for preempted CPUs through shared memory > (under KVM, see kvm_flush_tlb_multi). I think such architectures must provide proper pmdp_collapse_flush() with the required serialization. Power and S390 already do that.
+cc mmu-gather maintainers On Mon, Jan 16, 2023 at 1:34 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote: > > On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > BTW, I've noticied that you recently added tlb_remove_table_sync_one(). > > > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not > > > good enough to serialize against GUP fast? > > > > If that sent an IPI, it would be good enough; but > > pmdp_collapse_flush() is not guaranteed to send an IPI. > > It does a TLB flush, but on some architectures (including arm64 and > > also virtualized x86), a remote TLB flush can be done without an IPI. > > For example, arm64 has some fancy hardware support for remote TLB > > invalidation without IPIs ("broadcast TLB invalidation"), and > > virtualized x86 has (depending on the hypervisor) things like TLB > > shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or > > TLB shootdown signalling for preempted CPUs through shared memory > > (under KVM, see kvm_flush_tlb_multi). > > I think such architectures must provide proper pmdp_collapse_flush() > with the required serialization. FWIW, the IPI that I added is not unconditional; tlb_remove_table_sync_one() is a no-op depending on CONFIG_MMU_GATHER_RCU_TABLE_FREE, which an architecture can use to signal that it uses "Semi RCU freeing of the page directories". The kernel has arch-independent support for these semantics in the normal TLB flushing code. But yeah, I guess you could move the tlb_remove_table_sync_one() calls into pmdp_collapse_flush() (including the generic version)? I'm CC-ing the mmu-gather maintainers in case they have an opinion. Anyway, I'm not going to do that refactor; feel free to do that if you want. > Power and S390 already do that. What's the call graph from pmdp_collapse_flush() to IPI on powerpc and s390?
On 16.01.23 13:34, Kirill A. Shutemov wrote: > On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote: >> On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: >>> On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: >>>> No, that lockdep assert has to be there. Page table traversal is >>>> allowed under any one of the mmap lock, the anon_vma lock (if the VMA >>>> is associated with an anon_vma), and the mapping lock (if the VMA is >>>> associated with a mapping); and so to be able to remove page tables, >>>> we must hold all three of them. >>> >>> Okay, that's fair. I agree with the patch now. Maybe adjust the commit >>> message a bit? >> >> Just to make sure we're on the same page: Are you suggesting that I >> add this text? >> "Page table traversal is allowed under any one of the mmap lock, the >> anon_vma lock (if the VMA is associated with an anon_vma), and the >> mapping lock (if the VMA is associated with a mapping); and so to be >> able to remove page tables, we must hold all three of them." >> Or something else? > > Looks good to me. > >>> Anyway: >>> >>> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> >> >> Thanks! >> >>> BTW, I've noticied that you recently added tlb_remove_table_sync_one(). >>> I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not >>> good enough to serialize against GUP fast? >> >> If that sent an IPI, it would be good enough; but >> pmdp_collapse_flush() is not guaranteed to send an IPI. >> It does a TLB flush, but on some architectures (including arm64 and >> also virtualized x86), a remote TLB flush can be done without an IPI. >> For example, arm64 has some fancy hardware support for remote TLB >> invalidation without IPIs ("broadcast TLB invalidation"), and >> virtualized x86 has (depending on the hypervisor) things like TLB >> shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or >> TLB shootdown signalling for preempted CPUs through shared memory >> (under KVM, see kvm_flush_tlb_multi). > > I think such architectures must provide proper pmdp_collapse_flush() > with the required serialization. Power and S390 already do that. > The plan is to eventually move away from (ab)using IPI to synchronize with GUP-fast. Moving further into that direction a is wrong. The flush was added as a quick fix for all architectures by Jann, until we can do better. Even for ppc64, see: commit bedf03416913d88c796288f9dca109a53608c745 Author: Yang Shi <shy828301@gmail.com> Date: Wed Sep 7 11:01:44 2022 -0700 powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will move to use RCU instead of disabling local interrupts in fast-GUP. Using an IPI is the old-styled way of serializing against fast-GUP although it still works as expected now. And fast-GUP now fixed the potential race with THP collapse by checking whether PMD is changed or not. So IPI broadcast in radix pmd collapse flush is not necessary anymore. But it is still needed for hash TLB.
On Mon, Jan 16, 2023 at 02:07:41PM +0100, David Hildenbrand wrote: > On 16.01.23 13:34, Kirill A. Shutemov wrote: > > On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote: > > > On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: > > > > > No, that lockdep assert has to be there. Page table traversal is > > > > > allowed under any one of the mmap lock, the anon_vma lock (if the VMA > > > > > is associated with an anon_vma), and the mapping lock (if the VMA is > > > > > associated with a mapping); and so to be able to remove page tables, > > > > > we must hold all three of them. > > > > > > > > Okay, that's fair. I agree with the patch now. Maybe adjust the commit > > > > message a bit? > > > > > > Just to make sure we're on the same page: Are you suggesting that I > > > add this text? > > > "Page table traversal is allowed under any one of the mmap lock, the > > > anon_vma lock (if the VMA is associated with an anon_vma), and the > > > mapping lock (if the VMA is associated with a mapping); and so to be > > > able to remove page tables, we must hold all three of them." > > > Or something else? > > > > Looks good to me. > > > > > > Anyway: > > > > > > > > Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> > > > > > > Thanks! > > > > > > > BTW, I've noticied that you recently added tlb_remove_table_sync_one(). > > > > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not > > > > good enough to serialize against GUP fast? > > > > > > If that sent an IPI, it would be good enough; but > > > pmdp_collapse_flush() is not guaranteed to send an IPI. > > > It does a TLB flush, but on some architectures (including arm64 and > > > also virtualized x86), a remote TLB flush can be done without an IPI. > > > For example, arm64 has some fancy hardware support for remote TLB > > > invalidation without IPIs ("broadcast TLB invalidation"), and > > > virtualized x86 has (depending on the hypervisor) things like TLB > > > shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or > > > TLB shootdown signalling for preempted CPUs through shared memory > > > (under KVM, see kvm_flush_tlb_multi). > > > > I think such architectures must provide proper pmdp_collapse_flush() > > with the required serialization. Power and S390 already do that. > > > > The plan is to eventually move away from (ab)using IPI to synchronize with > GUP-fast. Moving further into that direction a is wrong. > > The flush was added as a quick fix for all architectures by Jann, until > we can do better. > > Even for ppc64, see: > > commit bedf03416913d88c796288f9dca109a53608c745 > Author: Yang Shi <shy828301@gmail.com> > Date: Wed Sep 7 11:01:44 2022 -0700 > > powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush > The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will > move to use RCU instead of disabling local interrupts in fast-GUP. Using > an IPI is the old-styled way of serializing against fast-GUP although it > still works as expected now. > And fast-GUP now fixed the potential race with THP collapse by checking > whether PMD is changed or not. So IPI broadcast in radix pmd collapse > flush is not necessary anymore. But it is still needed for hash TLB. Okay. But I think tlb_remove_table_sync_one() belongs inside pmdp_collapse_flush(). Collapsing pmd table into huge page without serialization is a bug. They should not be separate.
On Sun, Jan 15, 2023 at 11:07 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: > > No, that lockdep assert has to be there. Page table traversal is > > allowed under any one of the mmap lock, the anon_vma lock (if the VMA > > is associated with an anon_vma), and the mapping lock (if the VMA is > > associated with a mapping); and so to be able to remove page tables, > > we must hold all three of them. > > Okay, that's fair. I agree with the patch now. Maybe adjust the commit > message a bit? Yeah, the explanation makes sense to me too. Reviewed-by: Yang Shi <shy828301@gmail.com> > > Anyway: > > Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> > > BTW, I've noticied that you recently added tlb_remove_table_sync_one(). > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not > good enough to serialize against GUP fast? > > -- > Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Jan 11, 2023 at 2:33 PM Jann Horn <jannh@google.com> wrote: > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > it to be locked. retract_page_tables() bails out if an ->anon_vma is > attached, but does this check before holding the mmap lock (as the comment > above the check explains). @akpm please replace the commit message with the following, and maybe also add a "Link:" entry pointing to https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/ for the reproducer. If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires it to be locked. Page table traversal is allowed under any one of the mmap lock, the anon_vma lock (if the VMA is associated with an anon_vma), and the mapping lock (if the VMA is associated with a mapping); and so to be able to remove page tables, we must hold all three of them. retract_page_tables() bails out if an ->anon_vma is attached, but does this check before holding the mmap lock (as the comment above the check explains). If we racily merge an existing ->anon_vma (shared with a child process) from a neighboring VMA, subsequent rmap traversals on pages belonging to the child will be able to see the page tables that we are concurrently removing while assuming that nothing else can access them. Repeat the ->anon_vma check once we hold the mmap lock to ensure that there really is no concurrent page table access. Hitting this bug causes a lockdep warning in collapse_and_free_pmd(), in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)". It can also lead to use-after-free access.
On Tue, 17 Jan 2023 20:12:49 +0100 Jann Horn <jannh@google.com> wrote: > On Wed, Jan 11, 2023 at 2:33 PM Jann Horn <jannh@google.com> wrote: > > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires > > it to be locked. retract_page_tables() bails out if an ->anon_vma is > > attached, but does this check before holding the mmap lock (as the comment > > above the check explains). > > @akpm please replace the commit message with the following, and maybe > also add a "Link:" entry pointing to > https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/ > for the reproducer. Done, thanks.
On 16.01.23 14:47, Kirill A. Shutemov wrote: > On Mon, Jan 16, 2023 at 02:07:41PM +0100, David Hildenbrand wrote: >> On 16.01.23 13:34, Kirill A. Shutemov wrote: >>> On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote: >>>> On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: >>>>> On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote: >>>>>> No, that lockdep assert has to be there. Page table traversal is >>>>>> allowed under any one of the mmap lock, the anon_vma lock (if the VMA >>>>>> is associated with an anon_vma), and the mapping lock (if the VMA is >>>>>> associated with a mapping); and so to be able to remove page tables, >>>>>> we must hold all three of them. >>>>> >>>>> Okay, that's fair. I agree with the patch now. Maybe adjust the commit >>>>> message a bit? >>>> >>>> Just to make sure we're on the same page: Are you suggesting that I >>>> add this text? >>>> "Page table traversal is allowed under any one of the mmap lock, the >>>> anon_vma lock (if the VMA is associated with an anon_vma), and the >>>> mapping lock (if the VMA is associated with a mapping); and so to be >>>> able to remove page tables, we must hold all three of them." >>>> Or something else? >>> >>> Looks good to me. >>> >>>>> Anyway: >>>>> >>>>> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> >>>> >>>> Thanks! >>>> >>>>> BTW, I've noticied that you recently added tlb_remove_table_sync_one(). >>>>> I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not >>>>> good enough to serialize against GUP fast? >>>> >>>> If that sent an IPI, it would be good enough; but >>>> pmdp_collapse_flush() is not guaranteed to send an IPI. >>>> It does a TLB flush, but on some architectures (including arm64 and >>>> also virtualized x86), a remote TLB flush can be done without an IPI. >>>> For example, arm64 has some fancy hardware support for remote TLB >>>> invalidation without IPIs ("broadcast TLB invalidation"), and >>>> virtualized x86 has (depending on the hypervisor) things like TLB >>>> shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or >>>> TLB shootdown signalling for preempted CPUs through shared memory >>>> (under KVM, see kvm_flush_tlb_multi). >>> >>> I think such architectures must provide proper pmdp_collapse_flush() >>> with the required serialization. Power and S390 already do that. >>> >> >> The plan is to eventually move away from (ab)using IPI to synchronize with >> GUP-fast. Moving further into that direction a is wrong. >> >> The flush was added as a quick fix for all architectures by Jann, until >> we can do better. >> >> Even for ppc64, see: >> >> commit bedf03416913d88c796288f9dca109a53608c745 >> Author: Yang Shi <shy828301@gmail.com> >> Date: Wed Sep 7 11:01:44 2022 -0700 >> >> powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush >> The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will >> move to use RCU instead of disabling local interrupts in fast-GUP. Using >> an IPI is the old-styled way of serializing against fast-GUP although it >> still works as expected now. >> And fast-GUP now fixed the potential race with THP collapse by checking >> whether PMD is changed or not. So IPI broadcast in radix pmd collapse >> flush is not necessary anymore. But it is still needed for hash TLB. > > Okay. But I think tlb_remove_table_sync_one() belongs inside > pmdp_collapse_flush(). Collapsing pmd table into huge page without > serialization is a bug. They should not be separate. Agreed. But I wonder if it should be moved into a generic pmdp_collapse_flush(), that calls an arch specific __pmdp_collapse_flush().
On Mon, Jan 23, 2023 at 12:07:05PM +0100, David Hildenbrand wrote: > > > commit bedf03416913d88c796288f9dca109a53608c745 > > > Author: Yang Shi <shy828301@gmail.com> > > > Date: Wed Sep 7 11:01:44 2022 -0700 > > > > > > powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush > > > The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will > > > move to use RCU instead of disabling local interrupts in fast-GUP. Using > > > an IPI is the old-styled way of serializing against fast-GUP although it > > > still works as expected now. > > > And fast-GUP now fixed the potential race with THP collapse by checking > > > whether PMD is changed or not. So IPI broadcast in radix pmd collapse > > > flush is not necessary anymore. But it is still needed for hash TLB. > > > > Okay. But I think tlb_remove_table_sync_one() belongs inside > > pmdp_collapse_flush(). Collapsing pmd table into huge page without > > serialization is a bug. They should not be separate. > > Agreed. But I wonder if it should be moved into a generic > pmdp_collapse_flush(), that calls an arch specific __pmdp_collapse_flush(). Um? There's no __pmdp_collapse_flush(). Architectural code overrides pmdp_collapse_flush() altogether. It has know how to serialize properly for the arch.
On 24.01.23 01:51, Kirill A. Shutemov wrote: > On Mon, Jan 23, 2023 at 12:07:05PM +0100, David Hildenbrand wrote: >>>> commit bedf03416913d88c796288f9dca109a53608c745 >>>> Author: Yang Shi <shy828301@gmail.com> >>>> Date: Wed Sep 7 11:01:44 2022 -0700 >>>> >>>> powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush >>>> The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will >>>> move to use RCU instead of disabling local interrupts in fast-GUP. Using >>>> an IPI is the old-styled way of serializing against fast-GUP although it >>>> still works as expected now. >>>> And fast-GUP now fixed the potential race with THP collapse by checking >>>> whether PMD is changed or not. So IPI broadcast in radix pmd collapse >>>> flush is not necessary anymore. But it is still needed for hash TLB. >>> >>> Okay. But I think tlb_remove_table_sync_one() belongs inside >>> pmdp_collapse_flush(). Collapsing pmd table into huge page without >>> serialization is a bug. They should not be separate. >> >> Agreed. But I wonder if it should be moved into a generic >> pmdp_collapse_flush(), that calls an arch specific __pmdp_collapse_flush(). > > Um? There's no __pmdp_collapse_flush(). Did I imply that there is already one? ;)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 5cb401aa2b9d..0bfed37f3a3b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, * has higher cost too. It would also probably require locking * the anon_vma. */ - if (vma->anon_vma) { + if (READ_ONCE(vma->anon_vma)) { result = SCAN_PAGE_ANON; goto next; } @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, result = SCAN_PTE_MAPPED_HUGEPAGE; if ((cc->is_khugepaged || is_target) && mmap_write_trylock(mm)) { + /* + * Re-check whether we have an ->anon_vma, because + * collapse_and_free_pmd() requires that either no + * ->anon_vma exists or the anon_vma is locked. + * We already checked ->anon_vma above, but that check + * is racy because ->anon_vma can be populated under the + * mmap lock in read mode. + */ + if (vma->anon_vma) { + result = SCAN_PAGE_ANON; + goto unlock_next; + } /* * When a vma is registered with uffd-wp, we can't * recycle the pmd pgtable because there can be pte
If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires it to be locked. retract_page_tables() bails out if an ->anon_vma is attached, but does this check before holding the mmap lock (as the comment above the check explains). If we racily merge an existing ->anon_vma (shared with a child process) from a neighboring VMA, subsequent rmap traversals on pages belonging to the child will be able to see the page tables that we are concurrently removing while assuming that nothing else can access them. Repeat the ->anon_vma check once we hold the mmap lock to ensure that there really is no concurrent page table access. Reported-by: Zach O'Keefe <zokeefe@google.com> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> --- zokeefe@ pointed out to me that the current code (after my last round of patches) can hit a lockdep assert by racing, and after staring at it a bit I've convinced myself that this is a real, preexisting bug. (I haven't written a reproducer for it though. One way to hit it might be something along the lines of: - set up a process A with a private-file-mapping VMA V1 - let A fork() to create process B, thereby copying V1 in A to V1' in B - let B extend the end of V1' - let B put some anon pages into the extended part of V1' - let A map a new private-file-mapping VMA V2 directly behind V1, without an anon_vma [race begins here] - in A's thread 1: begin retract_page_tables() on V2, run through first ->anon_vma check - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it merges the anon_vma of V1 (which implies V1 and V2 must be mapping the same file at compatible offsets) - in B: trigger rmap traversal on anon page in V1' mm/khugepaged.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25