Message ID | 20200410073209.11164-1-hdanton@sina.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | f45ec5ff16 ("userfaultfd: wp: support swap and page migration"): [ 140.777858] BUG: Bad rss-counter state mm:b278fc66 type:MM_ANONPAGES val:1 | expand |
Hi, Hillf, On Fri, Apr 10, 2020 at 03:32:09PM +0800, Hillf Danton wrote: > > On Fri, 10 Apr 2020 08:25:18 +0800 > > Greetings, > > > > 0day kernel testing robot got the below dmesg and the first bad commit is > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > > > > commit f45ec5ff16a75f96dac8c89862d75f1d8739efd4 > > Author: Peter Xu <peterx@redhat.com> > > AuthorDate: Mon Apr 6 20:06:01 2020 -0700 > > Commit: Linus Torvalds <torvalds@linux-foundation.org> > > CommitDate: Tue Apr 7 10:43:39 2020 -0700 > > > > userfaultfd: wp: support swap and page migration > > > > For either swap and page migration, we all use the bit 2 of the entry to > > identify whether this entry is uffd write-protected. It plays a similar > > role as the existing soft dirty bit in swap entries but only for keeping > > the uffd-wp tracking for a specific PTE/PMD. > > > > Something special here is that when we want to recover the uffd-wp bit > > from a swap/migration entry to the PTE bit we'll also need to take care of > > the _PAGE_RW bit and make sure it's cleared, otherwise even with the > > _PAGE_UFFD_WP bit we can't trap it at all. > > > > In change_pte_range() we do nothing for uffd if the PTE is a swap entry. > > That can lead to data mismatch if the page that we are going to write > > protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch > > also applies/removes the uffd-wp bit even for the swap entries. > > > Have trouble understanding the last sentence in the paragraph above and > particularly linking it to the first one. This patch handles the swap entry tracking for userfault-wp. I wanted to express the fact that we didn't do that before. Sorry if my wording is confusing. > > > If you fix the issue, kindly add following tag > > Reported-by: kernel test robot <lkp@intel.com> > > > > [child3:925] eventfd (323) returned ENOSYS, marking as inactive. > > [ 132.014801] can: request_module (can-proto-2) failed. > > [ 132.063717] can: request_module (can-proto-2) failed. > > [ 137.186037] trinity-c2 (943) used greatest stack depth: 5804 bytes left > > [ 140.771486] MCE: Killing trinity-c2:956 due to hardware memory corruption fault at 8bd2060 > > [ 140.777858] BUG: Bad rss-counter state mm:b278fc66 type:MM_ANONPAGES val:1 > > [ 140.778736] BUG: Bad rss-counter state mm:b278fc66 type:MM_SHMEMPAGES val:2 > > [ 141.589424] MCE: Killing trinity-c3:940 due to hardware memory corruption fault at 8a8c860 > > [ 141.590730] swap_info_get: Bad swap file entry 700b8216 Should be a 32bit guest, swap type == 11100b. I guess this also means MAX_SWAPFILES is bigger than 11100b. > > [ 141.591400] BUG: Bad page map in process trinity-c3 pte:17042c3c pmd:b1809067 > > [ 141.592304] addr:08bcf000 vm_flags:00100073 anon_vma:f1f29528 mapping:00000000 index:8bcf > > [ 141.593399] file:(null) fault:0x0 mmap:0x0 readpage:0x0 > > [ 141.594065] CPU: 0 PID: 940 Comm: trinity-c3 Not tainted 5.6.0-11490-gf45ec5ff16a75 #1 > > [ 141.595055] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > [ 141.596093] Call Trace: > > [ 141.596443] dump_stack+0x16/0x18 > > [ 141.596868] print_bad_pte+0x13f/0x159 > > [ 141.597367] unmap_page_range+0x2a7/0x3e7 > > [ 141.597893] unmap_single_vma+0x53/0x5d > > [ 141.598383] unmap_vmas+0x2c/0x3b > > [ 141.598811] exit_mmap+0x81/0xfc > > [ 141.599238] __mmput+0x25/0x8d > > [ 141.599633] mmput+0x28/0x2b > > [ 141.600007] do_exit+0x2f0/0x84a > > [ 141.600449] ? ___might_sleep+0x3f/0x11f > > [ 141.600949] do_group_exit+0x86/0x86 > > [ 141.601421] __ia32_sys_exit_group+0x15/0x15 > > [ 141.601965] do_fast_syscall_32+0x86/0xbf > > [ 141.602481] entry_SYSENTER_32+0xaf/0x101 > > > Because is_swap_pte(oldpte) != IS_ENABLED(CONFIG_MIGRATION)), restore the > old behavior by modifying uffd_wp only for pte that is __not__ swap entry, > as the commit log says. > > --- b/mm/mprotect.c > +++ c/mm/mprotect.c > @@ -139,7 +139,7 @@ static unsigned long change_pte_range(st > } > ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); > pages++; > - } else if (is_swap_pte(oldpte)) { > + } else if (IS_ENABLED(CONFIG_MIGRATION)) { > swp_entry_t entry = pte_to_swp_entry(oldpte); > pte_t newpte; > > @@ -154,7 +154,9 @@ static unsigned long change_pte_range(st > newpte = pte_swp_mksoft_dirty(newpte); > if (pte_swp_uffd_wp(oldpte)) > newpte = pte_swp_mkuffd_wp(newpte); > - } else if (is_write_device_private_entry(entry)) { > + } > + > + if (is_write_device_private_entry(entry)) { > /* > * We do not preserve soft-dirtiness. See > * copy_one_pte() for explanation. > @@ -163,11 +165,18 @@ static unsigned long change_pte_range(st > newpte = swp_entry_to_pte(entry); > if (pte_swp_uffd_wp(oldpte)) > newpte = pte_swp_mkuffd_wp(newpte); > - } else { > - newpte = oldpte; > } > > - if (uffd_wp) > + /* > + * do nothing for changing uffd_wp if oldpte is a > + * swap entry. > + * That can lead to data mismatch if the page we > + * are going to write protect is swapped out when > + * sending the UFFDIO_WRITEPROTECT. > + */ > + if (is_swap_pte(oldpte)) > + newpte = oldpte; > + else if (uffd_wp) > newpte = pte_swp_mkuffd_wp(newpte); > else if (uffd_wp_resolve) > newpte = pte_swp_clear_uffd_wp(newpte); I'm not sure this is correct. As I mentioned, the commit wanted to apply the uffd-wp bit even for the swap entries so that even the swap entries got swapped in, the page will still be write protected. So IIUC think we can't remove that. Above report happened at __swap_info_get() where we should have triggered type >= READ_ONCE(nr_swapfiles) in swap_type_to_swap_info(), which, iiuc, means the type is considered wrong. The mistery to me is how did the swap type went wrong. E.g., iiuc the "is_swap_pte(oldpte)" replacement to "IS_ENABLED(CONFIG_MIGRATION)" shouldn't affect this, because even if MIGRATION is not enabled, both is_write_migration_entry() and is_write_device_private_entry() should return constant zero after all, so they shouldn't be thouching the swap type. I'm still trying to digest on what's happened... It would be good too if more information on the test could be given, e.g., what is the behavior of trinity-c2. A reproducer is of course even better. Thanks,
On Fri, Apr 10, 2020 at 11:32:34AM -0400, Peter Xu wrote: > I'm still trying to digest on what's happened... It would be good too > if more information on the test could be given, e.g., what is the > behavior of trinity-c2. A reproducer is of course even better. Trinity is a syscall fuzzer. Don't expect what it's doing to make any sense, it's just executing syscalls at random.
On Fri, Apr 10, 2020 at 08:38:05AM -0700, Matthew Wilcox wrote: > On Fri, Apr 10, 2020 at 11:32:34AM -0400, Peter Xu wrote: > > I'm still trying to digest on what's happened... It would be good too > > if more information on the test could be given, e.g., what is the > > behavior of trinity-c2. A reproducer is of course even better. > > Trinity is a syscall fuzzer. Don't expect what it's doing to make any > sense, it's just executing syscalls at random. OK thanks. Though I just noticed that the original report is actually with some attachments which I totally missed initially. There's the config file showing that we're with: CONFIG_MIGRATION=y CONFIG_MEMORY_FAILURE=y CONFIG_DEVICE_PRIVATE=n And even a reproducer. However the reproducer script will fail at wget, until I fixed it using: initrd=openwrt-trinity-i386.cgz to replace: initrd=openwrt-i386-trinity.cgz Then I can download the initrd and boot the VM with a decent QEMU. However I didn't see any test running after the VM booted, and it will reboot/shutdown after 100 sec without any error triggered (I believe the rc.local tries to run something under /etc/kernel-tests/ but I'm not sure it's running the right thing). If there's any way to reproduce (I believe so because it can even bisect in the original report, I just don't know how...), I'm thinking maybe we can try to dump every swp entry change that could have been touched in change_pte_range(), which is the only place that I thought could be related to this in the commit, to see whether there's anything suspecious. 8<---------------------------------------------- diff --git a/mm/mprotect.c b/mm/mprotect.c index 1d823b050329..1b6daf7d03aa 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -173,6 +173,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, newpte = pte_swp_clear_uffd_wp(newpte); if (!pte_same(oldpte, newpte)) { + pr_info("%s: Update swp entry, 0x%lx -> 0x%lx\n", + __func__, pte_val(oldpte), pte_val(newpte)); set_pte_at(vma->vm_mm, addr, pte, newpte); pages++; }
Hi Peter On Fri, 10 Apr 2020 11:32:34 -0400 Peter Xu wrote: > > I'm not sure this is correct. As I mentioned, the commit wanted to > apply the uffd-wp bit even for the swap entries so that even the swap > entries got swapped in, the page will still be write protected. So > IIUC think we can't remove that. > Thank you for explaining that we would flip-flop uffd-wp in f45ec5ff16. The input to test robot now looks like the diff below after restoring CONFIG_MIGRATION. --- b/mm/mprotect.c +++ c/mm/mprotect.c @@ -139,9 +139,9 @@ static unsigned long change_pte_range(st } ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); pages++; - } else if (is_swap_pte(oldpte)) { + } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte); - pte_t newpte; + pte_t newpte = oldpte; if (is_write_migration_entry(entry)) { /* @@ -154,7 +154,9 @@ static unsigned long change_pte_range(st newpte = pte_swp_mksoft_dirty(newpte); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); - } else if (is_write_device_private_entry(entry)) { + } + + if (is_write_device_private_entry(entry)) { /* * We do not preserve soft-dirtiness. See * copy_one_pte() for explanation. @@ -163,8 +165,6 @@ static unsigned long change_pte_range(st newpte = swp_entry_to_pte(entry); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); - } else { - newpte = oldpte; } if (uffd_wp)
On Fri, 10 Apr 2020 11:32:34 -0400 Peter Xu wrote: > > I'm not sure this is correct. As I mentioned, the commit wanted to > apply the uffd-wp bit even for the swap entries so that even the swap > entries got swapped in, the page will still be write protected. So > IIUC think we can't remove that. Yes you are right. Now both CONFIG_MIGRATION and swap entry are restored after making uffd_wq survive migrate the same way as soft_dirty. --- a/mm/migrate.c +++ b/mm/migrate.c @@ -236,6 +236,8 @@ static bool remove_migration_pte(struct pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot))); if (pte_swp_soft_dirty(*pvmw.pte)) pte = pte_mksoft_dirty(pte); + if (pte_swp_uffd_wp(*pvmw.pte)) + pte = pte_mkuffd_wp(pte); /* * Recheck VMA as permissions can change since migration started @@ -243,15 +245,11 @@ static bool remove_migration_pte(struct entry = pte_to_swp_entry(*pvmw.pte); if (is_write_migration_entry(entry)) pte = maybe_mkwrite(pte, vma); - else if (pte_swp_uffd_wp(*pvmw.pte)) - pte = pte_mkuffd_wp(pte); if (unlikely(is_zone_device_page(new))) { if (is_device_private_page(new)) { entry = make_device_private_entry(new, pte_write(pte)); pte = swp_entry_to_pte(entry); - if (pte_swp_uffd_wp(*pvmw.pte)) - pte = pte_mkuffd_wp(pte); } } --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -139,11 +139,13 @@ static unsigned long change_pte_range(st } ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); pages++; - } else if (is_swap_pte(oldpte)) { + } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte); pte_t newpte; - if (is_write_migration_entry(entry)) { + if (!non_swap_entry(entry)) { + newpte = oldpte; + } else if (is_write_migration_entry(entry)) { /* * A protection check is difficult so * just be safe and disable write @@ -164,7 +166,7 @@ static unsigned long change_pte_range(st if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); } else { - newpte = oldpte; + continue; } if (uffd_wp)
On Sun, Apr 12, 2020 at 08:54:08PM +0800, Hillf Danton wrote: > > On Fri, 10 Apr 2020 11:32:34 -0400 Peter Xu wrote: > > > > I'm not sure this is correct. As I mentioned, the commit wanted to > > apply the uffd-wp bit even for the swap entries so that even the swap > > entries got swapped in, the page will still be write protected. So > > IIUC think we can't remove that. > > Yes you are right. > > Now both CONFIG_MIGRATION and swap entry are restored after making uffd_wq > survive migrate the same way as soft_dirty. > > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -236,6 +236,8 @@ static bool remove_migration_pte(struct > pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot))); > if (pte_swp_soft_dirty(*pvmw.pte)) > pte = pte_mksoft_dirty(pte); > + if (pte_swp_uffd_wp(*pvmw.pte)) > + pte = pte_mkuffd_wp(pte); > > /* > * Recheck VMA as permissions can change since migration started > @@ -243,15 +245,11 @@ static bool remove_migration_pte(struct > entry = pte_to_swp_entry(*pvmw.pte); > if (is_write_migration_entry(entry)) > pte = maybe_mkwrite(pte, vma); > - else if (pte_swp_uffd_wp(*pvmw.pte)) > - pte = pte_mkuffd_wp(pte); > > if (unlikely(is_zone_device_page(new))) { > if (is_device_private_page(new)) { > entry = make_device_private_entry(new, pte_write(pte)); > pte = swp_entry_to_pte(entry); > - if (pte_swp_uffd_wp(*pvmw.pte)) > - pte = pte_mkuffd_wp(pte); > } > } > > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -139,11 +139,13 @@ static unsigned long change_pte_range(st > } > ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); > pages++; > - } else if (is_swap_pte(oldpte)) { > + } else if (IS_ENABLED(CONFIG_MIGRATION)) { > swp_entry_t entry = pte_to_swp_entry(oldpte); > pte_t newpte; > > - if (is_write_migration_entry(entry)) { > + if (!non_swap_entry(entry)) { > + newpte = oldpte; > + } else if (is_write_migration_entry(entry)) { > /* > * A protection check is difficult so > * just be safe and disable write > @@ -164,7 +166,7 @@ static unsigned long change_pte_range(st > if (pte_swp_uffd_wp(oldpte)) > newpte = pte_swp_mkuffd_wp(newpte); > } else { > - newpte = oldpte; > + continue; > } > > if (uffd_wp) > > Hi, Hillf, Feel free to have a look at another report, which I think is the same issue of this: https://lore.kernel.org/lkml/CA+G9fYsRGvkqtpdGv_aVr+Hn17KgYq04Q=EE=pB774qVxRqOeg@mail.gmail.com/ IMHO this bisected commit is correct itself, it's just that we shouldn't enable uffd-wp on 32bit system. Thanks,
--- b/mm/mprotect.c +++ c/mm/mprotect.c @@ -139,7 +139,7 @@ static unsigned long change_pte_range(st } ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); pages++; - } else if (is_swap_pte(oldpte)) { + } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte); pte_t newpte; @@ -154,7 +154,9 @@ static unsigned long change_pte_range(st newpte = pte_swp_mksoft_dirty(newpte); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); - } else if (is_write_device_private_entry(entry)) { + } + + if (is_write_device_private_entry(entry)) { /* * We do not preserve soft-dirtiness. See * copy_one_pte() for explanation. @@ -163,11 +165,18 @@ static unsigned long change_pte_range(st newpte = swp_entry_to_pte(entry); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); - } else { - newpte = oldpte; } - if (uffd_wp) + /* + * do nothing for changing uffd_wp if oldpte is a + * swap entry. + * That can lead to data mismatch if the page we + * are going to write protect is swapped out when + * sending the UFFDIO_WRITEPROTECT. + */ + if (is_swap_pte(oldpte)) + newpte = oldpte; + else if (uffd_wp) newpte = pte_swp_mkuffd_wp(newpte); else if (uffd_wp_resolve) newpte = pte_swp_clear_uffd_wp(newpte);